- I love my MacBook Pro M5 128GB RAM and I love qwen3.6.
BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.
Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.
If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.
Thank me later.
by bensyverson
30 subcomments
- The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
- None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
by doodlesdev
9 subcomments
- I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
- I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).
I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.
I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.
I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
by mashygpig
5 subcomments
- It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.
Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...
by cpburns2009
0 subcomment
- Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.
- I'm having a decently good time time with `qwen3.6-35b-a3b-mtp` (unsloth's multi-token prediction version) and and `qwen-agentworld-35b-a3b`.
On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s.
On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings.
Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked.
The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models.
I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.
by beastman82
3 subcomments
- FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
by 0x0000000
8 subcomments
- > ... on my Macbook Max M5 128 GB
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
- Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models.
If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?
[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol
- I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.
by starefossen
0 subcomment
- We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace
by mips_avatar
1 subcomments
- I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.
by jimmaswell
2 subcomments
- My partner has been trying various models on our server but we haven't gotten anything to run at a usable speed. Q30H engineering sample (Xeon 8570) with two cpus, 56 cores per CPU, 768GB DDR5 RAM running at 5600MHz, two old 3090s in it at the moment with an NVLink and we could put our third in there. We built this server before the prices skyrocketed because we happened across some Tyan boards on Woot that were absurdly cheap for what they are (the motherboards should be $1000+ but we got them for a few hundred).
This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine.
Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.
We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.
by rhgraysonii
2 subcomments
- I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
by androiddrew
1 subcomments
- Dual AMD Radeon AI Pro 9700s (600 watts total 64GB of vram) runs Qwen 3.6 27B at FP8 with mtp on vLLM at 50ish TPS for decode. Cards cost $1300 a piece. Enough KV cache to fully max out two concurrent sessions.
It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.
- Running 27B dense model on M5 128GB is ok, but one can do better.
On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.
27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
- I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.
It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.
I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.
These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
- Which one is actually better between Qwen and DeepSeek, and which one costs less?
by RedCinnabar
4 subcomments
- Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.
- > What it does:
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
by _tyiueojdfg4
0 subcomment
- My personal experience below:
I ran into some small problems with codex during setup and, for a few reasons, did not want to set up a cli shell with them at the time. Since I was not doing anything really serious, but just exploring a half-baked idea for an android app, I ran qwen in lms and connected it to android studio.
None of the mini projects that I have attempted ( more granular call control, silly html scrolling game, music play app ) were one shots despite carefully preparing the prompt ahead of time. Admittedly, some of it may have something to do with android studio, but I did not try it with google account yet. All took between an hour to four to generate ( prep, initial run, testing, iteration and so on ).
If it helps, miniforum AI MAX 395. I am not saying it is bad. Quite the opposite, but you want to be aware of the limitations though and plan around those.
by HotGarbage
1 subcomments
- And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
- I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
- Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."
72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.
That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)
- I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
by schmuhblaster
1 subcomments
- I've worked extensively with the slightly less able cousin, the 35B A3B model and tuned my own harness around making it work well with local or non-sota models. The results are quite promising [0], if one sticks to a plan-execute approach. After a bit of fiddling with llama.cpp I was able to get it to work through a small change on a real codebase from work on a 32GB M5 (typical python FastAPI backend, so nothing out of the ordinary). While that's somewhat encouraging, the whole local experience was still far from pleasant with all the noise and heat.
[0] https://deepclause.substack.com/p/how-to-make-small-models-p...
- i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.
don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
- Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?
by MangoCoffee
4 subcomments
- Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.
by Otternonsenz
6 subcomments
- Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
by decide1000
0 subcomment
- A lot of replies here are about Mac devices and their support for these 27B models. I own a MacBook but use a Lenovo Thinkstation PGX to run my models. It has a gb10 Blackwell gpu and 128gb unified memory. You can connect multiple ones.
- I have a fairly beefy M4/48G but I haven't been able to get any local model to behave anywhere near satisfactorily.
- I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.
Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
by hollowturtle
0 subcomment
- > Real work
Ok that's the part I'm interested in, don't care about minesweeper clones....
> Make a landing page selling candles for women that are into wellbeing and SPA.
can't be serious...
by fabijanbajo
0 subcomment
- We need machines designed around wide memory + sustained inference thermals, not gaming/creator chassis we're borrowing. Until then "local dev" means clamshell + external fans.
- When is Amazon Bedrock going to get these newer models?
Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.
by SamInTheShell
0 subcomment
- This is probably the first small model I got through some simple web game tests without having to reset the context. It tends to opt to overwrite an entire file instead of doing edits... which editing is where most of these small models fall apart along with getting stuck in repeating loops. Only 24k tokens in so far, it did some decent newbie work.
by prasanthabr
5 subcomments
- Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?
- On dual rtx3090 it runs at 140tok/s with a short prompt... Not bad.
Qwen 3.6 dense runs at 40tok/s
by mark_l_watson
1 subcomments
- I can come close to agreeing because queen-3.6-27b is my second favorite for local coding. I am using gemma4:26b-a4b-it-qat-48k (the "-48k" is from my modifying a model run with Ollama to always use a 48K context size). On a 32G Mac I use gemma4:26b-a4b-it-qat-48k and OpenCode and on my 16G MacBook Air I use gemma4:12b-it-qat-16k ("-16k" is my resizing context size) and little-coder. I break up projects into small libraries because local coding works better for me using small code bases.
I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.
To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.
- What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.
I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
- I've been using it with a couple of tools (like context7) as a documentation/helper, without giving it direct access to writing code, in marimo. it works great, albeit a little slow on my server (m1 max 64gb ram), at 8bit with omlx
by trey-jones
0 subcomment
- Qwen3.6 was the first model I ran locally that seemed smart, but qwen3-coder:30b is way, way more responsive and more accurate for writing code according to my tests, including human-eval. If you can run one than you can almost certainly run the other. If you haven't tried qwen3-coder I would definitely recommend it. It feels positively snappy compared to every other local model I've tried. All you need is 32G VRAM and some heat dissipation.
- I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.
https://pi-local-coding-bench.dev
by SkitterKherpi
0 subcomment
- 27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.
by simplyluke
1 subcomments
- The open source models have gotten heavily conflated with local development. While that is cool and I'm excited about the future of local LLMs, it is not necessary to play around with these models. Without shilling for companies I don't have a relationship with, there are a number of companies who will give you an API just like Anthropic/OpenAI and you pay per token, albeit much cheaper than the frontier labs.
I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.
by PeterStuer
0 subcomment
- Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. Bit while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.
- What model fits on 36GB RAM mac?
by meta-level
1 subcomments
- why does everyone imply you need a $10k laptop which then starts burning when you run Qwen 3.6? Get any other system with enough VRAM for a third of the price. Framework Desktop (Strix Halo 128GB) still costs under 4k nowadays, is nearly silent even on 100% GPU + CPU. (also it gets only slightly 'warm', but with a desktop you don't care anyway, I guess).
- I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
by zedascouves
0 subcomment
- Just tried on some arduino code. after 10 minutes i got a list of improvements to my code.
I ran those throu opus saking if it was good advice and was not impressed:
I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like
generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already
in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
- How you can do dev in 2026 using 64k context and without sub agents?
The benchmark seemed fine until I saw that.
If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.
If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
by diseasedyak
2 subcomments
- I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.
For anything else local, including writing some automation scripts and such, it works great.
by letmetweakit
0 subcomment
- Any chance to run this on a RTX 3090 and 64GB of regular RAM with decent context size?
by christoff12
1 subcomments
- I just burned 20 minutes because I wanted to play hex minesweeper: https://hexabomb.pgpln.app
Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...
by kristopolous
0 subcomment
- Help me improve local model performance with petsitter!
It basically exploits the face that time can be traded for intelligence with local models
https://github.com/day50-dev/Petsitter
by markdog12
1 subcomments
- I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
- Tried looking at it, but needs a much beefier machine than I have RN.
Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote AOI spend.
- Has anyone managed to cleanly integrate Web search into local models (run with llama.cpp)? The biggest limitation of the class of models that fit into one or two consumer GPUs is that they lack world knowledge, but presumably this can be remedied by enabling access to use the Internet.
- How does llama.cpp use the GPU efficiently as opposed to MLX?
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
by recursivedoubts
1 subcomments
- I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.
by drillsteps5
1 subcomments
- I honestly don't get the hostility against local models in this thread (and in some other threads recently).
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
- Lost count of number of times I read this or similar:
For me it’s the first local model that actually makes sense as a general intelligence.
- In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.
- > I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.
> https://sleepingrobots.com/dreams/stop-using-ollama/
I had faced roadblocks while integrating with openclaw using ollama (Was trying to experiment with `qwen3-vl:2b`). I was tracking the issue back to openclaw at that time, I didn't even consider investigating ollama.
I attached a threads post here where I'm talking to meta ai to expand on both scenarios (not to use ollama, but llama.cpp & my take on the why this is the way it is - ie. commercial gains)
https://www.threads.com/@riojos/post/DaMXIs4k4w8
- Checkout details on what this runs on for local AI here:
https://tokenstead.ai/models/qwen3-6-27b
- 3.5 122B is much better. 27 B is bad at Long context and Svelte
by macwhisperer
0 subcomment
- hi guys... I run specialized quants on my 24gb air.. (I specialize in 3-bit quants that punch above their weight).. try out my version of 3.6-27b I think you be impressed https://huggingface.co/macwhisperer/Qwen3.6-27B-SuperDense
- Best way to make your M series macbook pro feel like a good old fashion intel macbook pro. Run a local model.
- Running this model on a 48 GB memory MacBook Pro when offline, it performs its tasks, but of course, it’s slower than Claude or Codex.
- On AMD R9700, I'm getting ~90 t/s with 35b MTP variant and ~40t/s with dense 27b MTP
- qwen 3.6 27b and qen35b a3b work like magic, if we get dpark speculative decoding versions of these models it will further improve the throughput
by cloudengineer94
0 subcomment
- I'm using Qwen and Gemma 4 locally and it's pretty good stuff, not frontier level but gets the job done.
- Its feasible but that laptop is not available for most devs.
I do have access for a 64 gb ram mac mini but most people don't.
by alansaber
2 subcomments
- Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?
by macwhisperer
0 subcomment
- also for those with only 16gb-- try this model https://huggingface.co/macwhisperer/Gemma4-12B-SuperDense its exceptional!
- Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
- >Real work
This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".
Unless your work is to create thousands wordpress tremplates to sell - this is not a "real work".
Give it a repository (any kind of OSS project will do for an example) and a github issue requesting a knew feature or describing a confirmed bug. (you can and probably should write a prompt for LLM shough, don't just provide the issue itself)
And then whatch it go.
And then judge the result and it's quality.
Sorry, but from my experience 27B is just useless. You do get a result and some times it does work, but most of the times it is not event on junior dev level. And it takes it a lot of time to do the thing, unless you have an extremely expensive machine.
by felooboolooomba
0 subcomment
- What's the minimum requirement for a Nvidia card to run it? For let's say 10 t/s.
- Yup, been rocking theQwen3.6-35B-A3B-MTP-GGUF locally with 88tk/s it's amazing.
- If I have 10k to spend, what should I buy for the best local model experience?
- FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?
- I see OpenCode mentioned in the article, and I would strongly warn against using it for local development because it disrespects caching (the content of the first turn / system prompt is NOT stable). I use Pi which works much better.
- Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
by cat_plus_plus
0 subcomment
- Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
- Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3B
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
by ascii0eks84
1 subcomments
- Very capable lora adapters are surfacing but it seems they are very niche.
- When reading the comments, it seems that in the AI race to zero, Apple was already at the finish line. as predicted.
So it will be no surprise that there will be a time where everyone will be able to run a local model, say GLM 5.2 locally on their machine. Like it or not.
- Qwen is so good a model.
- Hmm, i used it and it can't get past a simple coding test that 5.5 passes with light reasoning
- none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
- goat
- Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
by john-frandsen
0 subcomment
- [flagged]
by ermantrout
0 subcomment
- [flagged]
by Nasser_CAD
0 subcomment
- [flagged]
by yashthakker
0 subcomment
- [dead]
by cloudcanalx
0 subcomment
- [dead]
- [flagged]
- [dead]
by Frankybeatz
0 subcomment
- [flagged]
by ShizuhaLabs
0 subcomment
- [flagged]
- [flagged]
- [dead]
by Getchowned
0 subcomment
- [dead]
by dhanush_2905
0 subcomment
- [dead]
by suthakamal
0 subcomment
- [flagged]
by CurbStomper
0 subcomment
- [dead]
by Reuben_Santoso
0 subcomment
- [dead]
by sourcegrift
0 subcomment
- [dead]
- This is kind of like saying grass is green to be honest