- Getting so close to good!
I consider Gemma 4 31B (dense, no MoE) the new baseline for local models. It's obviously worse than the frontier models, but it feels less like a science experiment than any previous local model I've run, including GPT OSS 120B and Nemotron Super 120B.
On my M5 Max with 128 GB of RAM and the full 256K context window, I see RAM use spike to about 70 GB, with something like 14 GB of system overhead. A 64 GB Panther Lake machine with the full Arc B390, or a 48 GB Snapdragon X2 Elite machine, could probably run it with a 128K to 256K context window. Maybe you can squeeze it into 32 GB (27.5 GB usable) with a 32K context window?
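If you want a rough sanity check on whether a given context length fits a memory budget, a back-of-envelope sketch like the one below helps. The layer count, KV-head count, and head dimension are guesses for illustration, not Gemma 4's published architecture:

```python
# Back-of-envelope memory estimate for a quantized dense model.
# Every architectural number here is an assumption, not a published spec.
def estimate_gb(params_b=31.0, bits_per_weight=4.5,   # ~Q4 quant incl. overhead
                layers=48, kv_heads=8, head_dim=128,   # guessed architecture
                context=131072, kv_bytes=2):           # fp16 KV cache
    weights_gb = params_b * bits_per_weight / 8                          # billions of params -> GB
    kv_gb = 2 * layers * kv_heads * head_dim * context * kv_bytes / 1e9  # K and V for every token
    return weights_gb, kv_gb

w, kv = estimate_gb()
print(f"weights ~{w:.0f} GB + KV cache @128K ~{kv:.0f} GB = ~{w + kv:.0f} GB before runtime overhead")
```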
Even last year, seeing this kind of performance on a mainstream-ish/plus configuration would have seemed like a pipe dream.
by sourcecodeplz
2 subcomments
- Running LLMs locally is fun and powerful, but if you want to get work done... it is a big headache. You have to pre-plan, make specs, etc. The big OpenAI and Claude models just get you there with a few sentences.
- I could have used this article before I spent the weekend arriving at the same conclusion!
Same laptop, and my contrived test was having it fix 50 or so lint errors in a small vibe-coded C++ repo. I wanted it to be able to handle a bunch of small tasks without getting stuck too often.
GPT OSS 20B was usable but slow, and it frequently made mistakes like adding or duplicating statements unnecessarily, listing things as fixed without editing the code, and so on.
Qwen 3.5 9B with Opencode was much faster and actually able to work through a majority of the lint warnings without getting stuck, even through compaction, and it fixed every warning with a correct edit.
I tried 4-bit MLX quants of Qwen 3.5 9B, but it would eventually crash due to insufficient memory. I switched to GGUF, which I run with llama.cpp, and it runs without crashing.
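For anyone trying to reproduce this, here's a minimal sketch using the llama-cpp-python bindings (not necessarily the exact setup described above); the GGUF filename is a placeholder, and capping n_ctx is the knob that keeps the KV cache from eating all the memory:

```python
# Minimal sketch via llama-cpp-python; the model filename is hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen3.5-9b-instruct-q4_k_m.gguf",  # placeholder local file
    n_ctx=32768,       # cap context to bound KV-cache memory
    n_gpu_layers=-1,   # offload all layers to Metal/GPU if available
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Fix the unused-variable lint warning in this function:\n..."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```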
It is absolutely not comparable to frontier models. It's way slower, gets basic info wrong, and really can't handle non-trivial tasks in one go. I asked it for an architecture summary of the project and it claimed use of a library that isn't present anywhere in the repo. So YMMV, but it's still nice to have, and hopefully the local LLM story can get much better on modest hardware over time.
- I think it's useful to be realistic about what you can do with a local model, especially something as small as the 9B the author is using. A 9B model is around the level of Sonnet 3.6 - it can do autocomplete and small functions but it loses track trying to understand large problems.
But they are interesting and fun to play with! I do a LOT of work on local agent harnesses etc., mostly for fun.
My current project is a zero install agent: https://gemma-agent-explainer.nicklothian.com/ - Python, SQL and React all run completely in browser. Gemma E4B is recommended for the best experience!
This is under heavy development and needs Chrome for both HTML5 Filesystem API support and LiteRT (although most Chromium-based browsers can be made to work with it).
It's different from most agents because it is zero install: the model runs in the browser using LiteRT/LiteLLM (which gives better performance than Transformers.js), and the Filesystem API gives it optional sandboxed access to a directory to read from.
It is self-documenting - you can ask questions like "How is the system prompt used?" in the live help pane, and it has access to its own source code.
There's quite a lot there: press "Tour" to see it all.
Will be open source next week.
- Critics are (rightly) pointing out that these models are not on par with SOTA for complex coding tasks. But many seem to forget that a large part of white-collar office work is Excel crunching, file moving, translating dry legal documents, e-mail drafting, PPT drudgery, etc. These are absolutely doable with 30-35B+ models, with the added benefit of keeping company data private.
- I am running a quantized Qwen 3.6 9B model on my M4 Pro 48GB and it is barely useful for some basic pi.dev/cc-driven development. I think 128GB desktops are the sweet spot to actually get meaningful work done. However, getting your hands on one of these machines is difficult at the moment.
As much fun as it is to run these things locally, don't forget that your time is not free. I am slowly migrating my use cases to OpenRouter, where I run the largest Qwen model for < $2-3/day with serious use for personal projects.
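OpenRouter exposes an OpenAI-compatible endpoint, so the switch is mostly a base-URL change. A rough sketch; the model slug is a placeholder, check their catalog for the current Qwen entry:

```python
# Sketch of routing requests through OpenRouter instead of a local model.
# The model slug is a placeholder -- look up the current Qwen slug on openrouter.ai.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

resp = client.chat.completions.create(
    model="qwen/qwen-2.5-coder-32b-instruct",  # placeholder slug
    messages=[{"role": "user", "content": "Summarize this diff:\n..."}],
)
print(resp.choices[0].message.content)
```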
- Thanks for sharing. I made a post earlier on Bluesky describing my random setup on a 32GB M2 Studio. I'd love feedback. I'm a monkey: if I don't see it, I can't do it.
https://bsky.app/profile/mooresolutions.io/post/3mliilyf2i22...
by busfahrer
1 subcomment
- I am considering an M5 Pro (18/20C) MacBook with 64GB of RAM, but I'm having a really hard time finding benchmarks of real-world models:
Could somebody please provide some tokens-per-second numbers, for example for Qwen 3.6 35B/A3B, specifically for the Q4 and Q6 quants?
- I got qwen3.6:27B running on my 4090 (24GB) with ~128K context, leveraging some of the recent turboquant/rotorquant memory optimizations for activations. I highly suggest going up to that. The q4_xl+rotorquant combo is pretty good.
Some reference code if you want to throw your agent at it.
https://github.com/rapatel0/rq-models
by isaisabella
0 subcomments
- I'd rather spend thousands of dollars on a Mac than subscribe to an API. A local model allows me to do my work anytime and anywhere, without worrying about privacy leaks.
- Recent models (Qwen 3.6 and Gemma) can really do coding locally. Feels like SOTA from maybe a year ago? But you would want about 32-40GB total memory. 24GB is just a bit short of that. A gaming PC with 16GB graphics card and 32GB RAM brings you very close to a usable coding system.
- Having an M3 with 36 GB, I was under the assumption that I could run Qwen and similar models. It's quite easy to set up: you can use pi or hermes for CLI access, or "Continue" to use it in VS Code. You can choose between MLX, Ollama, and more to run the model itself. It's not rocket science, but the results are also not satisfying.
I use it occasionally for very easy tasks: fixing typos or updating metadata in blog posts. So yeah, it improves productivity. But coding-wise it's far from Codex, Claude, et al.
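If Ollama is the runner, that kind of low-stakes task is only a few lines with its Python client; the model tag and file path below are just examples, use whatever you actually have pulled:

```python
# Sketch of the "fix typos in a blog post" workflow via the Ollama Python client.
# The model tag and file path are examples, not specific recommendations.
import ollama

with open("posts/my-post.md", encoding="utf-8") as f:
    post = f.read()

resp = ollama.chat(
    model="qwen2.5:7b",  # placeholder tag; any local instruct model works
    messages=[{
        "role": "user",
        "content": "Fix typos only; do not change wording or formatting:\n\n" + post,
    }],
)
print(resp["message"]["content"])
```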
- Still trying to understand if a MacBook Pro M5 Max with 128GB is likely to be able to run coding models well enough that I can cancel my Codex subscription, or at least go down to the $20/month plan.
- Beyond the models getting better, there are still huge gains available on the inference engine side, with new tricks like Dflash, MRT, and turboquant - for some use cases these can multiply the speed. There are even some model-specific optimized kernels, like for DeepSeek 4 flash, that seem wild.
Makes me feel we are nowhere near the optimum yet.
Examples: https://dasroot.net/posts/2026/05/gemma-4-speed-hacks-mtp-df...
https://x.com/bindureddy/status/2052982206344409242?s=46
- I'd pick a much more open system with more capabilities for a little bit more money, e.g. a Jetson Orin 64GB (unified memory). Runs Linux out of the box.
- How about a M4 with 16GB of memory?
by MinimalAction
1 subcomment
- Well, but if I have a MacBook Air M4 with 16GB, I don't know what useful models I can run.
- What kind of harness do people use with these local models? I am quite happy with the Claude Code permission model and interface in general for coding stuff (for chat-y interfaces I have no real opinion).
by kristianpaul
0 subcomments
- Good to keep hideThinkingBlock at its default; it's on purpose, to be able to steer the model.
- I'll have to try some more. I've been playing with gpt-oss 20b on my M4 24GB but it hasn't been the best experience.
- So, I'm interested: how many people are running higher-end AI models locally? I figure if I'm spending $800/month on tokens, I can build a pretty beefy local machine for the cost of a few months' spend. What is people's experience with, say, a $5k server custom-built for (and only for) running an AI model?
- "What does work is a more interactive workflow where you’re clearly communicating with the model step by step, and giving it a lot of guidance. I’m sure that sounds pointless to many of you, why use a model where you have to babysit it as it works, but I actually found that it encouraged me to be more engaged. "
This sort of thing is key to knowing what's going on and not having your brain fully atrophy.
by BubbleRings
1 subcomment
- People do use SOTA LLMs for other things besides computer programming.
For instance, if you are an independent inventor trying to write a patent while keeping your patent lawyer expenses to a minimum, you want to write as much of the first draft(s) of the patent as possible yourself. (You’ll save billable hours with your patent lawyer, and you’ll end up with a better patent because you’ll communicate your innovations more clearly to your lawyer.)
However, and this is the big thing, you absolutely do not want to be asking a SOTA LLM for help with the language in your patent application. This is because describing your invention to a web-based LLM could be considered a public "disclosure" of your invention, which (after a one-year grace period goes by) could put your invention in the public domain, basically… and thereby prevent you (or anyone else) from ever being able to patent the invention. Plus, you know, a random unscrupulous employee at the SOTA company could be reviewing logs, notice your great idea, and file a patent on it before you do. Remember, the United States patent office went to "first inventor to file" in 2013.
Oh and don’t take legal advice from random people on the internet by the way.
by redsocksfan45
0 subcomments
- I'm puzzled. The M4, as far as I know, doesn't have 24GB. Did the author mean an M40?
- A useful data point to know about this setup is how many tokens/sec it generates.
- The site does not have SSL. Could you please enable it so that I can read the article?