Are people really doing that?
If that's you, know that you can get a LONG way on the $20/month plans from OpenAI and Anthropic. The OpenAI one in particular is a great deal, because Codex usage is charged at a much lower rate than Claude's.
The time to cough up $100 or $200/month is when you've exhausted your $20/month quota and you are frustrated at getting cut off. At that point you should be able to make a responsible decision by yourself.
That said, the privacy argument is compelling for commercial projects. Running inference locally means no training data concerns, no rate limits during critical debugging sessions, and no dependency on external API uptime. We're building Prysm (analytics SaaS) and considered local models for our AI features, but the accuracy gap on complex multi-step reasoning was too large. We ended up with a hybrid: GPT-4o-mini for simple queries, GPT-4 for analysis, and potentially local models for PII-sensitive data processing.
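Roughly the shape of that routing, as a sketch only (the flags, local endpoint, and model names here are illustrative, not our production setup):

```python
# Sketch of hybrid routing: cheap cloud model for simple queries, stronger cloud
# model for analysis, local model for PII-sensitive work. Names and flags illustrative.
from openai import OpenAI

cloud = OpenAI()  # hosted API, reads OPENAI_API_KEY
local = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")  # assumed local OpenAI-compatible server

def answer(query: str, pii_sensitive: bool, simple: bool) -> str:
    if pii_sensitive:
        client, model = local, "local-small"   # keep sensitive data on-box
    elif simple:
        client, model = cloud, "gpt-4o-mini"   # cheap tier for simple queries
    else:
        client, model = cloud, "gpt-4"         # heavier multi-step analysis
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
    )
    return resp.choices[0].message.content
```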
The TCO calculation should also factor in GPU depreciation and electricity costs. A 4090 pulling 450W at $0.15/kWh for 8 hours/day is ~$200/year just in power, plus ~$1600 amortized over 3 years. That's $733/year before you even start inferencing. You need to be spending $61+/month on Claude to break even, and that's assuming local performance is equivalent.
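The same math in Python, with the assumptions spelled out (450W card, $0.15/kWh, 8 hours/day, $1600 purchase price over 3 years; power rounds to the ~$200 above):

```python
# 4090 running-cost sanity check under the stated assumptions.
power_kw = 0.450          # card draw while inferencing
price_per_kwh = 0.15
hours_per_year = 8 * 365  # 8 h/day

electricity = power_kw * price_per_kwh * hours_per_year   # ~= $197/year
amortization = 1600 / 3                                    # ~= $533/year over a 3-year life
total = electricity + amortization                         # ~= $730/year

print(f"${electricity:.0f} power + ${amortization:.0f} depreciation "
      f"= ${total:.0f}/year, i.e. ~${total/12:.0f}/month before any tokens are generated")
```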
With LLMs, I feel like price isn't the main factor: my time is valuable, and a tool that doesn't improve the way I work is just a toy.
That said, I do have hope, as the small models are getting better.
Somewhat comically, the author seems to have made it about 2 days. Out of 1,825. I think the real story is the folly of fixating your eyes on shiny new hardware and searching for justifications. I'm too ashamed to admit how many times I've done that dance...
Local models are purely for fun, hobby, and extreme privacy paranoia. If you really want privacy beyond a ToS guarantee, just lease a server (I know they can still be spying on that, but it raises the threshold).
The above paragraph is meant to be a compliment.
But justifying it based on keeping his Mac for five years is crazy. At the rate things are moving, coding models are going to get so much better in a year, the gap is going to widen.
Also, in the case of his father, who works for a company that must use a self-hosted model (or any other company with that requirement): would a $10K Mac Studio with 512GB RAM be worth it? What about two Mac Studios connected over Thunderbolt, using the newly released support in macOS 26?
LM Studio can run both MLX and GGUF models, but does so from an Ollama-style (though more full-featured) macOS GUI. They also have a very actively maintained model catalog at https://lmstudio.ai/models
Just as we had the golden era of the internet in the late 90s, when the WWW was an eden of certificate-less homepages with spinning skulls on geocities without ad tracking, we are now in the golden era of agentic coding where massive companies make eye watering losses so we can use models without any concerns.
But this won't last and Local Llamas will become a compelling idea to use, particularly when there will be a big second hand market of GPUs from liquidated companies.
I found the instructions for this scattered all over the place so I put together this guide to using Claude-Code/Codex-CLI with Qwen3-30B-A3B, 80B-A3B, Nemotron-Nano and GPT-OSS spun up with Llama-server:
https://github.com/pchalasani/claude-code-tools/blob/main/do...
Llama.cpp recently started supporting Anthropic's messages API for some models, which makes it really straightforward to use Claude Code with these LLMs just by setting ANTHROPIC_BASE_URL, without having to resort to, say, Claude-Code-Router (an excellent library).
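As a quick sanity check that the local server really does speak the messages API, you can point the Anthropic SDK at it; a sketch (the port and model alias below are whatever you configured when starting llama-server):

```python
# Talk to a local llama-server through Anthropic's messages API.
# Assumes llama-server is listening on port 8080 with a model loaded;
# the model alias and port are configuration-dependent.
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8080", api_key="not-needed-locally")

resp = client.messages.create(
    model="qwen3-30b-a3b",  # whatever alias your server exposes
    max_tokens=256,
    messages=[{"role": "user", "content": "Explain what this project's Makefile does."}],
)
print(resp.content[0].text)
```

Claude Code itself then just needs ANTHROPIC_BASE_URL pointed at the same server.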
This is something that frequently comes up and whenever I ask people to share the full prompts, I'm never able to reproduce this locally. I'm running GPT-OSS-120B with the "native" weights in MXFP4, and I've only seen "I cannot fulfill this request" when I actually expect it, not even once had that happen for a "normal" request you expect to have a proper response for.
Has anyone else come across this when not using the lower quantizations or 20b (So GPT-OSS-120B proper in MXFP4) and could share the exact developer/system/user prompt that they used that triggered this issue?
Just like at launch, from my point of view this seems to be a myth that keeps propagating, and no one can demonstrate an innocent prompt that actually triggers this issue on the weights OpenAI themselves published. The author here does seem to have hit the issue, but again there are no examples of actual prompts, so it remains impossible to reproduce.
Latency is not an issue at all for LLMs; even a few hundred ms won't matter.
It doesn't make a lot of sense to me, except when working offline while traveling.
I am still toying with the notion of assembling an LLM tower with a few old GPUs but I don't use LLMs enough at the moment to justify it.
I bet you could build a stationary tower for half the price with comparable hardware specs. And unless I'm missing something you should be able to run these things on Linux.
Getting a maxed out non-apple laptop will also be cheaper for comparable hardware, if portability is important to you.
Personally, I run a few local models (around 30B params is the ceiling on my hardware at 8k context), and I still keep a $200 ChatGPT subscription because I'm not spending $5-6k just to run models like K2 or GLM-4.6 (they're usable, but clearly behind OpenAI, Claude, or Gemini for my workflow).
I got excited about aescoder-4b (a model that specializes in web design only) after its DesignArena benchmarks, but it falls apart on large codebases and is mediocre at Tailwind.
That said, I think there's real potential in small, highly specialized models, like a 4B model trained only for FastAPI, Tailwind, or a single framework. Until that actually exists and works well, I'm sticking with remote services.
It seems so obvious to me, but I guess people are happy with claude living in their home directory and slurping up secrets.
These are no more than a few dozen lines I can easily eyeball and verify with confidence; that's done in under 60 seconds and leaves Claude Code with plenty of quota for significant tasks.
I use Claude for all my planning, create task documents and hand over to GLM 4.6. It has been my workhorse as a bootstrapped founder (building nocodo, think Lovable for AI agents).
The biggest saving comes from keeping the context small and, where many turns are required, switching to smaller models. For example, a single 30-minute troubleshooting session with Gemini 3 can cost $15 if you run it "normally", or $2 if you use agents and wipe the context after most turns (which works because progress is tracked in a plan file).
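The pattern looks roughly like this (plan-file name, endpoint, and model are placeholders, not my actual setup):

```python
# Sketch: keep each turn's context tiny by persisting progress in a plan file
# and starting every step from a fresh, minimal message history.
from pathlib import Path
from openai import OpenAI

client = OpenAI()            # could equally be a smaller/cheaper model behind the same API
PLAN = Path("PLAN.md")       # placeholder plan file tracking what's done and what's next

def run_step(task: str) -> str:
    plan = PLAN.read_text() if PLAN.exists() else "(no plan yet)"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder
        messages=[
            {"role": "system", "content": "Work one step at a time and keep answers short."},
            {"role": "user", "content": f"Plan so far:\n{plan}\n\nCurrent task:\n{task}"},
        ],
    )
    PLAN.write_text(plan + f"\n- {task}: done")   # progress lives in the file, not in chat history
    return resp.choices[0].message.content
```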
There may be other reasons to go local, but I would say that the proposed way is not cost effective.
There's also a fairly large risk that this hardware is sufficient now but will be too small before long. So there is a significant financial risk built into this approach.
The article proposes using smaller/less capable models locally. But this argument also applies to online tools! If we use less capable tools, even the $20/mo subscriptions won't hit their limits.
So I fired up Cline with gpt-oss-120b, asked it to tell me what a specific function does, and proceeded to watch it run `cat README.md` over and over again.
I'm sure it's better with the Qwen Coder models, but it was a pretty funny first look.
The only legit use case for local models is privacy.
I don't know why anyone would want to code with an intern level model when they can get a senior engineer level model for a couple of bucks more.
It DOESN'T MATTER if you're writing a simple hello world function or building out a complex feature. Just use the f*ing best model.
When I tried gpt-oss and qwen using ollama on an M2 Mac the main problem was that they were extremely slow. But I did have a need for a free local model.
Let's be generous and assume you are able to get an RTX 5090 at MSRP ($2000) and ignore the rest of your hardware, then run a model that is the optimal size for the GPU. A 5090 has one of the best throughputs in AI inference for the price, which benefits local AI cost-efficiency in our calculations. According to this reddit post, it outputs Qwen2.5-Coder 32B at 30.6 tokens/s. https://www.reddit.com/r/LocalLLaMA/comments/1ir3rsl/inferen...
It's probably quantized, but let's again be generous and assume it's not quantized any more than the models on OpenRouter. Also assume you are able to keep this GPU busy with useful work 24/7 and ignore your electricity bill. At 30.6 tokens/s you're able to generate about 965M output tokens in a year, which we can conveniently round up to a billion.
Currently the cheapest Qwen2.5-Coder 32B provider on OpenRouter that doesn't train on your input runs it at $0.06/M input and $0.15/M output tokens. So it would cost $150 to serve 1B tokens via API. Let's assume input costs are similar since providers have an incentive to price both input and output proportionately to cost, so $300 total to serve the same amount of tokens as a 5090 can produce in 1 year running constantly.
Conclusion: even with EVERY assumption in favor of the local GPU user, it still takes almost 7 years for running a local LLM to become worth it. (This doesn't take into account that API prices will most likely decrease over time, but also doesn't take into account that you can sell your GPU after the breakeven period. I think these two effects should mostly cancel out.)
In the real world in OP's case, you aren't running your model 24/7 on your MacBook; it's quantized and less accurate than the one on OpenRouter; a MacBook costs more and runs AI models a lot slower than a 5090; and you do need to pay electricity bills. If you only change one assumption and run the model only 1.5 hours a day instead of 24/7, then the breakeven period already goes up to more than 100 years instead of 7 years.
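Worked through in code, under those same generous assumptions:

```python
# Breakeven sketch: $2000 RTX 5090, 30.6 tok/s, OpenRouter at $0.15/M output tokens
# (doubled to ~$0.30/M to cover input as well), electricity ignored.
GPU_COST = 2000
TOKS_PER_SEC = 30.6
SECONDS_PER_YEAR = 365 * 24 * 3600

tokens_per_year = TOKS_PER_SEC * SECONDS_PER_YEAR       # ~0.97B tokens if busy 24/7
api_cost_per_year = tokens_per_year / 1e6 * 0.30        # ~$290 of equivalent API usage
print(f"24/7 usage: breakeven in ~{GPU_COST / api_cost_per_year:.1f} years")

# Same GPU used 1.5 hours a day instead of 24/7:
fraction = 1.5 / 24
print(f"1.5 h/day: breakeven in ~{GPU_COST / (api_cost_per_year * fraction):.0f} years")
```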
Basically, unless you absolutely NEED a laptop this expensive for other reasons, don't ever do this.
For example, simple tasks CAN be handled by Devstral 24B or Qwen3 30B A3B, but often they fail at tool use (especially quantized versions) and you often find yourself wanting something bigger, where the speed falls a bunch. Even something like zAI GLM 4.6 (through Cerebras, as an example of a bigger cloud model) is not good enough for doing certain kinds of refactoring or writing certain kinds of scripts.
So either you use local smaller models that are hit or miss, or you need a LOT of expensive hardware locally, or you just pay for Claude Code, or OpenAI Codex, or Google Gemini, or something like that. Even Cerebras Code that gives me a lot of tokens per day isn't enough for all tasks, so you most likely will need a mix - but running stuff locally can sometimes decrease the costs.
For autocomplete, the one thing where local models would be a nearly perfect fit, there just isn't good software: Continue.dev's autocomplete sucks and is buggy (with Ollama), there don't seem to be good enough VS Code plugins to replace Copilot (e.g. those smart edits, where you change one thing in a file and need similar changes 10, 25 and 50 lines down), and many aren't even trying: KiloCode had some vendor-locked garbage with no Ollama support, and Cline and RooCode aren't even attempting to support autocomplete.
And not every model out there (like Qwen3) supports FIM properly, so for a while I had to use Qwen2.5 Coder, meh. Then, when plugins do come out, they're all pretty new and you don't know what supply chain risks you're dealing with. It's the one use case where local models could be good, but... they just aren't.
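For reference, raw FIM against a local server is simple enough to drive by hand; a sketch (the port, endpoint, and sampling settings are assumptions for llama-server, and the special tokens are the ones Qwen2.5-Coder uses):

```python
# Fill-in-the-middle request to a local llama-server running Qwen2.5-Coder.
# Port, endpoint, and sampling settings are assumptions; adjust to your setup.
import requests

prefix = "def read_config(path):\n    with open(path) as f:\n"
suffix = "\n    return config\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

resp = requests.post(
    "http://localhost:8080/completion",
    json={"prompt": prompt, "n_predict": 64, "temperature": 0.2},
)
print(resp.json()["content"])   # the model's proposed "middle", i.e. the missing body
```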
For all of the billions going into AI, someone should have paid a team of devs to create something that is both open (any provider) and doesn't fucking suck. Ollama is cool for the ease of use. Cline/RooCode/KiloCode are cool for chat and agentic development. OpenCode is a bit hit or miss in my experience (copied lines getting pasted individually), but I appreciate the thought. The rest is lacking.
How anyone in this day and age can still recommend this is beyond me.
But I keep thinking: It should be possible to run some kind of supercharged tab completion on there, no? I'm spending most of my time writing Ansible or in the shell, and I have a feeling that even a small local model should give me vastly more useful completion options...
The best way to get the correct answer on something is to post the wrong thing. Not sure where I got this from, but I remember it was in the context of Stack Overflow questions getting the correct answer in the comments of a reply :)
Props to the author for their honesty and having the impetus to blog about this in the first place.
I grabbed Gemini for $10/month during Black Friday, GPT for $15, and Claude for $20. Comes out to $45 total, and I never hit the limits since I toggle between the different models. Plus it has the benefit of not dumping too much money into one provider or hyper focusing on one model.
That said, as soon as an open weight model gets to the level of the closed ones we have now, I'll switch to local inference in a heartbeat.
Frankly, I don't understand it at all, and I'm waiting for the other shoe to drop.
That 128GB of RAM is nice, but the time to first token is so long on any context over 32k, and the results are not even close to Codex or Sonnet.