But then they talk about using a newly purchased Mac to do the inference, running at full capacity, 24/7. Why would you do that? Apple silicon is fast but the author points out: you're only getting 10-40 tokens per second. It's not bad, but it's not meant for this!
It's comparing apples to oranges. Yeah, data centers don't pay residential electricity rates. Data centers use chips that are power efficient. Data centers use chips that aren't designed to be a Mac.
Apple silicon works out pretty well if you're not burning tokens 24/7/365 and you're not buying hardware specifically to do it. I use my Mac Studio a few times a week for things that I need it for, but I can run ollama on it over the tailnet "for free". The economics work when I'm not trying to make my Mac Studio behave like an H100 cluster with liquid cooling. Which should come as no surprise to anyone: more tokens per watt on multi-tenant hardware with cheap electricity will pretty much always win.
You also get the benefit of privacy, freedom from censorship, and control over the model used (i.e. it will not be rugpulled on you in three months after you've built a workflow around a specific model's idiosyncrasies).
Setting aside everything else that u/bastawhiz said[0], the obvious fact here is that Claude, OpenAI, Gemini et al. are quite literally burning through hundreds of billions of dollars and selling the result back to you for pennies on the dollar in the hopes that they get to be the only one left.
If I spend $10 growing oranges and sell them to you for $1, then of course it's more expensive for you to do the growing.
I feel like I'm taking crazy pills. These models will become more expensive over time; it's functionally impossible for them not to. They just want to capture the market before they have to stop selling at a huge loss.
If you want a faster model, go for qwen3.6 35B (or gemma 4 26B if gemma models perform better for your tasks). There is a reason why people (myself included) haven't shut up about those two (especially the 27B). It's small enough to run at a decent speed (especially with the built-in MTP that finally has official llama.cpp support) and for many workloads (every benchmark I have ever thrown at it) it is matching or surpassing models it has no right to.
A couple of days ago I woke up to my internet being down, started 27B in pi, gave it my router's password and told it to diagnose what's wrong, went to grab a coffee, and by the time I got back, I had a full report with suggestions on how to proceed. I love openrouter and I use it for many things, but it is not cheaper.
Subjectivity and opinions based on personal experience with all those models are implied, naturally. I assume the 31B gemma has cases in which it edges ahead; I've just failed to find any, and I have been running all 4 models mentioned nonstop for different tasks since hours after each of them dropped. Hell, for my hermes, I've started getting better results once I switched from gemma 4 26B to qwen3.5 9B, not even the massively improved 3.6 series. It just feels outdated/cherry-picked to not use what by many accounts is the current consumer hardware SOTA if doing such an analysis.
More critically, in practice, setting up local models seems more like a hobby, an educational exercise, or an act of privacy control than it is for cost cutting or productivity.
This is like comparing an e-bike at home with an e-bike rental and concluding that we therefore need to rent a Toyota since it can go at similar speeds. Getting tired of bad posts getting so much attention.
We all like our expensive toys.
Second thing is you can substantially increase local token generation if you use agent teams. Single conversations are memory-bandwidth bound and don't make full use of your compute. If you can batch tokens from multiple agents you can easily 5x token generation.
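To make the batching point concrete, here is a rough back-of-the-envelope sketch. It assumes a toy model in which each decode step reads the weights from memory once (shared across the batch) and is limited by whichever of memory or compute is slower. The function name and every number in it are hypothetical, not measurements of any real machine, and KV-cache traffic is ignored, so real scaling will be worse.

```python
def decode_tokens_per_sec(batch_size, model_bytes, mem_bandwidth,
                          flops_per_token, compute_flops):
    """Rough aggregate decode throughput for `batch_size` concurrent streams.

    Each decode step streams the weights from memory once (shared by the
    whole batch) and does `flops_per_token` of math per stream. The step
    takes as long as the slower of the two. KV-cache traffic is ignored.
    """
    memory_time = model_bytes / mem_bandwidth
    compute_time = batch_size * flops_per_token / compute_flops
    return batch_size / max(memory_time, compute_time)


# All numbers below are hypothetical, chosen only for illustration.
MODEL_BYTES = 20e9        # ~30B parameters at roughly 5 bits per weight
MEM_BANDWIDTH = 500e9     # bytes per second
FLOPS_PER_TOKEN = 60e9    # ~2 FLOPs per parameter per token
COMPUTE_FLOPS = 30e12     # sustained FLOP/s

for batch in (1, 4, 8, 32):
    tps = decode_tokens_per_sec(batch, MODEL_BYTES, MEM_BANDWIDTH,
                                FLOPS_PER_TOKEN, COMPUTE_FLOPS)
    print(f"batch={batch:>2}: ~{tps:.0f} tokens/s total")
```

In this toy model total throughput grows linearly with batch size until the compute ceiling kicks in (around batch 20 with these made-up numbers), while each individual stream stays at the single-conversation speed.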
Shortening the lifespan?
I expect self-hosted to be quite competitive pretty soon. GitHub Copilot is already wildly more expensive than it was last month. People are going from spending a few bucks to a few thousand for that same usage. So, if it doesn't get a lot more efficient (like 3x the tokens, or more, from the same infrastructure), the prices will have to go up quite a lot to keep the lights on. Everything in AI is running partly on investors' money. Everyone is trying to buy a monopoly, an insurmountable lead, and some way to lock people into a specific model and ecosystem, but so far that hasn't happened (except for people who voluntarily lock themselves into a specific ecosystem, and even in those cases it's usually easy to get the AI to help move to another; there are no truly unique features in AI that at least one, and probably three or four, other players don't also offer).
* Industrial power pricing
* Wholesale hardware pricing
* Utilization density of shared API
means API always wins a cost shootout.
Privacy & tinkering is cool too though
But in _every_ metric other than privacy it was better to run via OpenRouter than a local model, and not by a small amount.
Direct link to the comparison charts:
https://sendcheckit.com/blog/ai-powered-subject-line-alterna...
Add to that the privacy improvements and data protection, and potentially further specific inference if needed, and it's a no-brainer.
Again, AI is a tool, and it's about the right tool for the job. I would wager, with no evidence looked up, that the majority of devs would be happy with 10-30 tokens per second locally.
If it costs me slightly more in the short term, but I don't depend on any providers' price gouging as soon as I fully depend on them, I will consider it a win.
This is common when processing PII. Lawyers, doctors, or similar should not be using cloud solutions.
Also it's harder to set up and always more expensive than any cloud solution.
I wish people stopped deluding themselves: I regularly try (and benchmark for my purposes) local models and they are NOWHERE near the huge models like Sonnet or Opus. Nowhere. Yes, you can sometimes get plausible-looking output for simple tasks, but for anything even remotely requiring thinking there is simply no comparison.
Local models are useful. I use them for spam filtering, and soon intend to use them for image tagging and OCR. But let's stop saying they can get us "anthropic sonnet levels of performance", because that's just not true.
also you gotta realize frontier models have massive "system prompts" that clog up the context window with garbage.
being able to write your own system prompts gives you a MASSIVE edge.
So we shouldn’t be comparing it to the cost of OpenRouter API access at all, we should be comparing it to the cost of a 4-credit university course.
tldr;
Hardware depreciation costs are the major factor.
But, if we assume ZERO hardware depreciation (not realistic), then local inference becomes super cheap: roughly 90%+ cheaper.
Third case: the break-even happens only if we can get, at the very, very, very least, 8.7 years of useful hardware life. A more realistic number, however, when working 8 hrs/day rather than 24 hrs/day, is around 25 years.
So, for now, local inference is preferable if you deeply care about privacy. From a cost perspective, it's still not there.
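A minimal sketch of how a break-even figure like this can be computed, assuming the only costs are the hardware purchase and electricity, and that the alternative is paying a flat API price per million tokens. The function name and all input values are hypothetical placeholders, not the numbers behind the 8.7 and 25 year figures above.

```python
import math


def break_even_years(hardware_usd, tokens_per_sec, hours_per_day,
                     api_usd_per_mtok, watts, usd_per_kwh):
    """Years of use before local hardware pays for itself versus an API.

    Returns math.inf if electricity alone costs at least as much as the API.
    """
    hours_per_year = hours_per_day * 365
    mtok_per_year = tokens_per_sec * 3600 * hours_per_year / 1e6
    api_cost_per_year = mtok_per_year * api_usd_per_mtok
    power_cost_per_year = watts / 1000 * hours_per_year * usd_per_kwh
    savings_per_year = api_cost_per_year - power_cost_per_year
    if savings_per_year <= 0:
        return math.inf
    return hardware_usd / savings_per_year


# Hypothetical inputs, for illustration only.
common = dict(hardware_usd=5400, tokens_per_sec=30,
              api_usd_per_mtok=0.50, watts=75, usd_per_kwh=0.18)

full_time = break_even_years(hours_per_day=24, **common)
work_day = break_even_years(hours_per_day=8, **common)
print(f"24 h/day: {full_time:.1f} years, 8 h/day: {work_day:.1f} years")
```

In this simple model both the API savings and the electricity bill scale with hours of use, so going from 24 hrs/day to 8 hrs/day triples the break-even time, which is in the same ballpark as the jump from 8.7 to around 25 years.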
Also nobody I know picks local over OpenRouter on price. They pick it for offline, for data not leaving the machine, for no rate limits, for not having a provider go down mid-task. If $/Mtok is the only axis, sure, cloud wins.
In practice the pattern I see is leaving a small model running on easy background tasks while using the laptop normally, not a dedicated inference box hammered flat out for 5 years.
Now, it looks like the providers I use have good limits. But I do worry about this.
Obviously if the RAM apocalypse passes, high-end configurations will hold their resale value worse than base models, but resale value is still a hefty bonus of Apple hardware that might change the math a lot.
Next paragraph
> At ~50-100 watts and $0.18/kWh that's $0.009 or $0.018 per hour. $0.02 per hour. $0.48 cents per day for the electricity to be running inference at 100%.
lol
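For reference, the quoted arithmetic can be checked with a short sketch. The wattage range and the $0.18/kWh rate are taken from the quoted paragraph; the function name is made up for illustration.

```python
def electricity_cost_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Cost in USD of running a constant load of `watts` for one hour."""
    return watts / 1000.0 * usd_per_kwh


RATE = 0.18  # USD per kWh, from the quoted paragraph

for watts in (50, 100):
    hourly = electricity_cost_per_hour(watts, RATE)
    daily = hourly * 24
    print(f"{watts} W: ${hourly:.3f}/hour, ${daily:.2f}/day")
```

At 100 W this works out to about $0.43 per day. The quoted $0.48 appears to come from rounding the hourly figure up to $0.02 first, and either way the unit is dollars, not "$0.48 cents".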
So, for comparison, a 5090 has 32GB of VRAM and you can get one for ~$3000 maybe. To go beyond that memory with current generation (ie Blackwell) GPUs, you have to go to the RTX 6000 Pro w/ 96GB of VRAM and that's almost $10,000 for the GPU by itself. Beyond that you're in the H100/H200 GPUs and you're talking much bigger money.
Part of the problem here is the author is looking at laptops. That's the only place you'll find the M5 Max currently. The real problem here is that the Mac Studios haven't been updated in almost 2 years. There were configs of those with 256/512GB of RAM but they've been discontinued, possibly because of the RAM shortage and possibly because they're reaching EOL. Apple hasn't said why. They never do.
Many expect M5 Ultra Mac Studios in Q3 and the M5 Ultra may well have >1TB/s of memory bandwidth (for comparison, the 5090 is 1.8TB/s). Memory bandwidth isn't the only issue. A 5090 will still have more compute power (most likely) but being able to run large models without going to a $10k+ GPU could be huge.
But yes, it's hard to compete with the scales and discounted electricity of a data center. Even H200 compute hours are kinda cheap if you consider the capital cost of what you're using.
I've looked into getting a 128GB M5 Max 16" MBP. That retails for $6k. You might be able to get it for $5400. But I don't think the value proposition is quite there yet. It's close though.
E.g.
* Privacy
* Uptime
* Future cost structure controls
This is a field that has moved very quickly. And it has moved in a direction to try to trap users into certain habits. But these habits might not best align with what best benefits end users today or some time in the future.
> 64 gigs should run a model like Gemma 4 31b
No, it can run anything in the 70B range. It's a notable quality upgrade from the 30B, which isn't obvious because the famous flurry of April releases didn't contain any 70Bs.
It can also run 120B in UD-Q3. Or 230B disk-streamed.
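A rough way to sanity-check what fits: the weights take about parameters times bits-per-weight divided by 8 bytes. The sketch below assumes that rule of thumb plus a hypothetical reserve for the OS, KV cache and runtime. The bits-per-weight values are ballpark assumptions for typical quantizations, not exact figures for any specific GGUF file.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         ram_gb: float, reserve_gb: float = 8.0) -> bool:
    """True if the weights fit in RAM after setting aside `reserve_gb`.

    `reserve_gb` is a made-up allowance for the OS, KV cache and runtime.
    """
    return weights_gb(params_billion, bits_per_weight) <= ram_gb - reserve_gb


# (label, billions of parameters, assumed bits per weight)
candidates = [
    ("31B at ~8 bits", 31, 8.0),
    ("70B at ~4.5 bits", 70, 4.5),
    ("120B at ~3.5 bits", 120, 3.5),
    ("230B at ~4 bits", 230, 4.0),
]

for label, params, bits in candidates:
    size = weights_gb(params, bits)
    verdict = "fits" if fits(params, bits, ram_gb=64) else "does not fit"
    print(f"{label}: ~{size:.1f} GB of weights, {verdict} in 64 GB")
```

Under these assumptions a 70B model at around 4 to 5 bits and a 120B model at around 3.5 bits both squeeze into 64 GB, while 230B does not fit in memory, which is consistent with needing to stream it from disk.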
If a small model is great, it will be hosted somewhere with good electricity costs and will be utilized 24/7.
Isn't it the 2+2 of economics?
CPU is a commodity, and we are still buying CPUs and RAM from vendors for the same reason.
Running locally, you get the confidentiality of knowing your tokens are only ever being processed by your own hardware. You get the integrity of knowing your model isn't being secretly or silently quantized differently behind the scenes, or having its weights updated in ways you don't want. And you get the availability of never having to worry about an API outage, or even an internet outage, for local inference capacity.
And this isn't even starting to address the whole added world of features and tunability you get when you control the inference stack. Sampling parameters, caching mechanisms, interpretability etc.
OpenRouter may be cheaper than frontier labs, but you still lose all of these benefits from open weight models the moment you decide to rely on someone else's hardware for your processing.
But you are dependent on them, which is the biggest factor IMO. There was a website posted here before about people getting banned from using it over silly reasons, not to mention price hikes or privacy concerns. Maybe right now it’s more expensive or slower to run locally, but you are in full control of everything.
When I see so many options, that looks like it would take months to audit whether it actually makes sense and is safe to use. But I guess some people are fine with YOLO-ing it.
This is why the idea that the AI labs are in trouble because inference will be a commodity is _completely backwards_. Some of the largest and most powerful companies in the world sell commodities. They compete on scale and efficiency, and you are never going to be able to compete with the big labs on either.
* but Apple will collect all your keystrokes anyway
Oh, and the author didn't mention at all anything related to inference optimization, so no idea if they even know about or enabled things like speculative decoding, optimized attention backends, quantization, etc.
At least AI slop would have hit on far more of the things I listed above. This is worse-than-AI.
Chances are that token prices will go down, but chances also are that the AI bubble pops and all of a sudden all these companies will either have to make a buck out of the inference or go bankrupt.
Getting your own hardware just grants you stable pricing.