by Greenpants
32 subcomments
- I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.
I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.
It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).
Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
by horsawlarway
8 subcomments
- For personal use, yes.
I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.
I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.
To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.
For my personal needs, free beats $100/m.
I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).
Some example projects
- Replacement launcher for android tvs (with usage monitoring and tracking for kids)
- Custom admin portals for my k8s cluster services
- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)
- Grocery list management and meal planning (mostly via openclaw)
- some custom workflows for 3d asset generation in comfyui.
---
Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
by bluejay2387
4 subcomments
- About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but its enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UI's as that seems to be the weakest element to working in Qwen.This isn't a recommendation because I don't think most people have an RTX 6000 laying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.
Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
by codinhood
10 subcomments
- I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.
Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.
Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
by pierotofy
7 subcomments
- Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
- The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.
If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.
If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.
The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
- For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.
Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.
Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.
Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.
EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
- Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).
I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.
And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.
Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
- Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.
I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
by garethsprice
0 subcomment
- Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.
I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.
This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.
It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).
The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
- No. I've tried all the OS models up to Qwen 480B and Kimi (the biggest models). None come even close to Claude.
I do mostly scripting, devops, data processing and systems stuff (ansible playbooks, managing network devices, deploying new software for various things that involves reading docs, writing helm charts, modifying existing ones etc).
All other models Gemini, Chatgpt, grok and all OS models don't come even close. I'd rather use Sonet than Qwen.
It's a sad reality. I was thinking about implementing maybe some sort of "sanity checking" by running every prompt twice on two different models doing sanity checking of the first on the second.
Elaborate knowledge systems help a little, but personally I think Anthropic must be doing something "clever" with its models (processing via multiple models etc). Nothing else in my mind explains the discrepancy.
by goranmoomin
0 subcomment
- I'm not using my models locally, but the majority (80% or more) of my coding agent sessions run on open source models, i.e. DeepSeek v4 Pro and Kimi K2.6 with thinking.
A point that I haven't seen come up a lot, but is very valuable to me is that for open source models, I can select the inference provider myself (even if it's not a local GPU), which means that I can enjoy superb speed (i.e. 300 tok/s) while still spending much less than the big providers.
My experience is that if you were fine with the coding models of yesterday (i.e. Claude Opus from Jan/Feb of 2026), you will be fine with either Kimi K2.6 or DeepSeek v4 Pro. Kimi is a bit more smart but has only 256K context and the performance deteriorates (and sometimes just gets stuck) when it fills up the context window. DeepSeek v4 has a 1M context and performs just as well with much less issues. And they both generate very idiomatic code, gives the same vibe of Opus a few months ago.
Since it's also fast (and does not fixate on trying to fix impossible problems, unlike the recent Opus/GPT 5.5 models), a big benefit is that you still control and steer the coding agent and you won't be losing focus like the major models. They are smart, but they don't fixate as much on trying to do stupid things, and since it's fast, you can just interject. It's a much more pleasant experience than the latest models.
I still use the latest models time to time when I expect the agent to fixate all of the problems and figure out everything themselves, but for me open source models are like 80~90% of all of my sessions.
- I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.
I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!
You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.
Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.
I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.
My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
by wsintra2022
1 subcomments
- Reading through these comments, I can't tell any more whats bots posting on behalf of the AI providers trying to dissuade or whether people just have had negative experiences with local ai models.
IMO, Qwen 3.6 27B 8k quants running on a Mac Studio 64g ram, incredible?. No it is not frontier general super shit, its just good. That's it, its good. Its free and private and can take an experienced engineer from being lazy to being really lazy, and that's magic right there. I use llama.cpp and opencode and have great moments of planning some code changes, and letting it run. Walk away. Chill in the hamoc, clean the dishes, have a wank, whatever. Use tmux and ssh in and check in on it. THIS is where the incredible comes in. Anyone telling you otherwise, well check their motives. I have no skin in the game. I just have an easy lazy time.
by cuttysnark
2 subcomments
- I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.
Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.
In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
by grmnygrmny2
0 subcomment
- Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.
The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).
It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.
I’m using Pi as my harness.
- I have been using local LLMs for about a year and I have settled now on Qwen3.6 27b dense model in GGUF on Mac Studio with 512G of RAM with open code as the harness and llmster(LM Studio). I have also used the Qwen 3.6 35B-A3B but the dense model's accuracy is next level with the tradeoff being tokens/sec. With the Qwen3.6 27b, I usually get anywhere from 25-40 tokens/second. Initially I used them for simple tools but for the past 3-4 months, I have been actually doing production grade coding in C/C++ (Automotive Software stack) and Python (Tools) with Qwen3.6 27b.
The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude Code
Some of my recent work using Qwen 3.6:
1. Complete rewrite of Power management Service in C using the existing C++ code as reference
2. Tool to parse contents from really complex specifications in Excel format
3. Tool to translate CJK contents to english for feeding into KG
by jodoherty
1 subcomments
- I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.
I find it useful.
This side project highlights a similar approach to how I scope and tackle projects at work now:
https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md
https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...
You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.
I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.
Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.
My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
by macwhisperer
0 subcomment
- I code with like a slew of 20+ custom baked models of all sizes, in various fully custom multi-model harnesses that use different bindings...
the harnesses themselves are just as important as the models...different harnesses give different responses with the same prompt, same model...
if you have the 20/mnth claude sub or codex, you really should be using that to build a good local harness for yourself... claude won't be 20$ forever
build the stack first! when you get that new comp with massive ram, youre already set, just run a larger model!
big cloud models are incredibly good at building and teaching about local ai!
have fun in the rabbit hole!
if you are memory constrained like me, check out my custom models https://huggingface.co/macwhisperer
by HappySweeney
0 subcomment
- I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
- I have been heavily relying on Qwen3.6-27B-UD-Q4_K_XL.gguf -model and Pi agent (https://pi.dev/) for local tasks and coding. I have used llama-cpp-turboquant fork with some custom cherrypicked MTP patches from another fork.
I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.
I usually get somewhere 45-60 t/s. I believe that speed could be improved slightly by switching to ik_llama.cpp fork and Qwen3.6-27B-IQ4_NL.gguf -model but there's no turboquant support and it's with some other tradeoffs too.
by GodelNumbering
0 subcomment
- As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.
by blurbleblurble
3 subcomments
- My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
by pianopatrick
0 subcomment
- I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.
Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."
Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.
Like "The Local AI challenge"
by cheekygeeky
0 subcomment
- Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
- But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?
One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.
As a matter of fact, think about these operations, api endpoints, observe their output.
These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.
I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.
The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.
Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.
I am sure that if you change your perspective you will start noticing the scale of the "magic".
- I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogies
but perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)
I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
by bravetraveler
0 subcomment
- I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.
- We have set up two DGX Sparks at work and are self sufficient for our AI needs. It is not SOTA, but it works really well for our needs. No matter what happens around cloud-hosted AI in the future, we will have decent in-house AI without further investments or expenses. We are a company of 24 people.
by big-chungus4
0 subcomment
- I can run Qwen3.6-35B-A3B at 20 TPS on my laptop with RTX 5070 Ti, with partial offloading to RAM. But the most I do is mess with it when I'm bored. I do coding by hand, but I often run autoresearch loops using free models, right now it's MiMo code. Autoresearch often requires my GPU, so it wouldn't be feasible to do when all of my GPU is used up by a local model. For mundane tasks like extracting and formatting specific structured text, I use Gemini in Google search
by neuropacabra
0 subcomment
- I went for this one https://huggingface.co/yuxinlu1/gemma-4-12B-coder-fable5-com... and seems very fast, resonable...don't expect 100% replacement, but a lot of things can be done with local LLMs today.
- Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
- I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...
Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
- I have tried it and I use it. I think it's going to become the standard way of operating, especially when they start charging us an API fee, which is supposedly the real cost. But of course, with how much they charge for the token and depending on the model, there are so many factors that I think the future is heading towards local models. I believe there are good models out there, and the key is the concept of "pruning," where you select the layers that interest you most and try to reduce the hardware cost of these types of models. The Qwen and Gemma models have been discussed here, but Kimi, which is a fairly powerful model with an efficient pruning system, could be your perfect free co-pilot in terms of coding, and could coexist with the more powerful Opus or Gemini models. The key concept is skills that make this process transparent.
by mitchell_h
3 subcomments
- Tried. The context windows just weren't big enough.
- Not replaced but supplemented. For off-line coding current setup is pi + ds4-server + DeepSeek-V4-Flash REAP25 (on M2 Max 96gb). For simpler programming related (e.g. text2sql) as well as synthetic data generation, current best for me is llama.cpp + Gemma-4-26B-A4B (on gpu 7900xtx 24gb; sometimes nemotron-cascade-2-30b-a3b for 1M context). That and (dabbling now) auto-research uses lots of tokens. Used to get paused running out of token quotas all the time. The 1st local model I found somewhat useful to me was glm-4.7-flash, and it's gotten way better since. Recently between OpenCode Go choice of models at many price points, and DeepSeek-V4 dropping the IQ/$$$ by multiples, have become less reliant on local llms for this auxiliary work. Claude I use but with Zai GLM-5.2 subscription. And maintain GPT subscription for quality models.
- Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.
- I replaced Claude with DeepSeek V4 Flash via API. Not local, but 95% the quality at 5% the price. Close enough.
- Yes, we use Qwen 3.6 27B Q6_K. We use it on Radeon R9700 32GB and it delivers 50tps with MTP. We compare it to Sonnet from 4-6 months ago when it comes to output. Totally usable for daily coding.
by CuriousRose
0 subcomment
- An equally important issue with local AI use (not coding specific) is ensuring that the harness has fast and up to date data if recency is important in your querires (new package features, docs, etc). Hosted models do web search incredibly well and I think this is a huge part of output quality.
I don't use local hosted models anymore due to hardware contstraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.
On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.
- Firecrawl - fast web scraping
- SearxNG - metasearch
- CloakBrowser - tursile bypassing Playwright alternative
If you wanted to get fancy with the proxy rotation, you could setup numerous instances of Playwright each with their own Mullvad wireguard key in different locations.
- I think nearly everyone mentioned Qwen, so my turn I guess. Qwen 3.6 35B Q8 (MTP), on a Strix Halo, with llama.cpp. Around 40-50 t/s. Really great pefromance, I get always suprised by its capability. I used with forge-code directly in zsh. For long context 150k+) it start degrading and forgetting.
by bijowo1676
0 subcomment
- One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etc
but then use cheap/local model to implement the specs.
Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code files
but this requires second and third passes, to smooth out the rough edges
has anyone tried that?
- I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run and am using it via VSCode. A big difference made the system prompt improving the tool integration (I asked GPT for guidance on that). Before that it was not making changes but regenerating code often messing up than helping.
I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.
What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
by SupLockDef
1 subcomments
- Local isn't new for me. I am still coding my stuff, but Qwen3-coder:30b on my old rig with a gtx 1070 16gb RAM does wonders for me.
I mostly use it as a google search if I forget a thing, or doing the boilerplates.
I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.
$0.00 / month. That's the budget.
- I have an RTX 4060 12gb vram. Qwen3.6 35b. I stopped paying for Github Copilot. But I wouldn't say I replaced frontier models with a local one. I still have some dollars in my openrouter when I need to. Also to get interactive agentic coding speeds I need a high tps. So my quant is very small. And I would say a coding harness that is fully extensible is a must to create fully custom workflows tailored for low specs. I use pi (not perfect, still found some hard coded, non-extensible parts)
by ryandrake
2 subcomments
- Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
by anubhav200
1 subcomments
- Yes, llama.cpp, qwen27b, 35b, claude code. Llama-cpp-manager for managing llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager)
- I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.
by BiraIgnacio
0 subcomment
- I tried for a bit, with llama.cpp + Qwen + Mac Pro but the results were very poor (both quality and speed).
I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
- This needs atleast a 30b model or Mr higher and so for most folks it means purchasing a new machine. Given the ram costs this may be become prohibitive and a monthly subscription may feel better roi
- I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
- Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
by NetOpWibby
1 subcomments
- I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
by zaptheimpaler
2 subcomments
- I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
- I have not. We use openspec with our projects at work. To try and simulate a local rig without spending big cash. I use the hosted models and pay for them with the latest popular local model.
Most small local models don't get tool calling right, however the larger models are now doing this correctly now.
One thing local has not accounted for, is most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
- There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
by michaelhoney
0 subcomment
- don't think she has posted here, but Vicki Boykis blogged about this today:
https://vickiboykis.com/2026/06/15/running-local-models-is-g...
by shironnnn_
0 subcomment
- I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.
Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.
The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
by adam_patarino
0 subcomment
- Yes! And we are using it to build Rig AI to make it easier for anyone to do it too!
We are post training qwen 3.6 and combining it with a custom inference engine and harness to get the most out of a smaller model.
- Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significant lower than the cloud offering.
Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
by cloudengineer94
0 subcomment
- I have tried in both my Mac and my desktop (Rtx 5090) with Gemma 4 and Qwen and so far nothing is quite replacing Claude Code or Kiro for spec driven architecture & development.
I do think we are slowly getting Gemma 4 was a big jump
- Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.
I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.
https://github.com/ndom91/llama-dash
by anonymousiam
0 subcomment
- This was posted shortly after your Ask HN post:
My Homelab AI Dev Platform
https://news.ycombinator.com/item?id=48542433
- Running AMD Lemonade as the daily rig, Started with Ollama then over to LMStudio and now standardized on AMD Lemonade which has been helpful to monitor cRAM, CPU, GPU and gRam. The multi-models on Lemonade make it straight forward to run a stack for LLM, Voice to Text, NPU, and Image Generation. Platform also works with Nvidia, Apple, Intel and AMD chip sets.
- https://hugston.com/models/anthropics-fable-qwen36-35biq4-nl
- I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.
[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
- I haven't but I'm on the path to attempt this. I want to get a DGX Spark and will be trying Qwen and Kimi.
- I have a mac with loads of ram but I cannot even justify the electricity cost when deepseek is so better than anything I can run locally (including heavy quantizations of deepseek itself) and costs pennies. It's crazy how cheap it is!
- I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.
by Departed7405
0 subcomment
- I tried but OpenCode doesn't have great local model integration. It's just a pain in the ass to set-up.
Plus, you now have zero-data retention models, so the privacy argument has kind of faded.
- Yes, I have.
1. Two RTX 3090s in Linux 22.04
2. Running Qwen3.6-27B Q6_K_XL GGUF
3. Using my own harness AZPal, I build myself, also wire it with Hermes Agent, works fine
4. Many times it solve problem that Codex can't solve
https://medium.com/p/f237d575e861
by thesuperbigfrog
0 subcomment
- Here is a nice setup that works well:
https://discourse.ubuntu.com/t/use-workshop-to-run-opencode-...
- I use Qwen 3.6 35B A3B for agentic coding using GitHub Copilot Extension for VSCode. Mac Mini 128GB as the hardware. Seems reasonable for that model size, but I notice looping issue when problem becomes too big to solve. You can use it to do something that you know how to do (saves time).
- Yes, for client projects where privacy and security is important, but no enterprise contract:
Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.
I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.
- Will the inevitable M5 releases from Apple change this equation in any meaningful way?
I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
- mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.
Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.
Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.
Performance as described in the tables in the readme here:
https://github.com/antirez/ds4
...with a bit less than half that at "low power" (30w). Both are usable.
by derekered
1 subcomments
- I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48BG RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.
- Asking for feedback:
Sorry for hijacking the convo, but you (with local models) are my target audience in terms of hardware.
Is anybody willing to test my new app https://document.bot? It is like Cursor IDE but custom harness for knowledge work (PDF's, MS Office files etc).
You can connect your existing offline LLM models through LMStudio, Ollama, or app managed LLM models (Qwen3.5, Gemma 4, etc)
Might have to make a new Ask HN post for this, but again, you are users with good hardware setups.
- I would love to do this if it didn't require such a huge amount of RAM. And the difference in quality is worth it to pay $20-$100/mo if data retention doesn't matter to you.
- So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?
How much does this ware out the hardware?
Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
- Well not local but using Mistral Vibe CLI for a fixed 17€/m illimited is an incredible value for money.
by jmichaelson
0 subcomment
- I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).
I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
- - What would you say is the best model for coding at the moment that can run on a high end consumer GPU? (Assume an RTX 3090/4090 is available.)
- What "stack" do you recommend? Llama.cpp + OpenCode?
- Yes qwen 3.5 122b+ dgx is working wonders and I ko longer subscribed to any cloud api now.
I will post a project which I accomplished in 9 days of long horizons running.
- I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
by deepvibrations
0 subcomment
- The TLDR is that the best setup is probably Mac Studio (128GB RAM) / MacBook (36GB) with Qwen 3.6 35B (3B active params), or Qwen 3.5 122B model (this one is slow though).
These models are still very capable with good hardware, but they do lack the deep reasoning of major models and require more precise prompting.
So unless you really need the privacy, or have a lot of excess cash, it is not recommended, as considering the price of major models, it's just extremely cost inefficient!
by carlossouza
0 subcomment
- This should be a recurrent question posted every month
- Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
- I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
- I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
- My experience so far has been Qwen3.5:32b-a3b-coder via Claude Code on a MBP 64gb M4, and a MBP 32gb M5. Just found about qwen3.6 so downloading that currently.
on 64GB M4 I find it's able to do things fairly well. The few times I run out of tokens, I hop over to that and I'm mostly unimpeded. I compare it to the Haiku models, where you have to go in and be surgical about your changes, or like others have said, guide a junior.
on 32GB M5, I find that it works, but around the 30% ctx threshold it slows down quite substantially, so more need to be surgical in your requests. I'll often just have my IDE open and Claude. But maybe I've been too comfortable talking to Sonnet/Opus and so forget I need to be more deliberate in my requests.
My finding here is that the harness is a big part of the problem. CC seems to be very good with Qwen in my experience. Better than OpenCode.
I also run DeepSeek for some other non-structured data tasks and to generate a to-do out of that. That's not coding, so won't go into that, other than to say it's very competent as a small model left to run in the background and automate small parts of my life and process.
tl;dr it's totally doable on a 32gb mbp using ollama, but be precise in your requests and guidance.
by kristianpaul
0 subcomment
- Qwen3.6 35B on gigabyte aitop (spark clone) but be very specif what you ask and how should be solved
Nemotron super 3 110B works well for 1M context long vibecoding sessions
I also use Pi harness with no extension
by fortyseven
0 subcomment
- I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
by mark_l_watson
0 subcomment
- I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.
I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.
For me, the problem with all local LLM-basic coding agents is slow runtime.
[1] https://leanpub.com/read/local-coding-agents
- i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
- I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
- yes
harness - pi+custom extension for subagents
model - qwen3.6 35ba3b q4km
hardware - intel arrow lake with 32gb ram
server - llama.cpp vulkan
performance - 15-18t/s generation 50-150t/s pp
planning and task creation is still using claude/gpt but they dont touch the code. All coding is done using this setup.
Example of project made using this setup easyanalytica.com , its of medium size complexity
- Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighbor are around 1T params, so forget about running at home.
It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.
There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
- I wonder what languages people are using; I imagine smaller models would be decent at bash/python but significantly worse at something like rust
- Of course.
Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.
Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.
- I wish I could. But, the hardware requirements are just too expensive for me.
by AH4oFVbPT4f8
1 subcomments
- Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
by sermakarevich
0 subcomment
- yes.
- smarter models to create tasks
- local qwen3.6:36B for tasks execution
here is how in details https://news.ycombinator.com/item?id=48520757
by SkitterKherpi
0 subcomment
- It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.
by ElenaDaibunny
0 subcomment
- we've been building local agents with vision models, works great for gui automation but coding tasks still need cloud models for reliability
- tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.
by agentbc9000
1 subcomments
- Kimi K2.7 is very good - i have been testing it and its very very good, Fable 5 level of goodness.
- Using qwen3.6 27b locally with Claude code, it works well for simple coding tasks
- I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.
Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
- Is anyone managing to do this on a Mac with a measly 8GB ? Asking for a friend.
by SugarReflex
0 subcomment
- Is anyone using Aider?
Is there any decent CLI alternatives to it?
- Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.
I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.
Power usage is also totally not an issue, AI workload is very different from gaming.
tldr
llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
- Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
- Related: Are there any viable distributed AI models?
Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.
Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.
Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
by anubhav200
0 subcomment
- Yes, llama.cpp, qwen 27b and 35b, llama-cpp-manager for managing model configs.(https://github.com/anubhavgupta/llama-cpp-manager)
- I tried, but honestly, all end with lack of tool or configuration or hardware config. None of them work for me. At end paid apis only providing productivity else free local end with inveting time and less effecient work
- No, but I use GLM5.1 instead of Claude/GPT.
- Do you recommend Ollama or bare llama.cpp?
- Anyone here running a tinygrad?
- Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.
- I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).
For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").
That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.
One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
by sometimelurker
0 subcomment
- yeah I use one one the small MTP qwens and pi
- Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.
- I’d be surprised if this was useful for much. Claude is already almost too slow to do anything serious I’d consider using it for outside of grunt work without parallelizing.
The only reason it’s economical is because it’s massively discounted if you’re not paying API rates.
by hacker_homie
0 subcomment
- I do qwen3.6 on an amd ai max laptop getting about 6-10tok/s it’s slow enough that I can follow along.
It has issues with design and large piles of code.
Otherwise it’s a good programming buddy.
by lowbloodsugar
0 subcomment
- If you want to try it out before dropping $$$ on a GPU, just run something that would fit on your target GPU but online.
by platevoltage
0 subcomment
- I run very small models locally for code completion and writing boiler plate. I still use Claude in a web browser on occasion since it's free, but the second that goes away, I'll be done with it. They get none of my money.
- Not with a local one, but I moved to DeepSeek v4.
Albeit I plan to move to local ones when I will get my hands on a 256+ GB macbook.
Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.
by jay_kyburz
0 subcomment
- Can anybody let me know how just chatting with Qwen3.6 on a Strix Halo 128GB
If I give it a page of context, can it write a linked list or identify a bad line of CSS?
Is there anywhere online I can chat with a model I could be running at home to see how good it is?
by thrownaway561
3 subcomments
- I just use DeepSeekV4 Fast... It's cheap as hell. Currently my monthly usage has been
67M Ouput
51M Input
Total $0.83 dollar.
I honestly don't understand why people just don't use DeepSeek.
by jeffrallen
0 subcomment
- I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.
- I tried to. I just couldn't get over how it made my otherwise whisper quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adopting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my 20/month Claude subscription is enough and I find myself using that more and more.
by queeshonda
0 subcomment
- Yes, your mom
by deployementeng
0 subcomment
- partially yes.
by dude250711
2 subcomments
- Yes, running a local model on a natural wetware substrate here.
Recommended setup: plenty of nutrients, some caffeine and a quiet environment.
Performance - not currently measured in tokens: roughly average.
by DetroitThrow
0 subcomment
- No.
- pre-replaced it with combo of my brain, vim, an assortment of other CLI/TUI tools, etc
by salutonmundo
0 subcomment
- it's called your damn brain.
- never started. using wither qwne3-xoder-nezt or qwen3.6 35b
if youre shoopping for a new pc, very easy to justify 128gb vram
by sanchitmonga22
0 subcomment
- [flagged]
by hectortemich
0 subcomment
- [flagged]
by echoforgex
0 subcomment
- [dead]
by o2zer0cool
0 subcomment
- [flagged]
by kordlessagain
0 subcomment
- [flagged]
- [flagged]
by HardAnchor
0 subcomment
- [flagged]
- [flagged]
by thousandflowers
0 subcomment
- [flagged]
by huangchengsir
0 subcomment
- [flagged]
- [flagged]
by daischsensor
0 subcomment
- [flagged]
by aplomb1026
0 subcomment
- [flagged]
- [flagged]
by fouadlvlup
0 subcomment
- [flagged]
- [flagged]
- [dead]
- [flagged]
- [flagged]
- [flagged]
by Pranavsingh431
0 subcomment
- [flagged]
- [dead]
- [flagged]
- [flagged]
by adam_patarino
0 subcomment
- [dead]
by startuphakk
0 subcomment
- [dead]
by adam_patarino
0 subcomment
- [dead]
by ericmaciver
0 subcomment
- [dead]
by adam_patarino
0 subcomment
- [dead]
by codelong888
0 subcomment
- [dead]
by nicechianti
0 subcomment
- [dead]
by iluvcommunism
0 subcomment
- [dead]
- [dead]
by aiexpo_app
0 subcomment
- [flagged]
- for crying out loud... why would you deprive yourself?
- Anyone doing it with a "rent a GPU over the network" path? Is that at all cost effective for any use case?
- Local? No.
Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.
Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.
Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200
- Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations