by moqizhengz
8 subcomments
- Running 3.5 9B on my ASUS 5070 Ti 16GB with LM Studio gives a stable ~100 tok/s.
This outperforms the majority of online LLM services, and the actual output quality matches the benchmarks.
This model is really something; it's the first time I've ever had a usable model on consumer-grade hardware.
- I'm still a bit confused, because it says "All uploads use Unsloth Dynamic 2.0", but then the available options for 4 bits are:
IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB
There's no explanation of what they are or what tradeoffs they have, yet the tutorial explicitly uses Q4_K_XL with llama.cpp.
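One rough way to compare those files is bits per weight: file size × 8 / parameter count. A quick sketch, where the ~9B parameter count is my assumption (the file sizes are from the list above):

```python
def bits_per_weight(file_size_gb, n_params):
    """Approximate bits stored per model weight for a quantized file."""
    return file_size_gb * 8e9 / n_params

# File sizes (GB) from the upload list; 9e9 params is an assumption.
quants = {
    "IQ4_XS": 5.17, "IQ4_NL": 5.37, "Q4_0": 5.38, "Q4_K_S": 5.39,
    "Q4_K_M": 5.68, "Q4_1": 5.84, "UD-Q4_K_XL": 5.97,
}
for name, size in quants.items():
    print(f"{name:>11}: {bits_per_weight(size, 9e9):.2f} bpw")
```

As I understand it, the bigger files within a tier keep more tensors at higher precision, and the UD- prefix marks Unsloth's dynamic quants, which selectively keep sensitive layers at higher bits.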
- I'm using a Mac mini M4 16GB, and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my test with Qwen3.5-4B-UD-Q4_K_XL shows it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions.
I understand that each user has their own specific needs, but it would be nice to have a place that lists typical models/hardware along with their common config parameters and memory usage.
Even on the dedicated Reddit channels it's a bit of a nightmare: lots of talk, but no concrete, clear config/usage examples.
I've been following this topic heavily for the last 3 months and I see more confusion than clarification.
Right now I'm getting good cost/benefit results with the Qwen CLI with a coder model in the cloud, and I'm constantly watching for when a local model on affordable hardware with environmentally friendly energy consumption arrives.
- My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. When reading these percentages, though, consider that the no-think setup is much faster and may be more practical for most situations.
1. DeepSeek API -- 100%
2. qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
3. qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
4. qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
5. qwen3.5:27b-q8_0 (thinking) -- 75.3%
I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could iterate as an agent.
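The scoring scheme described above can be sketched as: have the judge rate each reply per problem, then express a model's total as a percentage of the baseline's total. This is a hypothetical reconstruction of the setup, not the commenter's actual harness:

```python
def relative_score(model_ratings, baseline_ratings):
    """Express a model's summed per-problem judge ratings (e.g. 0-10)
    as a percentage of the baseline's summed ratings."""
    assert len(model_ratings) == len(baseline_ratings)
    return 100 * sum(model_ratings) / sum(baseline_ratings)

# e.g. two problems where the candidate scores 8 and 9 vs a 10/10 baseline:
print(relative_score([8, 9], [10, 10]))  # 85.0
```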
by d4rkp4ttern
1 subcomments
- For every new interesting open model I try to test PP (prompt processing) and TG (token gen) speeds via llama-cpp/server in Claude Code (which can have at least 15-30K tokens of context due to the system prompt, tools, etc.), on my good old M1 Max 64GB MacBook.
With the latest llama-cpp build from source and latest unsloth quants, the TG speed of Qwen3.5-30B-A3B is around half of Qwen3-30B-A3B (with 33K tokens initial Claude Code context), so the older Qwen3 is much more usable.
Qwen3-30B-A3B (Q4_K_M):
- PP: 272 tok/s | TG: 25 tok/s @ 33k depth
- KV cache: f16
- Cache reuse: follow-up delta processed in 0.4s
Qwen3.5-35B-A3B (Q4_K_M):
- PP: 395 tok/s | TG: 12 tok/s @ 33k depth
- KV cache: q8_0
- Cache reuse: follow-up delta processed in 2.7s (requires --swa-full)
Qwen3.5's sliding window attention uses significantly less RAM and delivers better response quality, but at 33k context depth it generates at half the tok/s of the standard-attention Qwen3-30B. Full llama-server and Claude Code setup details for these and other open LLMs here:
https://pchalasani.github.io/claude-code-tools/integrations/...
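The f16 vs q8_0 KV cache difference above is easy to estimate with a back-of-envelope formula. The layer/head numbers below are made up for illustration, not Qwen's actual architecture:

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_val):
    """K and V each store n_layers * n_kv_heads * head_dim values per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_val * n_tokens

# Hypothetical 48-layer model with 4 KV heads of dim 128, at 33k context:
f16 = kv_cache_bytes(33_000, 48, 4, 128, 2)   # f16  = 2 bytes/value
q8 = kv_cache_bytes(33_000, 48, 4, 128, 1)    # q8_0 ~ 1 byte/value
print(f"f16: {f16 / 2**30:.2f} GiB, q8_0: {q8 / 2**30:.2f} GiB")
```

Sliding-window attention shrinks this further, since SWA layers only retain the last window of tokens rather than the full context.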
- I've been finding it very practical to run the 35B-A3B model on an 8GB RTX 3050, it's pretty responsive and doing a good job of the coding tasks I've thrown at it. I need to grab the freshly updated models, the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
- How does one choose between "fewer parameters and less quantization" vs "more parameters and more quantization" ?
by Curiositry
5 subcomments
- Qwen3.5 9b seems to be fairly competent at OCR and text formatting cleanup running in llama.cpp on CPU, albeit slow. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama), on an old 1650 Ti with 4GB VRAM (it tries to allocate too much memory).
by PeterStuer
1 subcomments
- I am running both Qwen-coder-next and Qwen 3.5 locally. Not too bad, but I always have Opus 4.6 checking their output, as the Qwen family tends to hallucinate nonexistent library features in amounts similar to the Claude 3.5 / GPT-4 era.
The combo of free long running tasks on Qwen overnight with steering and corrections from Opus works for me.
I guess I could just do Opus/Sonnet for my Claude Code back-end, but I specifically want to keep local open weights models in the loop just in case the hosted models decide to quit on e.g. non-US users.
- I’ve been benchmarking GGUF quants for Python tasks under some hardware configs.
- 4090 : 27b-q4_k_m
- A100: 27b-q6_k
- 3*A100: 122b-a10b-q6_k_L
Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s at good quality.
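For reference, presence penalty is the flat, once-per-token variant of repetition control: every token that has already appeared gets a fixed amount subtracted from its logit. A minimal sketch of the common scheme, not any particular engine's sampler code:

```python
def apply_presence_penalty(logits, generated_ids, penalty):
    """Subtract a flat penalty from the logit of every token id that has
    already been generated at least once (frequency is ignored)."""
    penalized = dict(logits)
    for tok in set(generated_ids):
        if tok in penalized:
            penalized[tok] -= penalty
    return penalized

# Token 1 appeared twice but is only penalized once:
print(apply_presence_penalty({0: 1.0, 1: 2.0}, [1, 1], 0.5))  # {0: 1.0, 1: 1.5}
```

Raising it discourages loops, but it can also push the model away from legitimately repeated identifiers, which may be why code generation is sensitive to the default value.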
by devonkelley
1 subcomments
- The gap between "this model can answer my question" and "this model can reliably execute a multi-step task in a production loop" is where all the interesting problems are. Local models are getting surprisingly good at the first thing. The second thing is still a frontier model problem and honestly the frontier models barely do it either.
- For roughly equivalent memory sizes, how does one choose between the bit depth and the model size?
by computerex
0 subcomment
- You can use my new golang inference engine to run variants of Qwen 3.5 faster than llama.cpp:
https://github.com/computerex/dlgo
- We ran it locally on a free H100, with vLLM and opencode, and it performed awfully. Now we are running gpt-oss-120b, which is better but still far behind Opus 4.6, the only coding model that is better than our most experienced senior dev. gpt-5.3-codex is more at the Sonnet level on complicated C code: bearable, but still many stupidities. gpt-oss is hilariously stupid, but might work for simple TypeScript, React, or Python tasks.
For vision, Qwen is the best; it's our go-to vision model.
by latrine5526
0 subcomment
- I have a 5090d and got ~140 token/s output when running qwen-3.5-9b-heretic in lmstudio.
I disabled the thinking and configured the translate plugin on my browser to use the lmstudio API.
It performs way better than Google Translate in accuracy. The speed is a little slower, but sufficient for me.
by benbojangles
0 subcomment
- I'm running Qwen3.5:0.8b locally on an Orange Pi Zero 2W using llama.cpp; it runs just fine on CPU only. When I want a Vulkan GPU, I run qwen3.5:2b locally on a Meta Quest 3 with zeroclaw, and I've saved myself hundreds of $$$ on buying a low-power computer. I recommend people stop shopping around for inflated Mac minis and look at getting a used Android phone to load local models on.
- Anyone providing hosted inference for 9B? I'm just trying to save the operational effort of renting a GPU since this is a business use case that doesn't have real GPUs available right now. I don't see the small ones on OpenRouter. Maybe there will be a runpod serverless or normal pod template or something.
Also, does 9B (at 8-bit or 6-bit) run with very low latency on a 4090?
- What would be optimal HW configurations/systems recommended?
- How does 397B-A17B compare against frontier? Did anybody try? Probably needs serious HW that most people don't have.
- So many variants of these models. The GGUFs from Unsloth don't work with Ollama. Perhaps wait a bit for the latest llama.cpp to be picked up by downstream projects.
If you're on a 16GB Mac mini, what's a good variant to run?
by RandomGerm4n
1 subcomments
- 9B at 4 bits runs at around 60 tok/s on my RTX 4070 with 12GB VRAM, and 35B-A3B runs at around 14 tok/s with partial offloading. For roleplaying I prefer the faster 9B version, but for coding tasks neither is really usable; Claude is still way better, especially if you manage to persuade your employer to give you unlimited access.
by veritascap
0 subcomment
- How does scaffolding work with these local models? Skills, commands, rules, etc. do they all work similarly? (It’s probably obvious but I haven’t delved into local LLMs yet.)
- > you can use 'true' and 'false' interchangeably.
made me laugh, especially in the context of LLMs.
- Using llama.cpp and the 9b q4 xl model, it is in Thinking mode by default and runs without stopping. The only way to force it to stop is to set the thinking budget to -1 (which is weird, as the docs say 0 should be valid).
by singpolyma3
1 subcomments
- Does anyone know what the quantization is for Ollama models? They always just list the parameter count.
I'm also a bit unsure of the tradeoffs between a smaller quant vs. a smaller model.
- I wanted to submit a fix to the site as I couldn't compile llama.cpp without `sudo apt install nvidia-cuda-toolkit-gcc`. Anyone know where to do that?
- Will it run on an old 4xV100 Tesla rig? I'm looking for something to start with, and this could be available, but I'm too inexperienced to understand all the fp* nuances.
- It's also working in Ollama now. The 27B model is absolutely cracked on an RTX 3090. Feels close to frontier American models for writing code.
- It's truly an amazing model from the small models all the way to 397B. I wish they had released one as a FIM model.
- Qwen 3.5 is a really good local model. I'm using it with personal assistant(https://github.com/daegwang/atombot) every day!
- Local models, particularly the new ones would be really useful in many situations. They are not for general chat but if tools use them in specific agents, the results are awesome.
I built https://github.com/brainless/dwata to submit to the Google Gemini Hackathon, focused on an agent that would process email content with regex to extract financial data. I used Gemini 3 Flash.
After submitting to the contest, I kept working on the branch reverse-template-based-financial-data-extraction, using Ministral 3:3b. I moved away from regex detection to reverse template generation: like Jinja2 syntax, but generated in reverse from the source email.
Financial data extraction now works OK-ish, and I am constantly improving it with the aim of launching soon. I will try Qwen 3.5 Small, maybe the 4B model. Both Ministral 3:3b and Qwen 3.5 Small:4b will fit on the smallest Mac Mini M4 or an RTX 3060 6GB (I have these devices). dwata should be able to process all sorts of financial data, transactions and metadata (vendor, reference #), at a pretty nice speed. Keep it running for a couple of hours and you can go through 20K or 30K emails. All local!
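The reverse-template idea can be sketched roughly like this: turn a template with placeholders into a regex with named groups, then run it over incoming text. This is a hypothetical template_to_regex, not dwata's actual code:

```python
import re

def template_to_regex(template):
    """Compile 'Paid {{amount}} to {{vendor}}' into a regex whose
    named groups capture the placeholder values."""
    # re.split with a capturing group alternates literal text / placeholder names
    parts = re.split(r"\{\{\s*(\w+)\s*\}\}", template)
    pattern = "".join(
        re.escape(p) if i % 2 == 0 else f"(?P<{p}>.+?)"
        for i, p in enumerate(parts)
    )
    return re.compile(pattern)

rx = template_to_regex("Paid {{amount}} to {{vendor}} on {{date}}.")
m = rx.search("Paid $42.10 to ACME Corp on 2024-05-01.")
print(m.groupdict())  # {'amount': '$42.10', 'vendor': 'ACME Corp', 'date': '2024-05-01'}
```

The appeal of the reversed direction is that an LLM only has to produce the template once per email layout; after that, extraction from similar emails is pure regex and needs no model at all.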
- Qwen3.5-27B works amazingly well with https://swival.dev now that the unsloth quants have fixed the tool calling issues.
I still like and mainly use Qwen3-Coder-Next, though, as it seems to be generally more reliable.
- I had an annoying issue in a setup with two Nvidia L4 cards where trying to run the MoE versions to get decent performance just didn't work with Ollama, seems the same as these:
https://github.com/ollama/ollama/issues/14419
https://github.com/ollama/ollama/issues/14503
So for now I'm back to Qwen 3 30B A3B, kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
- I have it running locally, but speed is a problem. I have the 35GB model running on a PC with 64GB, a fairly new processor and a mid-level GPU. Ask a question, go drink a coffee.
I mean, it's great that so many models are open-source and readily available. That is hugely important. Running models locally protects your data. But speed is a problem, and likely to remain a problem for the foreseeable future.
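The speed wall has a simple first-order explanation: token generation is memory-bandwidth bound, since each token requires streaming roughly the whole (active) model through RAM. A back-of-envelope sketch with illustrative numbers:

```python
def est_decode_tok_s(model_gb, bandwidth_gb_s):
    """Upper-bound estimate: tok/s ~ memory bandwidth / bytes read per
    token (~ whole model for dense, active experts only for a MoE)."""
    return bandwidth_gb_s / model_gb

# A 35 GB dense model on ~60 GB/s dual-channel DDR5 vs ~1000 GB/s GPU VRAM:
print(f"CPU RAM:  {est_decode_tok_s(35, 60):.1f} tok/s")
print(f"GPU VRAM: {est_decode_tok_s(35, 1000):.1f} tok/s")
```

This is why MoE models with few active parameters, and fitting the model entirely in VRAM, make such a dramatic difference in practice.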
- 1. How do you create images with a small 7-12B LLM?
2. How do you create a voice?
3. How do you earn a billion dollars in 2 weeks?
- a clear guide. thanks for that.