- Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.
Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and getting harder to find by the day. If you remove enough abstractions and code directly against the exact hardware and model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (autoresearch style), empirically testing speed and quality; a rough sketch of the kind of loop I mean is below.
The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
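A minimal sketch of the loop I have in mind (the `propose`/`build`/`bench`/`check` callables are hypothetical placeholders for an LLM-driven kernel tweak, the compile step, a tokens/sec benchmark, and an output-quality check, not part of any existing project):

```python
def optimize_loop(baseline_cfg, propose, build, bench, check, iters=200):
    """Greedy autoloop: keep a config change only if it is faster and still correct."""
    best_cfg = baseline_cfg
    best_tps = bench(build(best_cfg))    # tokens/sec of the baseline engine
    for i in range(iters):
        cfg = propose(best_cfg)          # e.g. ask a frontier model for a kernel tweak
        engine = build(cfg)
        if not check(engine):            # reject anything that changes outputs
            continue
        tps = bench(engine)
        if tps > best_tps:
            best_cfg, best_tps = cfg, tps
            print(f"iter {i}: accepted, {tps:.1f} tok/s")
    return best_cfg, best_tps
```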
- I think that, especially now that SOTA AI can optimize kernels, more people should try their hand at building better inference for their specific hardware.
I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs (123 FP16 TFLOPS / INT8 TOPS, 864 GB/s memory bandwidth), but it has had notoriously bad support from both AMD (ROCm) and llama.cpp. (Some back-of-the-envelope math on what those numbers imply is at the end of this comment.)
Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. After a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill 20% faster and decode 50% faster than the best llama.cpp numbers for the Qwen3.6 MoE.
I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
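For anyone curious why the roofline numbers above matter, here's the back-of-the-envelope math: single-stream decode is memory-bandwidth bound, so the ceiling is roughly bandwidth divided by the bytes of weights read per token (the ~30 GB weight footprint below is an assumed illustration, not a measurement):

```python
# Back-of-the-envelope decode ceiling for a memory-bandwidth-bound model.
mbw_gb_s   = 864   # W7900 memory bandwidth, GB/s
weights_gb = 30    # ASSUMPTION: ~30 GB of INT8 weights touched per token (dense model)
print(f"decode ceiling: ~{mbw_gb_s / weights_gb:.0f} tok/s")   # ~29 tok/s
# For a MoE model only the active experts are read each token, so use the
# active-parameter footprint instead of the full model size.
```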
by kristianp
2 subcomments
- Hmm, I'm unable to order more than 96GB of RAM for a Mac Studio, even with the M3 Ultra or M4 Max. Is this AU-specific? With the MacBook Pro, however, I can spec 128GB on the M5.
https://www.apple.com/au/shop/buy-mac/mac-studio
by ZeroGravitas
0 subcomments
- Did I miss a simple motivating benchmark or goal?
I'm assuming this is faster, and/or lets you run a bigger, smarter model than the generic toolchain would, but as far as I can see it doesn't spell out the measured improvement over that baseline, or the improvement that's expected.
Presumably you can work it out based on the numbers given if you have the relevant comparison values.
- This is so sick. I'm really curious to see what a focused, months-long effort on optimizing a single open-source model can look like. Not only on the inference-serving side, but also on the harness-optimization side: building custom workflows to narrow the gap between what frontier models can infer and deduce and what open-source models natively lack due to size, training, etc.
- A random, funny, interesting, and telling data point: my MacBook M3 Max peaks at around 50W of power draw while DS4 is generating tokens at full speed...
by danborn26
1 subcomment
- It is great to see antirez working on local inference. A lightweight implementation for Metal makes running these models locally much more accessible.
by shivnathtathe
0 subcomments
- Been working on local-first LLM observability for exactly this use case — tracing local model pipelines without sending data to cloud. Happy to share if anyone's interested.
- Very impressive. One thing that seems odd to me is that it takes something like 4 minutes before it starts responding to a large input? I don't use Mac hardware for LLMs, but that is quite surprising and would seem to be a pretty large stumbling block for practical usage.
Edit: the caching story makes a lot more sense for regular usage (a rough sketch of the idea follows the quote):
> Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
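For anyone wondering how a disk KV cache can work, here's a minimal sketch of the general idea (hypothetical names, exact-match only for simplicity; I assume a real implementation matches the longest saved prefix and prefills just the new suffix):

```python
import hashlib, os, pickle

def cached_prefill(prompt_tokens, kv_dir, prefill):
    """Reuse a saved KV state for a prompt we've already prefilled; pay the
    expensive prefill only on a miss. `prefill(tokens)` is a placeholder."""
    os.makedirs(kv_dir, exist_ok=True)
    key = hashlib.sha256(repr(prompt_tokens).encode()).hexdigest()
    path = os.path.join(kv_dir, key + ".kv")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)        # hit: skip minutes of prefill
    kv = prefill(prompt_tokens)          # miss: pay the full cost once
    with open(path, "wb") as f:
        pickle.dump(kv, f)
    return kv
```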
by nazgulsenpai
1 subcomment
- I keep seeing DS4 and, in that order, my brain interprets it as Dark Souls 4 (sadface), DualShock 4, DeepSeek 4.
- Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is reading context. Not incremental reading, as in an ongoing chat session, since the KV cache handles that, but reading a large block at once, like when you paste a big file. It can take minutes (rough math on why below).
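Rough math on why it takes minutes (all three numbers below are assumptions for illustration, not measurements of any particular Mac or model):

```python
# Prefill is roughly compute-bound at about 2 * active_params FLOPs per prompt token.
active_params   = 15e9     # ASSUMPTION: active parameters of a MoE model
prompt_tokens   = 100_000  # e.g. a big pasted file plus agent context
effective_flops = 20e12    # ASSUMPTION: ~20 TFLOPS sustained on an M-series GPU
seconds = 2 * active_params * prompt_tokens / effective_flops
print(f"~{seconds / 60:.1f} minutes of prefill")   # ~2.5 min with these numbers
```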
- The beauty of it is that you can clone it and run make, and it just works. No Python shenanigans. What a blessing for this ecosystem.
- Was excited until I realized DS Flash is still enormous. Oh well... glad it exists anyway, and happy to see antirez still doing fun stuff.
- I am curious about it producing fewer tokens except in max mode. I love DeepSeek V4 Flash and I use it extensively; it's so cheap I can use it all day and still not exhaust my $10 OpenCode Go subscription. I always use it in max mode because of this, but now I wonder whether I should use high instead.
- Finally someone who pays proper respect to the GGML ecosystem.
by sourcecodeplz
0 subcomments
- Great project!
This is also a fine example of a vibe-coded project with purpose, as you acknowledged.
by brcmthrowaway
0 subcomments
- How does this compare with oMLX?
by andrefelipeafos
0 subcomments
- [flagged]
- [flagged]
- [dead]
by happyPersonR
4 subcomments
- So I'm just gonna ask a question, probably will get downvoted.
I know this is Flash, but…
Other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and something like more than 10% of GDP?
Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK.
Sidenote: shout out antirez, love my Redis :)