Hacker News

Qwen3 30B A3B Hits 13 token/s on 4xRaspberry Pi 5

340 points by b4rtazz

by rldjbpin

0 subcomment

how would llm-d [1] work compared to distributed-llama? is the overhead or configuration too much to work with for simple setups?
[1] https://github.com/llm-d/llm-d/

by dingdingdang

3 subcomments

Very impressive numbers. I wonder how this would scale on 4 relatively modern desktop PCs, say something akin to an i5 8th Gen Lenovo ThinkCentre; these can be had for very cheap. But as @geerlingguy indicates, we need model compatibility to go up, up, up! As an example, it would be amazing to see something like fastsdcpu run distributed, to democratize accessibility-to/practicality-of image gen models for people with limited budgets but large PC fleets ;)

by rao-v

2 subcomments

Nice! Cheap RK3588 boards come with 15GB of LPDDR5 RAM these days and have significantly better performance than the Pi 5 (and often are cheaper).
I get 8.2 tokens per second on a random orange pi board with Qwen3-Coder-30B-A3B at Q3_K_XL (~12.9GB). I need to try two of them in parallel ... should be significantly faster than this even at Q6.
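For a rough sense of what those sizes imply, here is a back-of-the-envelope sketch. It takes "30B" at face value as the parameter count and uses roughly 6.5 bits per weight as a stand-in for Q6; both numbers are assumptions for illustration, not measured file sizes.

```python
# Figure quoted above: Q3_K_XL file is ~12.9GB.
file_size_gb = 12.9
total_params = 30e9  # assumption: "30B" taken at face value

# Effective bits per weight implied by the Q3_K_XL file size.
bits_per_weight = file_size_gb * 1e9 * 8 / total_params

# Approximate size at a hypothetical ~6.5 bits per weight (stand-in for Q6).
q6_bits_per_weight = 6.5  # assumption for illustration
q6_size_gb = total_params * q6_bits_per_weight / 8 / 1e9

print(f"Q3_K_XL: about {bits_per_weight:.2f} bits per weight")
print(f"Q6 estimate: about {q6_size_gb:.1f} GB of weights")
```

Under these assumptions the Q6 weights alone come out larger than the RAM on a single board, which is why splitting across two boards is the interesting experiment.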

by behnamoh

8 subcomments

Everything runs on a π if you quantize it enough!
I'm curious about the applications though. Do people randomly buy 4xRPi5s that they can now dedicate to running LLMs?

by drclegg

0 subcomment

Distributed compute is cool, but $320 for 13 tokens/s on a tiny input prompt, with 4-bit quantization and a model with only 3B active parameters, is very underwhelming.
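To put that in concrete terms, a quick calculation using only the two figures quoted in this comment ($320 and 13 tokens/s). The 500-token response length is a made-up example value, not something from the benchmark.

```python
# Figures quoted in the comment: $320 of hardware, 13 tokens/s.
hardware_cost_usd = 320
tokens_per_second = 13

# Dollars of hardware per unit of sustained generation throughput.
usd_per_token_per_second = hardware_cost_usd / tokens_per_second

# Time to generate one response (hypothetical 500-token answer).
response_tokens = 500  # assumption for illustration
seconds_per_response = response_tokens / tokens_per_second

print(f"${usd_per_token_per_second:.2f} of hardware per token/s")
print(f"{seconds_per_response:.1f} s for a {response_tokens}-token response")
```

This ignores prompt processing time, which would add to the wait for longer inputs.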

by tarruda

0 subcomment

I suspect you'd get similar numbers with a modern x86 mini PC that has 32GB of RAM.

by geerlingguy

1 subcomments

distributed-llama is great, I just wish it would work with more models. I've been happy with ease of setup and its ongoing maintenance compared to Exo, and performance vs llama.cpp RPC mode.

by bjt12345

0 subcomment

Does Distributed Llama use RDMA over Converged Ethernet or is this roadmapped? I've always wondered if RoCE and Ultra-Ethernet will trickle down into the consumer market.

by mmastrac

1 subcomments

Is the network the bottleneck here at all? That's impressive for a gigabit switch.

by poly2it

0 subcomment

Neat, but at this price scaling it's probably better to buy GPUs.

by echelon

7 subcomments

This is really impressive.
If we can get this down to a single Raspberry Pi, then we have crazy embedded toys and tools. Locally, at the edge, with no internet connection.
Kids will be growing up with toys that talk to them and remember their stories.
We're living in the sci-fi future. This was unthinkable ten years ago.

by ab_testing

0 subcomment

Would it work better on a used GPU?

by varispeed

3 subcomments

So would 40x RPi 5 get 130 token/s?
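Probably not linearly. A toy model of why: if per-token compute divides across nodes but every extra node adds some synchronization cost, throughput peaks and then drops. The two constants below are invented so that 4 nodes lands near 13 tokens/s; they are not measurements of distributed-llama or of any real network.

```python
def tokens_per_second(nodes, compute_s=0.25, sync_s_per_node=0.004):
    """Toy scaling model, not a benchmark.

    compute_s: hypothetical single-node compute time per token (seconds).
    sync_s_per_node: hypothetical sync cost per token added by each extra node.
    """
    per_token_s = compute_s / nodes + sync_s_per_node * (nodes - 1)
    return 1.0 / per_token_s

# Compare a few cluster sizes under the made-up constants above.
for n in (1, 4, 8, 16, 40):
    print(f"{n:>2} nodes: {tokens_per_second(n):.1f} tokens/s")
```

With these constants, 40 nodes ends up slower than 4, because the sync term dominates once the compute share per node gets small. Real numbers depend on the interconnect and on how the software splits the model.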

by kosolam

1 subcomments

How is this technically done? How does it split the query and aggregate the results?
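One common way to do this is tensor parallelism: rather than splitting the query, each node holds a slice of every weight matrix, computes its slice of the output, and the slices are stitched back together. The sketch below shows the idea on a single tiny matrix-vector product. It is a generic illustration with invented values, and whether distributed-llama does exactly this is an assumption here, not something taken from its code.

```python
def matvec(rows, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

def split_rows(weight, n_nodes):
    """Give each node a contiguous slice of the matrix's output rows."""
    per_node = len(weight) // n_nodes  # assumes rows divide evenly
    return [weight[i * per_node:(i + 1) * per_node] for i in range(n_nodes)]

# Hypothetical 4x3 weight matrix and input activation vector.
weight = [[1, 0, 2], [0, 1, 0], [3, 1, 1], [2, 2, 0]]
x = [1, 2, 3]

# Each "node" holds only its shard of the weights.
shards = split_rows(weight, n_nodes=2)

# Every node receives the full input and computes its slice of the output.
partials = [matvec(shard, x) for shard in shards]

# Aggregation is concatenating the slices back in node order.
y = [value for part in partials for value in part]

# The stitched result matches the single-machine computation.
assert y == matvec(weight, x)
```

In a real cluster the `partials` step runs on separate machines and the concatenation is a network exchange, which is where the per-layer synchronization cost comes from.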

by ineedasername

0 subcomment

This is highly usable in an enterprise setting when the task benefits from near-human level decision making and when $acceptable_latency < 1s meets decisions that can be expressed in natural language <= 13tk.
Meaning that if you can structure a range of situations and tasks clearly in natural language with a pseudo-code type of structure and fit it in model context then you can have an LLM perform a huge amount of work with Human-in-the-loop oversight & quality control for edge cases.
Think of office jobs, white-collar work, where business process documentation, employee guides, and job aids already fully describe 40% to 80% of the work. These are the tasks most easily structured with scaffolding prompts and more specialized RLHF-enriched data, after which the model performs those tasks more consistently.
This is what I describe when I'm asked "But how will they do $X when they can't answer $Y without hallucinating?"
I explain the above capability, then I ask the person to do a brief thought experiment: How often have you heard, or yourself thought something like, "That is mindnumbingly tedious" and/or "a trained monkey could do it"?
In the end, I don't know anyone who is aware of the core capabilities in the structured natural-language sense above and doesn't see at a glance just how many jobs can easily go away.
I'm not smart enough to see where all the new jobs will be, or certain there will be as many of them. If I were, I'd start or invest in such businesses. Maybe not many new jobs get created, but then so what?
If the net productivity and output-- essentially the wealth-- of the global workforce remains the same or better with AI assistance and therefore fewer work hours, that means... What? Less work on average, per capita. More wealth per hour worked, per capita, than before.
Work hours used to be longer; they can shorten again. The problem is getting there: overcoming not just the "sure, but only the CEOs will get wealthy" side of things but also the "full time means 40 hours a week minimum" attitude held by more than just managers and CEOs.
It will also mean that our concept of the "proper wage" for unskilled labor that can't be automated will have to change too. Wait staff at restaurants, retail workers, countless low-end service workers in food and hospitality? They'll now be providing-- and giving up-- something much more valuable than white-collar skills that are outdated. They'll be giving their time to what I've heard described as "embodied work"; the term is jarring to my ears, but it is what it is, and I guess it fits. And anyway, I've long considered my time to be something I'll trade with a great deal more reluctance than my money, and so I demand a lot of money for it when it's required, so I can use that money to buy more time (by not having to work) somewhere in the near future, even if it's just by covering my costs for getting groceries delivered instead of spending the time to go shopping myself.
Wow, this comment got away from me. But seeing Qwen3 30B level quality with 13tk/s on dirt cheap HW struck a deep chord of "heck, the global workforce could be rocked to the core for cheap+quality 13tk/s." And that alone isn't the sort of comment you can leave as a standalone drive-by on HN and have it be worth the seconds to write it. And I'm probably wrong on a little or a lot of this and seeing some ideas on how I'm wrong will be fun and interesting.

by misternintendo

2 subcomments

At this speed, this is only suitable for time-insensitive applications.