- the interactive devices - all the Alexa/Google/Apple devices out there are this interface, plus probably some TV input that stays local and that I can voice control. That kind of thing. It should have a good speaker and voice control. It should probably also do other things, like act as a wifi range extender or be the router. That would actually be good. I would buy one for each room, so no need for crazy antennas if they are close and can create a true mesh network for me. But I digress.
- the home 'cloud' server that is storage and control. This is a cheap CPU, a little RAM, and potentially a lot of storage. It should hold the 'apps' for my home and be the one place I can back up everything about my network (including the network config!)
- the inference engines. That is where this kind of repo/device combo comes in. I buy it, it knows how to advertise its services in a standard way (a rough sketch of what that could look like is just below), and the controlling node connects it to the home devices. It would be great to just plug it in and go.
Of course all of these could be combined, but conceptually I want to be able to swap and mix and match at these levels, so options and interoperability are what really matter here.
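For "advertise its services in a standard way", plain mDNS/DNS-SD seems like the obvious building block. A minimal sketch with avahi, assuming a made-up _llm._tcp service type and TXT record (the real thing would need an agreed-on schema):

avahi-publish -s "freddy-inference" _llm._tcp 11434 "model=qwen3-0.6b"   # the box announces itself on the LAN
avahi-browse -rt _llm._tcp                                               # the controlling node discovers it

Anything that speaks zeroconf could then pair with the box, no account or phone-home step required.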
I know a lot of (all of) these pieces exist, but they don't work well together. There isn't a simple, standard 'buy this, turn it on, and pair it with your local network' kind of plug-and-play environment.
My core requirements are really privacy, and that it starts taking over the unitaskers and plays well with other things. There is a reason I am buying all this local stuff. If you phone home or require me to set up an account with you, I probably don't want to buy your product. I want to be able to say 'Freddy, set a timer for 10 mins' or 'Freddy, what is the number one tourist attraction in South Dakota' (Wall Drug, if you were wondering).
> On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality.
They also talk about other hardware and details, but that's the expanded version of the headline claim.
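A quick sanity check on the memory side of that claim: 30B parameters at 2.70 bits per weight is roughly 10 GB of weights, which is about all a 16 GB Pi 5 can comfortably hold once you leave room for the KV cache and the OS (and it matches the ~10 GB resident size reported further down the thread):

echo "scale=2; 30 * 2.70 / 8" | bc   # ~10.12 GB of weights for a 30B model at 2.70 bpw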
If you have very specific, constrained tasks, it can do quite a lot. It's not perfect, though.
https://tools.nicklothian.com/llm_comparator.html?gist=fcae9... is an example conversation where I took OpenAI's "Natural language to SQL" prompt[1], sent it to Ollama's qwen3:0.6b, and then asked Gemini Flash 3 to compare what qwen3:0.6b did vs what Flash did.
Flash was clearly correct, but the qwen3:0.6b errors are interesting in themselves.
[1] https://platform.openai.com/docs/examples/default-sql-transl...
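For anyone who wants to poke at the same setup, this is roughly how the local half can be reproduced against Ollama's HTTP API (my reconstruction, not the exact tooling behind that link; the prompt is abbreviated and the schema is a placeholder):

curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:0.6b",
  "prompt": "Given the following SQL tables, write a query that answers the question. <schema and question go here>",
  "stream": false
}' | jq -r '.response'

Then hand both answers to a stronger model to judge, which is what the comparator link above is showing.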
That got me thinking again about what 'practical' even means when it comes to AI running on the edge, away from big servers.
I came up with a basic way to look at it. First, capability: what kinds of tasks can it handle decently? Then latency: does it respond quickly enough that it does not feel laggy? There are also constraints to consider, things like power use, how much memory it needs, and heat buildup.
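For the latency axis, llama.cpp's bundled benchmark reports prompt-processing and generation tokens/sec directly, which is one way to put a number on "does it feel laggy" (the model path here is just a placeholder):

./build/bin/llama-bench -m models/your-model.gguf -t 4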
The use case part seems key too. What happens if you try to take it off the cloud and run it locally? In my experience, a lot of these edge AI demos fall short there. The tech looks impressive, but it is hard to see why you really need it that way.
It seems like most people overlook that unclear need. I am curious how others see it. Local inference probably beats cloud in spots where you cannot rely on internet, for privacy reasons, or when data has to stay on device for security.
Some workloads feel close right now. They might shift soon as hardware gets better. I think stuff like voice assistants or simple image recognition could tip over.
If someone has actually put a model on limited hardware, like in a product, what stood out as a surprise? The thermals maybe, or unexpected power drains? It feels like that part gets messy in practice. I might be oversimplifying how tricky it all is.
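On the thermals point specifically: if the limited hardware is a Raspberry Pi, the firmware will tell you directly whether heat or undervoltage is already biting:

vcgencmd measure_temp    # prints something like temp=61.2'C
vcgencmd get_throttled   # throttled=0x0 means no throttling or undervoltage flags are set

Sustained generation is exactly the kind of load that can flip those flags, so it's worth watching while a model is running.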
./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4
...
Loading model... ggml_aligned_malloc: insufficient memory (attempted to allocate 24576.00 MB)
ggml_backend_cpu_buffer_type_alloc_buffer: failed to allocate buffer of size 25769803776
alloc_tensor_range: failed to allocate CPU buffer of size 25769803776
llama_init_from_model: failed to initialize the context: failed to allocate buffer for kv cache
Segmentation fault
I'm not sure how they're running it... any kind of guide for replicating their results? It takes up a little over 10 GB of RAM (watching with btop) before it segfaults and quits. [Edit: had to add -c 4096 to cut down the context size, now it loads.]
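For anyone else hitting the same wall, the edit spelled out: same command, just with the context capped so the KV cache stops asking for the ~24 GB shown above.

./build/bin/llama-cli -m "models/Qwen3-30B-A3B-Instruct-2507-Q3_K_S-2.70bpw.gguf" -e --no-mmap -t 4 -c 4096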
For anyone interested in a comparative review of different models that can run on a Pi, here's a great article [1] I came across while working on my project [0].
[0] https://github.com/syxanash/maxheadbox
[1] https://www.stratosphereips.org/blog/2025/6/5/how-well-do-ll...
Original: 11 tok/s, Byteshape: 16 tok/s
Quite a nice improvement (roughly 45% faster)!
Going from BF16 to 2.8 bpw and losing only ~5% sounds odd to me.
Eight tokens per second is "real time" in that sense, but it's also the kind of speed we used to mock old video games for, when they would show "computers" but the text would slowly get printed to the screen letter by letter or word by word.
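For scale, though: at a rough 0.75 words per token, 8 tok/s is about 6 words a second, or around 360 words a minute, which is at or above typical silent reading speed even if it visually recalls that old letter-by-letter effect:

echo "8 * 0.75 * 60" | bc   # ~360 words per minute, assuming ~0.75 words per token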