gemma-3n-E4B-it-Q8_0
import_cuda_impl: initializing gpu module...
get_rocm_bin_path: note: hipcc not found on $PATH
[...]
llama_model_load: error loading model: error loading model architecture: unknown model architecture: 'gemma3n'
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model 'gemma-3n-E4B-it-Q8_0.gguf'
main: error: unable to load model
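For anyone hitting the same error: as far as I understand it, the loader reads the `general.architecture` string from the GGUF metadata and gives up if the build doesn't know that name, so an older runtime will reject 'gemma3n' regardless of the quantization. Below is a rough sketch for checking what a file declares before trying to load it. It assumes a little-endian GGUF file (version 2 or later) and my reading of the published GGUF layout; `gguf_architecture` is a name made up for this example, not part of llama.cpp.

```python
import io
import struct

# Byte sizes of the fixed-width GGUF metadata value types, keyed by type id
# (uint8, int8, uint16, int16, uint32, int32, float32, bool, uint64, int64, float64).
_FIXED_SIZES = {0: 1, 1: 1, 2: 2, 3: 2, 4: 4, 5: 4, 6: 4, 7: 1, 10: 8, 11: 8, 12: 8}
_STRING, _ARRAY = 8, 9


def _read_string(f):
    # GGUF strings are a uint64 length followed by UTF-8 bytes (no terminator).
    (length,) = struct.unpack("<Q", f.read(8))
    return f.read(length).decode("utf-8")


def _skip_value(f, value_type):
    # Advance past one metadata value without decoding it.
    if value_type in _FIXED_SIZES:
        f.seek(_FIXED_SIZES[value_type], io.SEEK_CUR)
    elif value_type == _STRING:
        _read_string(f)
    elif value_type == _ARRAY:
        elem_type, count = struct.unpack("<IQ", f.read(12))
        for _ in range(count):
            _skip_value(f, elem_type)
    else:
        raise ValueError(f"unknown GGUF value type {value_type}")


def gguf_architecture(f):
    """Return the 'general.architecture' string from an open binary file, or None."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    # Header: uint32 version, uint64 tensor count, uint64 metadata key/value count.
    _version, _n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    for _ in range(n_kv):
        key = _read_string(f)
        (value_type,) = struct.unpack("<I", f.read(4))
        if key == "general.architecture" and value_type == _STRING:
            return _read_string(f)
        _skip_value(f, value_type)
    return None
```

Usage would be something like `gguf_architecture(open("gemma-3n-E4B-it-Q8_0.gguf", "rb"))`. If that returns a name your runtime's release notes don't mention, updating the runtime is probably the fix rather than re-downloading the model.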
"Gemini Nano allows you to deliver rich generative AI experiences without needing a network connection or sending data to the cloud." -- replace Gemini with Gemma and the sentence is still valid.
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E4B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
./llama.cpp/llama-cli -hf unsloth/gemma-3n-E2B-it-GGUF:UD-Q4_K_XL -ngl 99 --jinja --temp 0.0
I'm also working on an inference + finetuning Colab demo! I'm very impressed since Gemma 3N has audio, text and vision! https://docs.unsloth.ai/basics/gemma-3n-how-to-run-and-fine-...
Does anybody know how to actually run these using MLX? mlx-lm does not currently seem to support them, so I wonder what Google means exactly by "MLX support".
I solved my spam problem with gemma3:27b-it-qat, and my benchmarks show that this is the size at which the current models start becoming useful.
However, it's still 8B parameters and there are no quantized models just yet.
Until it goes into the inner details (MatFormer, per-layer embeddings, caching...), the only sentence I've found that concretely mentions a new thing is "the first model under 10 billion parameters to reach [an LMArena score over 1300]". So it's supposed to be better than every other model short of those that need 10GB+ of RAM, if I understand that right?
Though I can imagine a few commercial applications where something like this would be useful. Maybe in some sort of document processing pipeline.
I think it’s something that even Google should consider: publishing open-source models with the possibility of grounding their replies in Google Search.
Cherry-picking something that's quick to evaluate:
"High throughput: Processes up to 60 frames per second on a Google Pixel, enabling real-time, on-device video analysis and interactive experiences."
You can download an APK from the official Google project for this, linked from the blogpost: https://github.com/google-ai-edge/gallery?tab=readme-ov-file...
If I download it and run it on a Pixel Fold with the actual 2B model, which is half the size of the ones the 60 fps claim is made for, it takes 6.2-7.5 seconds to begin responding (3 samples, 3 different photos). Generation speed is shown at 4-5 tokens per second, slightly slower than what llama.cpp does on my phone. (I maintain an AI app that, inter alia, wraps llama.cpp on all platforms.)
So, *0.16* frames a second at best, not 60 fps.
The blog post is jammed with claims that this is special for on-device use and performance, claims that just... seemingly aren't true. At all.
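To spell out the arithmetic, here is a quick sketch. It assumes one photo per request and treats the time until the model starts responding as the per-frame cost, which is the only number the demo app gives me; the function name is just for illustration.

```python
def frames_per_second(seconds_per_frame: float) -> float:
    # Throughput is the reciprocal of the time spent on one frame.
    return 1.0 / seconds_per_frame


# Fastest and slowest time-to-first-response from my 3 samples, in seconds.
latencies = (6.2, 7.5)
measured_fps = [frames_per_second(s) for s in latencies]

# The blog post's headline figure.
claimed_fps = 60.0
# How many times slower the measured rate is than the claim.
shortfall = [claimed_fps / fps for fps in measured_fps]

for s, fps, gap in zip(latencies, measured_fps, shortfall):
    print(f"{s:.1f} s/frame -> {fps:.2f} fps, {gap:.0f}x below the claim")
```

That works out to roughly 0.13 to 0.16 fps, or somewhere between 370x and 450x short of 60 fps.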
- Are they missing a demo APK?
- Was there some massive TPU leap since the Pixel Fold release?
- Is there a lot of BS in there that they're pretty sure won't be called out in a systematic way, given the amount of effort it takes to get this inferencing?
- I used to work on Pixel, and I remember thinking that it seemed like there weren't actually public APIs for the TPU. Is that what's going on?
In any case, either:
A) I'm missing something big, or
B) they are lying, repeatedly, big time, in a way that would be shown near-immediately when you actually tried building on it because it "enables real-time, on-device video analysis and interactive experiences."
Everything I've seen the last year or two indicates they are lying, big time, regularly.
But if that's the case:
- How are they getting away with it, over this length of time?
- How come I never see anyone else mention these gaps?
What's interesting is that it beats smarter models in my Turing Test Battle Royale[1]. I wonder if that means it is a better talker.
Maybe you could install it on YouTube, where my 78-year-old mother received a spammy advert this morning from a scam app pretending to be an iOS notification.
Kinda sick of companies spending untold billions on this while their core product remains a pile of user-hostile shite. :-)
I am posting again because I've been here 16 years now and it is very suspicious that that happened; given the replies to it, we now know this blog post is false.
There is no open model that you can download today and run at even 1% of the claims in the blog post.
You can read a reply from someone indicating they have inside knowledge on this, who notes this won't work as advertised unless you're Google (i.e. internally, they have it binding to a privileged system process that can access the Tensor core, and this isn't available to third parties; anyone else is getting 1/100th of the speeds in the post).
This post promises $150K in prizes for on-device multimodal apps and tells you it's running at up to 60 fps. They know it runs at 0.1 fps, Engineering says it is because they haven't prioritized 3rd parties yet, and somehow Google is getting away with this.