Hacker News

Gemma 4 QAT models: Optimizing compression for mobile and laptop efficiency

395 points by theanonymousone

by simonw

3 subcomments

I just ran one of these locally on a Mac like this:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu \
    --prompt="Generate an SVG of a pelican riding a bicycle"

The first time you run that it downloads 3.2GB to ~/.cache/huggingface/hub/models--litert-community--gemma-4-E2B-it-litert-lm

It can handle audio and image input too, which is pretty cool for a 3.2GB model. For images:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --vision-backend gpu \
    --attachment image.jpg --prompt describe

And for audio:

  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm \
    --backend=gpu --audio-backend cpu \
    --attachment audio.wav --prompt transcribe

(The pelican is rubbish, but it's only a 3.2GB file so the fact it even outputs valid SVG is impressive to me: https://gist.github.com/simonw/94b318afde4b1ce5ff67d4b5d0362... )
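For anyone who wants to check the "valid SVG" part themselves, here is a minimal sketch using only the Python standard library. It only tests that the output is well-formed XML with an svg root element, which is weaker than full SVG validation; the function name looks_like_svg is made up for illustration.

```python
import xml.etree.ElementTree as ET


def looks_like_svg(text: str) -> bool:
    """Return True if text parses as XML and its root element is <svg>."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Not well-formed XML (unclosed tags, stray text, and so on).
        return False
    # The tag is "{namespace}svg" when xmlns is declared, plain "svg" otherwise.
    return root.tag.rsplit("}", 1)[-1] == "svg"


if __name__ == "__main__":
    sample = '<svg xmlns="http://www.w3.org/2000/svg"><circle r="1"/></svg>'
    print(looks_like_svg(sample))
```

A model reply usually wraps the SVG in other text, so the markup would have to be extracted before passing it to this check.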

by satvikpendem

4 subcomments

Unsloth's collection as well [0], with their results [1]. Looks like they can get very close to 100% accuracy compared to the unquantized BF16 model, and Unsloth's quants are better than Google's original QAT as posted in the article.
Personally, I'm using the 2B model for web search and structured JSON output via Unsloth Studio and its API; it works very well for that, even with the model embedded on phones.
[0] https://huggingface.co/collections/unsloth/gemma-4-qat
[1] https://unsloth.ai/docs/models/gemma-4/qat#qat-analysis

by jhatax

3 subcomments

It’s the Friday before WWDC during which Apple is going to announce an “improved” Siri based on Google models (a locked partnership, for now). Maybe it’s a coincidence, but this might be Google releasing models that will be showcased next week by Apple?
No knowledge, just speculation.

by jbarrow

0 subcomment

Very impressed with how much the Gemma ecosystem has advanced just this week.
Gemma 12B, multitoken prediction, and official quants released. Feels like Google is putting real effort into this string of releases, and I'm very excited to see that!

by minimaxir

2 subcomments

It's a bit awkward to release Gemma 4 12B (https://news.ycombinator.com/item?id=48385906), and then a canonical Q4_0 Gemma 4 12B a couple days later.
It's good that this post lists the expected VRAM usage for the models, with Q4_0 Gemma 4 12B being 6.7GB, which is indeed consistent with Google's claim of fitting within 16GB comfortably, although it confirms that only the quantized version will do so.
Relatedly, in Google's newly released Edge Gallery for macOS, Gemma 4 12B is explicitly listed as unsupported due to not enough RAM even on a 16GB machine, but given the expected VRAM usage here the Q4_0 variant definitely should fit and Google should fix that.
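As a rough sanity check on the 6.7GB figure, here is a back-of-the-envelope sketch. It assumes the GGUF Q4_0 layout (blocks of 32 weights, each block storing 32 four-bit values plus one fp16 scale, so 18 bytes per block) and a hypothetical round count of exactly 12e9 weights, all stored as Q4_0. Real files keep some tensors at higher precision and the true parameter count will differ, so treat this as an estimate only.

```python
def q4_0_size_gb(n_params: float) -> float:
    """Approximate size in GB (10^9 bytes) of n_params weights stored as Q4_0."""
    weights_per_block = 32
    bytes_per_block = 2 + 16  # one fp16 scale + 32 four-bit values
    return n_params / weights_per_block * bytes_per_block / 1e9


def bf16_size_gb(n_params: float) -> float:
    """Approximate size in GB of n_params weights stored as BF16 (2 bytes each)."""
    return n_params * 2 / 1e9


if __name__ == "__main__":
    n = 12e9  # assumed nominal parameter count, not an official figure
    print(f"Q4_0: {q4_0_size_gb(n):.2f} GB")
    print(f"BF16: {bf16_size_gb(n):.2f} GB")
```

Under those assumptions the Q4_0 weights come to about 6.75GB versus 24GB for BF16, which is in line with the quoted number and with the point that only the quantized version fits in 16GB.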

by RandyOrion

1 subcomments

From the perspective of a local LLM user, I think QAT doesn't solve the major problem of the Gemma models.
The Gemma family (gen 1 to gen 4) consistently has an extreme range of activations, e.g., 600000, essentially forcing people to use a bf16 KV cache and accept a short context window, e.g., 31B, iq4_xs quantization, 100k context window on 32GB memory. Or people use a q8 KV cache with a 200k context window and accept a large performance penalty.
In contrast, for the Qwen 3.5 family, the largest activation is below 2000, making a q8 or even lower-precision KV cache essentially free real estate. Together with linear attention, which doesn't require a KV cache, the full 262k context window can be easily reached.
QAT training with a w4a16 target, while improving inference performance with low-precision weights, doesn't solve the KV cache problem at all.
In the end, a QAT is a QAT, and there are unseen efforts behind QAT checkpoints. Thank you, Gemma team, for releasing QAT checkpoints.
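To make the two trade-offs above concrete, here is a toy calculation, not a measurement of either model. It assumes a GGUF-style Q8_0 layout for the KV cache (blocks of 32 int8 values plus one fp16 scale, so 34 bytes per 32 elements) and plain absmax scaling within a block; the function names are made up for illustration.

```python
def q8_step(absmax: float) -> float:
    """Quantization step of an absmax-scaled int8 block whose largest magnitude is absmax."""
    return absmax / 127


def context_at_same_budget(ctx_bf16: int) -> int:
    """Context length that fits in the same KV cache memory after moving from bf16 to Q8_0."""
    bits_bf16 = 16.0
    bits_q8_0 = 34 * 8 / 32  # 8.5 bits per element including the per-block scale
    return int(ctx_bf16 * bits_bf16 / bits_q8_0)


if __name__ == "__main__":
    for absmax in (600000.0, 2000.0):
        step = q8_step(absmax)
        # Values smaller than half a step round to zero in that block.
        print(f"absmax {absmax:>8.0f}: step {step:8.1f}, zeroed below {step / 2:8.1f}")
    print(context_at_same_budget(100_000))
```

With an outlier of 600000 the step is about 4724, so anything in the same block smaller than roughly 2362 is rounded away, whereas an outlier of 2000 gives a step of about 15.7. On the memory side, 16 bits versus 8.5 bits per element means a 100k bf16 cache budget holds roughly 188k tokens at Q8_0, close to the 100k versus 200k figures mentioned.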

by taffydavid

1 subcomments

Noob q: can advancements like this targeted at local inference have bonus effects for cloud inference? Presumably if you can get great results on cheaper hardware that also equates to less resource usage on cutting edge hardware, and less power draw?
Will advancements like this ultimately reduce the carbon footprint of AI?

by Catloafdev

0 subcomment

Being able to run the 12B on 8GB VRAM is huge. It's crazy to see how fast these small local models have evolved.

by netdur

2 subcomments

Had a good run with Gemma 4 E2B Unsloth 4Q: https://youtube.com/shorts/XLsAnz5aAAI
The E4B model doesn’t fit on my phone's TPU, so it swaps to RAM. The QAT version means more accuracy, good!

by jack_pp

0 subcomment

Ran hf.co/google/gemma-4-12B-it-qat-q4_0-gguf:Q4_0 with ollama on an AMD Ryzen 9 8940HX, NVIDIA GeForce RTX 5060 (8 GB), 14 GB RAM laptop and it is surprisingly fast.

by WhiteDawn

2 subcomments

Once someone generates a MTP layer for 26B A4B 4 QAT I'll be singing from the hills with my 5 year old GPU.

by arjun-mavonic

0 subcomment

Yet to try this. But what I heard from a friend is that Gemma 4 12B calls the same tools repeatedly. Maybe the harness can be made to handle it.

by somewhatrandom9

1 subcomments

Could these quantized models make MTP (Multi-Token Prediction) significantly faster when used as drafters for larger regular Gemma 4 models?

by cr3cr3

1 subcomments

For a moment I got excited thinking QAT is Intel Quick Assist Technology...

by superkuh

1 subcomments

I wish they would release the base (non instruction tuned) models for use with pattern completion.

by nicman23

0 subcomment

The new 4 12B model replaced qwen3.6 27b for me. The task I am doing is a bit specific (validating whether a stamp has the correct name), but the ones that it could not see, maybe 30 percent, were easily discerned.

by nazgul17

0 subcomment

I don't see these QAT models on Edge Gallery; just the BF16 models are there. Is there anything I am missing?

by zkmon

1 subcomments

How can the smaller Unsloth GGUF quant beat the original Google quant? (ref: unsloth/gemma-4-31B-it-qat-GGUF)

by redox99

1 subcomments

I was just testing Gemma E2B and E4B yesterday, and they are just too dumb to be useful outside of niche use cases.
Besides, there's no good agent on Android. Having a model that can't run web searches and browse websites is limited in use, particularly small models that really need to be grounded on search results to be factual, because they can't memorize enough.
Edit: I'd like to know what kind of usage the people that seem to disagree and downvoted this are having.

by Kylejeong21

0 subcomment

google pixel intelligence may beat apple intelligence

by refulgentis

1 subcomments

@google.com'ers, there are no GGUFs (blog says there are)

by steno132

6 subcomments

I don't get this obsession with smaller models. I've been using Claude and GPT models for years and have had zero issues with them.
I see absolutely no benefit to me as an end user from a local model which is going to take up more of my CPU and memory and slow down my machine. I almost always have Internet, and if I don't, then not having access to an AI model is the least of my concerns.