Hacker News

Google releases Gemma 4 open models

1516 points by jeffmcjunkin

by danielhanchen

19 subcomments

Thinking / reasoning + multimodal + tool calling.
We made some quants at https://huggingface.co/collections/unsloth/gemma-4 for folks to run them - they work really well!
Guide for those interested: https://unsloth.ai/docs/models/gemma-4
Also note: use temperature = 1.0, top_p = 0.95, top_k = 64, and the EOS is "<turn|>". "<|channel>thought\n" is also used for the thinking trace!
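
For readers wondering what those three knobs do together, here is a minimal pure-Python sketch of temperature, top-k and top-p (nucleus) filtering on a toy `{token_id: logit}` dict. The function name `sample_next_token` and the filter order (temperature, then top-k, then top-p over the renormalised survivors) are assumptions for illustration; llama.cpp, vLLM, MLX and other engines each have their own sampler chains.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=64, top_p=0.95, rng=random):
    """Draw one token id from a {token_id: logit} dict (illustrative sketch)."""
    # 1. Temperature: divide logits; 1.0 leaves the distribution unchanged.
    scaled = {tok: value / temperature for tok, value in logits.items()}

    # 2. Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    weights = [math.exp(value - peak) for _, value in ranked]
    total = sum(weights)
    probs = [w / total for w in weights]

    # 3. Top-p: keep the smallest prefix whose cumulative probability >= top_p.
    kept_tokens, kept_probs, mass = [], [], 0.0
    for (tok, _), p in zip(ranked, probs):
        kept_tokens.append(tok)
        kept_probs.append(p)
        mass += p
        if mass >= top_p:
            break

    # random.choices treats the remaining weights as relative, so no renormalising needed.
    return rng.choices(kept_tokens, weights=kept_probs, k=1)[0]


toy_logits = {0: 2.0, 1: 1.0, 2: 0.5, 3: -1.0}
print(sample_next_token(toy_logits, rng=random.Random(0)))
```

With top_k = 64 and top_p = 0.95, the top-p cut usually does most of the work: only the handful of tokens covering 95% of the probability mass can be drawn.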

by simonw

8 subcomments

I ran these in LM Studio and got unrecognizable pelicans out of the 2B and 4B models and an outstanding pelican out of the 26b-a4b model - I think the best I've seen from a model that runs on my laptop.
https://simonwillison.net/2026/Apr/2/gemma-4/
The gemma-4-31b model is completely broken for me - it just spits out "---\n" no matter what prompt I feed it. I got a pelican out of it via the AI Studio API hosted model instead.

by scrlk

3 subcomments

Comparison of Gemma 4 vs. Qwen 3.5 benchmarks, consolidated from their respective Hugging Face model cards:

    | Model          | MMLUP | GPQA  | LCB   | ELO  | TAU2  | MMMLU | HLE-n | HLE-t |
    |----------------|-------|-------|-------|------|-------|-------|-------|-------|
    | G4 31B         | 85.2% | 84.3% | 80.0% | 2150 | 76.9% | 88.4% | 19.5% | 26.5% |
    | G4 26B A4B     | 82.6% | 82.3% | 77.1% | 1718 | 68.2% | 86.3% |  8.7% | 17.2% |
    | G4 E4B         | 69.4% | 58.6% | 52.0% |  940 | 42.2% | 76.6% |   -   |   -   |
    | G4 E2B         | 60.0% | 43.4% | 44.0% |  633 | 24.5% | 67.4% |   -   |   -   |
    | G3 27B no-T    | 67.6% | 42.4% | 29.1% |  110 | 16.2% | 70.7% |   -   |   -   |
    | GPT-5-mini     | 83.7% | 82.8% | 80.5% | 2160 | 69.8% | 86.2% | 19.4% | 35.8% |
    | GPT-OSS-120B   | 80.8% | 80.1% | 82.7% | 2157 |  --   | 78.2% | 14.9% | 19.0% |
    | Q3-235B-A22B   | 84.4% | 81.1% | 75.1% | 2146 | 58.5% | 83.4% | 18.2% |  --   |
    | Q3.5-122B-A10B | 86.7% | 86.6% | 78.9% | 2100 | 79.5% | 86.7% | 25.3% | 47.5% |
    | Q3.5-27B       | 86.1% | 85.5% | 80.7% | 1899 | 79.0% | 85.9% | 24.3% | 48.5% |
    | Q3.5-35B-A3B   | 85.3% | 84.2% | 74.6% | 2028 | 81.2% | 85.2% | 22.4% | 47.4% |

    MMLUP: MMLU-Pro
    GPQA: GPQA Diamond
    LCB: LiveCodeBench v6
    ELO: Codeforces ELO
    TAU2: TAU2-Bench
    MMMLU: MMMLU
    HLE-n: Humanity's Last Exam (no tools / CoT)
    HLE-t: Humanity's Last Exam (with search / tool)
    no-T: no think

by lousken

0 subcomment

The speed is complete poopoo, even on their API. Spending 5 seconds thinking about a "hello how you doin" prompt on their TPUs is insane, and something must be wrong with this model.

by neonstatic

6 subcomments

Prompt:
> what is the Unix timestamp for this: 2026-04-01T16:00:00Z
Qwen 3.5-27b-dwq
> Thought for 8 minutes 34 seconds. 7074 tokens.
> The Unix timestamp for 2026-04-01T16:00:00Z is:
> 1775059200 (my comment: Wednesday, 1 April 2026 at 16:00:00)
Gemma-4-26b-a4b
> Thought for 33.81 seconds. 694 tokens.
> The Unix timestamp for 2026-04-01T16:00:00Z is:
> 1775060800 (my comment: Wednesday, 1 April 2026 at 16:26:40)
Gemma considered three options to solve this problem. From the thinking trace:
> Option A: Manual calculation (too error-prone).
> Option B: Use a programming language (Python/JavaScript).
> Option C: Knowledge of specific dates.
It then wrote a python script:
```
from datetime import datetime, timezone
date_str = "2026-04-01T16:00:00Z"
# Replace Z with +00:00 for ISO format parsing or just strip it
dt = datetime.strptime(date_str, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
ts = int(dt.timestamp())
print(ts)
```
Then it verified the timestamp with a command:
```
  date -u -d @1775060800
```
All of this to produce a wrong result. Running the Python script it produced gives the correct result. Running the verification date command leads to an error (likely because `-d @…` is GNU `date` syntax; on macOS/BSD the equivalent is `date -u -r 1775060800`). On the other hand, Qwen went straight to Option A and kept overthinking the question, verifying every step 10 times, experienced a mental breakdown, then finally returned the right answer. I think Gemma would be clearly superior here if it had actually used the tools it came up with rather than hallucinating their output.
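
For anyone who wants to check the two answers without trusting either model, here is a small standard-library Python sketch that computes the expected timestamp and shows how far each reply is off. Only the prompt and the two numbers come from the comment above; the helper name `as_utc` is just for illustration.

```python
import calendar
from datetime import datetime, timezone

PROMPT = "2026-04-01T16:00:00Z"

# strptime with a literal "Z" works on every Python 3 version
# (datetime.fromisoformat only accepts a trailing "Z" from Python 3.11).
dt = datetime.strptime(PROMPT, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
expected = int(dt.timestamp())

# Independent cross-check: calendar.timegm interprets the tuple as UTC.
assert expected == calendar.timegm(dt.utctimetuple())


def as_utc(ts):
    """Render a Unix timestamp back as an ISO-8601 UTC string."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()


for label, answer in [("Qwen", 1775059200), ("Gemma", 1775060800)]:
    print(label, answer, as_utc(answer), "off by", answer - expected, "s")
# Qwen 1775059200 2026-04-01T16:00:00+00:00 off by 0 s
# Gemma 1775060800 2026-04-01T16:26:40+00:00 off by 1600 s
```

The 1600-second gap is exactly the 26 min 40 s discrepancy noted in the comment.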

by canyon289

20 subcomments

Hi all! I work on the Gemma team, one of many people who contributed, as this was a bigger effort given it was a mainline release. Happy to answer whatever questions I can.

by chrislattner

2 subcomments

If you want the fastest open source implementation on Blackwell and AMD MI355, check out Modular's MAX nightly. You can pip install it super fast, check it out here: https://www.modular.com/blog/day-zero-launch-fastest-perform...
-Chris Lattner (yes, affiliated with Modular :-)

by antirez

6 subcomments

Featuring the ELO score as the main benchmark in the chart is very misleading. The big dense Gemma 4 model does not seem to reach the Qwen 3.5 27B dense model in most benchmarks. This is obviously what matters. The small 2B / 4B models are interesting and may potentially be better ASR models than specialized ones (not just for performance but because they are going to be easily served via llama.cpp / MLX and front-ends). Also interesting for "fast" OCR, given they are vision models as well. But other than that, the release is a bit disappointing.
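
The "most benchmarks" claim can be checked against the numbers in scrlk's consolidated table earlier in the thread. This short Python sketch copies the G4 31B and Q3.5-27B rows and counts which model scores higher in each column. It takes the reported numbers at face value and ignores differences in evaluation settings and run-to-run noise.

```python
# Scores copied from the consolidated table (G4 31B vs Q3.5-27B).
# Every column is "higher is better", including Codeforces ELO.
columns = ["MMLUP", "GPQA", "LCB", "ELO", "TAU2", "MMMLU", "HLE-n", "HLE-t"]
gemma_31b = [85.2, 84.3, 80.0, 2150, 76.9, 88.4, 19.5, 26.5]
qwen_27b = [86.1, 85.5, 80.7, 1899, 79.0, 85.9, 24.3, 48.5]

leader = {}
for name, g, q in zip(columns, gemma_31b, qwen_27b):
    # No ties occur in this data, so a strict comparison is enough.
    leader[name] = "Gemma 4 31B" if g > q else "Qwen 3.5 27B"
    print(f"{name:6} G4={g:<7} Q3.5={q:<7} -> {leader[name]}")

qwen_wins = sum(1 for who in leader.values() if who == "Qwen 3.5 27B")
print(f"Qwen 3.5 27B leads on {qwen_wins} of {len(columns)} columns")
```

On these numbers Qwen 3.5 27B leads on 6 of 8 columns; Gemma 4 31B leads only on Codeforces ELO and MMMLU.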

by NitpickLawyer

0 subcomment

Best thing is that this is Apache 2.0 (edit: and they have base models available. Gemma3 was good for finetuning)
The sizes are E2B and E4B (following gemma3n arch, with focus on mobile) and 26BA4 MoE and 31B dense. The mobile ones have audio in (so I can see some local privacy focused translation apps) and the 31B seems to be strong in agentic stuff. 26BA4 stands somewhere in between, similar VRAM footprint, but much faster inference.

by originalvichy

4 subcomments

The wait is finally over. One or two iterations, and I’ll be happy to say that language models are more than fulfilling my most common needs when self-hosting. Thanks to the Gemma team!

by swalsh

3 subcomments

I gave the same prompt (a small rust project that's not easy, but not overly sophisticated) to both Gemma-4 26b and Qwen 3.5 27b via OpenCode. Qwen 3.5 ran for a bit over an hour before I killed it, Gemma 4 ran for about 20 minutes before it gave up. Lots of failed tool calls.
I asked codex to write a summary about both code bases.
"Dev 1" Qwen 3.5
"Dev 2" Gemma 4
Dev 1 is the stronger engineer overall. They showed better architectural judgment, stronger completeness, and better maintainability instincts. The weakness is execution rigor: they built more, but didn’t verify enough, so important parts don’t actually hold up cleanly.
Dev 2 looks more like an early-stage prototyper. The strength is speed to a rough first pass, but the implementation is much less complete, less polished, and less dependable. The main weakness is lack of finish and technical rigor.
If I were choosing between them as developers, I’d take Dev 1 without much hesitation.
Looking at the code myself, i'd agree with codex.

by nl

2 subcomments

Gemma-4-E4B-it scored 15/25 on my https://sql-benchmark.nicklothian.com/#all-data (agentic SQL generation).
The naming is a bit odd - E4B is "4.5B effective, 8B with embeddings", so despite the name it is probably best compared with the 8B/9B class models and is competitive with them.
Qwen3.5-9B also scores 15/25 in thinking mode for example. The best 9B model I've found is Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2 which gets to 17/25
gemma-4-E2B (4bit quant) scored 12/25, but is really a 5B model. That's the same as NVIDIA-Nemotron-3-Nano-4B which is the best 4B model I've found (yes, better than Qwen 4B).
That's a great score for a small model.

by minimaxir

1 subcomments

The benchmark comparisons to Gemma 3 27B on Hugging Face are interesting: The Gemma 4 E4B variant (https://huggingface.co/google/gemma-4-E4B-it) beats the old 27B in every benchmark at a fraction of parameters.
The E2B/E4B models also support voice input, which is rare.

by d4rkp4ttern

1 subcomments

For token-generation speed, a challenging test is to see how it performs in a code-agent harness like Claude Code, which has anywhere between 15-40K tokens from the system prompt itself (+ tools/skills etc).

Here the 26B-A4B variant is head and shoulders above recent open-weight models, at least on my trusty M1 Max 64GB MacBook.

I set up Claude Code to use this variant via llama-server, with 37K tokens initial context, and it performs very well: ~40 tokens/sec, far better than Qwen3.5-35B-A3B, though I don't know yet about the intelligence or tool-calling consistency. Prompt processing speed is comparable to the Qwen variant at ~400 tok/s.
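
To put those two rates in perspective, here is a back-of-the-envelope Python sketch: a cold 37K-token prompt at ~400 tok/s takes roughly a minute and a half before the first output token, after which the ~40 tok/s decode rate is what you feel. The 500-token reply length is a hypothetical figure, and the sketch ignores prompt caching, which can let later turns skip re-processing an unchanged prefix.

```python
def seconds(tokens, tokens_per_second):
    """Time to process `tokens` at a steady throughput."""
    return tokens / tokens_per_second


context_tokens = 37_000  # initial Claude Code context reported above
prefill_rate = 400       # ~tok/s prompt processing
decode_rate = 40         # ~tok/s generation for Gemma-4-26B-A4B

prefill = seconds(context_tokens, prefill_rate)
reply = seconds(500, decode_rate)  # hypothetical 500-token reply
print(f"prefill ~ {prefill:.1f} s, 500-token reply ~ {reply:.1f} s")
```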

My informal tests, all with roughly 30K-37K tokens initial context:

    ┌────────────────────┬───────────────┬────────────┐
    │       Model        │ Active Params │ tg (tok/s) │
    ├────────────────────┼───────────────┼────────────┤
    │ Gemma-4-26B-A4B    │ 4B            │ ~40        │
    ├────────────────────┼───────────────┼────────────┤
    │ GPT-OSS-20B        │ 3.6B          │ ~17-38     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-30B-A3B      │ 3B            │ ~15-27     │
    ├────────────────────┼───────────────┼────────────┤
    │ GLM-4.7-Flash      │ 3B            │ ~12-13     │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3.5-35B-A3B    │ 3B            │ ~12        │
    ├────────────────────┼───────────────┼────────────┤
    │ Qwen3-Next-80B-A3B │ 3B            │ ~3-5       │
    └────────────────────┴───────────────┴────────────┘

Full instructions for running this and other open-weight models with Claude Code are here:

https://pchalasani.github.io/claude-code-tools/integrations/...

by Praxwise

0 subcomment

I just checked the status of the domain registrations and noticed that the domain squatters have already started taking action. Almost all of the domains have been registered.

by noritaka88

0 subcomment

Apache 2.0 is a big shift here.
Previous Gemma licenses made agent deployments (especially BYOK setups) a bit of a gray zone legally. This makes it much easier to run models like Gemma 4 as agent backends without worrying about downstream usage.
Also interesting from an agent perspective: the 26B MoE hitting #6 while activating ~4B params.
If you’re running multiple agents on a single machine, that kind of efficiency actually matters more than raw model size.

by mudkipdev

2 subcomments

Can't wait for gemma4-31b-it-claude-opus-4-6-distilled-q4-k-m on huggingface tomorrow

by Analog24

1 subcomments

So the "E2B" and "E4B" models are actually 5B and 8B parameters. Are we really going to start referring to the "effective" parameter count of dense models by not including the embeddings?
These models are impressive but this is incredibly misleading. You need to load the embeddings in memory along with the rest of the model, so it makes no sense to exclude them from the parameter count. This is why it actually takes 5GB of RAM to run the "2B" model with 4-bit quantization according to Unsloth (when I first saw that I knew something was up).
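
The point is easy to see with a rough weights-only estimate (parameters × bits ÷ 8). This Python sketch uses the E4B figures quoted elsewhere in this thread (4.5B effective, 8B with embeddings). It is an approximation under stated assumptions: real GGUF/MLX files mix bit widths, and runtime memory adds KV cache and overhead on top, which is part of why reported RAM figures run higher.

```python
def weight_gb(params_billion, bits_per_weight):
    """Rough size of the weights alone in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, runtime overhead, and mixed-precision
    quants that keep some tensors (often embeddings) above the nominal width.
    """
    return params_billion * bits_per_weight / 8


# Parameter counts as quoted in this thread for Gemma 4 E4B.
for label, params in [("E4B effective", 4.5), ("E4B incl. embeddings", 8.0)]:
    for bits in (4, 8, 16):
        print(f"{label:22} {bits:>2}-bit ~ {weight_gb(params, bits):5.2f} GB")
```

At 4 bits that is about 2.25 GB if you only count the "effective" parameters versus about 4 GB once the embeddings are included.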

by Igor_Wiwi

0 subcomment

I created a blog post specifically about running these models locally on your machine (1 liner but getting gguf may take some time): https://igorstechnoclub.com/running-gemma-4-locally-in-almos...

by Reubend

0 subcomment

I would suggest that people stop overfocusing on benchmarks, and give this a try. Gemma 4 is performing really well for me, and seems to hallucinate much less than other models I tried in this size range.

by try-working

0 subcomment

The biggest story here is that this is Google handing Qwen the SOTA crown for small and medium models.
For the first time ever, a Chinese lab is at the frontier. Google and Nvidia are significantly behind, not just on benchmarks but real-world performance like tool calling accuracy.

by karimf

2 subcomments

I'm curious about the multimodal capabilities of the E2B and E4B and how fast they are.
In ChatGPT right now, you can give the AI an audio and video feed, and it can respond in real time.
Now I wonder if the E2B or the E4B is capable enough for this and fast enough to run on an iPhone: basically replicating that experience, but with all the computation (STT, LLM, and TTS) done locally on the phone.
I just made this [0] last week, so I know you can run a real-time voice conversation with an AI on an iPhone, but it'd be a totally different experience if it could also process a live camera feed.
[0] https://github.com/fikrikarim/volocal

by Deegy

2 subcomments

So what's the business strategy here?
Google is the only USA based frontier lab releasing open models. I know they aren't doing it out of the goodness of their hearts.

by ceroxylon

1 subcomments

Even with search grounding, it scored a 2.5/5 on a basic botanical benchmark. It would take much longer for the average human to do a similar write-up, but they would likely do better than 50% hallucination if they had access to a search engine.

by RandyOrion

0 subcomment

Thank you Gemma team for releasing small dense VLM(s).
The elo ranking [1] is too good to be true. I don't know why gemma-4-26b-a4b performs better than gemma-4-31b.
Also waiting for more bugfixes in llama.cpp, sglang and vllm to do proper evaluations.
[1] https://arena.ai/leaderboard/text/expert?license=open-source

by bertili

0 subcomment

The timing is interesting as Apple supposedly will distill google models in the upcoming Siri update [1]. So maybe Gemma is a lower bound on what we can expect baked into iPhones.
[1] https://news.ycombinator.com/item?id=47520438

by stevenhubertron

1 subcomments

Still pretty unusable on a Raspberry Pi 5 (16GB) with the E4B model, despite it supposedly being built for it:

  total duration:       12m41.34930419s
  load duration:        549.504864ms
  prompt eval count:    25 token(s)
  prompt eval duration: 309.002014ms
  prompt eval rate:     80.91 tokens/s
  eval count:           2174 token(s)
  eval duration:        12m36.577002621s
  eval rate:            2.87 tokens/s

Prompt: whats a great chicken breast recipe for dinner tonight?

by jwr

2 subcomments

Really looking forward to testing and benchmarking this on my spam filtering benchmark. gemma-3-27b was a really strong model, surpassed later by gpt-oss:20b (which was also much faster). qwen models always had more variance.

by aggregator-ios

0 subcomment

I tested the E2B and E4B models and they get close but inaccurate (non working) results when generating jq queries from natural language.
This is of importance to me as I work on https://jsonquery.app and would prefer to use a model that works well with browser inference.
gemma-4-26b-a4b-it and gemma-4-31b-it produced accurate results in a few of my tests. But those are 50-60GB in size. Chrome has a developer preview that bundles Gemini Nano (under 2GB) and it used to work really well, but it requires a few flags to be switched on manually, and it has recently gotten worse in quality when testing for jq generation.

by simonw

2 subcomments

Anyone figured out a recipe to run Gemma 4 E2B or E4B against audio files locally on a Mac?

by VadimPR

1 subcomments

Gemma 3n E4B runs very quickly on my Samsung S26, so I am looking forward to trying Gemma 4! It is fantastic to have local alternatives to frontier models in an offline manner.

by mchusma

1 subcomments

For those curious, on OpenRouter this is $0.14 per million input tokens and $0.40 per million output tokens, or ballpark half of Gemini Flash Lite 3.1 (Google's cheapest current-gen closed model).
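
Assuming those figures are OpenRouter's usual per-million-token list prices (they can change, and differ between providers), a quick Python estimator for a workload looks like this. The 50M/5M monthly volume is a hypothetical example, not a measurement.

```python
# Quoted OpenRouter prices, assumed to be USD per 1M tokens.
PRICE_PER_M_INPUT = 0.14
PRICE_PER_M_OUTPUT = 0.40


def cost_usd(input_tokens, output_tokens):
    """Estimated cost in USD for a given token volume at the quoted rates."""
    return (input_tokens * PRICE_PER_M_INPUT + output_tokens * PRICE_PER_M_OUTPUT) / 1_000_000


# Hypothetical monthly volume: 50M input tokens, 5M output tokens.
print(f"${cost_usd(50_000_000, 5_000_000):.2f}")  # prints $9.00
```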

by Retro_Dev

0 subcomment

I'm very pleased with the performance of the largest gemma4 model (which I tested through ollama). My singular data point on whether an LLM remembers things well is whether it can translate toki pona to (and from) English. I find it easy to evaluate because I know the language. This local LLM marks the first version that 1) doesn't hallucinate words - at least, for the largest model - and 2) uses common word-phrases that other toki pona speakers use, and most importantly 3) can actually run on my laptop.

by sigbottle

1 subcomments

There are so many heavy hitting cracked people like daniel from unsloth and chris lattner coming out of the woodworks for this with their own custom stuff.
How does the ecosystem work? Have things converged and standardized enough where it's "easy" (lol, with tooling) to swap out parts such as weights to fit your needs? Do you need to autogen new custom kernels to fix said things? Super cool stuff.

by chrischavez

0 subcomment

Went through the official blog and the developers post, no mention of TurboQuant anywhere. Google's own research team tested it on Gemma models for KV-cache compression to 3 bits, so it's surprising it's not mentioned in this release. Anyone know if it's baked in already or if we'd need to apply it ourselves? Would love to run the 26B MoE locally as a daily driver.

by burgerquizz

0 subcomment

I want to embed a lightweight local model to be used for my webapp to use it without thinking about token price. is there an acceptable way to do it today?

by vicchenai

0 subcomment

The 4B being this capable is honestly surprising. Ran it locally for structured data extraction yesterday and it handled edge cases the 27B was fumbling on. Didn't expect to swap down that fast.

by fooker

1 subcomments

What's a realistic way to run this locally, or on a single expensive remote dev machine (in a VM, not through API calls)?

by wg0

5 subcomments

Google might not have the best coding models (yet), but they seem to have the most intelligent and knowledgeable models of all; Gemini 3.1 Pro especially is something.
One more thing about Google is that they have everything that others do not:
1. Huge data: audio, video, geospatial.
2. Tons of expertise. "Attention Is All You Need" was born there.
3. Libraries that they wrote.
4. Their own data centers and cloud.
5. Most of all, their own hardware, TPUs, that no one else has.
Therefore once the bubble bursts, the only player standing tall and above all would be Google.

by gslepak

0 subcomment

"casually dropping the most capable open weights on the planet" — @RyanMullins
Google folks do something really cool!
Gemma4 source: https://github.com/huggingface/transformers/pull/45192

by babelfish

1 subcomments

Wow, 30B parameters as capable as a 1T parameter model?

by ggnore7452

0 subcomment

too bad that only the smaller on-device models support native audio input.

by i386

0 subcomment

You can try this new model live using mesh-llm right now: https://www.anarchai.org/dashboard

by screenshotapi

0 subcomment

I love how they have both the 31B dense and 26B MoE, both fit well locally. Any MLX ports already?

by flakiness

0 subcomment

It's good they still have non-instruction-tuned models.

by kuboble

0 subcomment

I'm really looking forward to trying it out.
Gemma 3 was the first model that I have liked enough to use a lot just for daily questions on my 32G gpu.

by whhone

0 subcomment

The LiteRT-LM CLI (https://ai.google.dev/edge/litert-lm/cli) provides a way to try the Gemma 4 model.

  # with uvx
  uvx litert-lm run \
    --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
    gemma-4-E2B-it.litertlm

by darshanmakwana

0 subcomment

This is awesome! I will try to use them locally with opencode and see if they are usable as a replacement for Claude Code for basic tasks.

by logicallee

0 subcomment

If anyone here is interested in its creative writing style, I gave both the 10 GB and 20 GB models the prompt "write a short story"; here are the results: [1]
They don't really have the structure of a short story, though the 20 GB model is more interesting and has two characters rather than just one character.
In another comment, I gave them coding tasks, if you want to see how fast it does at coding (on a 24 GB Mac Mini M4 with 10 cores) you can watch me livestream this here: [2]
Both models completed the fairly complex coding task well.
[1] https://pastebin.com/ZcWv6Hkb
[2] https://www.youtube.com/live/G5OVcKO70ns

by bearjaws

0 subcomment

The labels on the table read "Gemma 431B IT" which reads as 431B parameter model, not Gemma 4 - 31B...

by hikarudo

0 subcomment

Also check out DeepMind's "The Gemma 4 Good Hackathon" on Kaggle:
https://www.kaggle.com/competitions/gemma-4-good-hackathon

by stephbook

0 subcomment

Kind of sad they didn't release stronger versions. $dayjob offers strong NVidias that are hungry for models and are stuck running llama, gpt-oss etc.
Seems like Google and Anthropic (which I consider leaders) would rather keep their secret sauce to themselves – understandable.

by oblio

0 subcomment

How do these compare to Open AI OSS?

by popinman322

0 subcomment

Does anyone know whether we'll be receiving transcoders for this batch of models? We got them for Gemma 3, but maybe that was a one-off.

by yalogin

0 subcomment

Do these come in quantized variants too? I mean maybe 10B or lower? Wonder how they function.

by james2doyle

2 subcomments

Hmm just tried the google/gemma-4-31B-it through HuggingFace (inference provider seems to be Novita) and function/tool calling was not enabled...

by 0xbadcafebee

1 subcomments

Gemma 3 models were pretty bad, so hopefully they got Gemma 4 to at least come close to the other major open weights

by synergy20

0 subcomment

a dumb question, is this better than qwen3.5 and I thus should switch over?

by rvz

1 subcomments

Open weight models once again marching on and slowly being a viable alternative to the larger ones.
We are at least 1 year and at most 2 years until they surpass closed models for everyday tasks that can be done locally to save spending on tokens.

by AnonyMD

0 subcomment

It's great that it can run in a local environment.

by virgildotcodes

2 subcomments

Downloaded through LM Studio on an M1 Max 32GB, 26B A4B Q4_K_M
First message:
https://i.postimg.cc/yNZzmGMM/Screenshot-2026-04-03-at-12-44...
Not sure if I'm doing something wrong?
This more or less reflects my experience with most local models over the last couple years (although admittedly most aren't anywhere near this bad). People keep saying they're useful and yet I can't get them to be consistently useful at all.

by gunalx

0 subcomment

We didnt get deepseek v4, but gemma 4. Cant complain.

by stefs

0 subcomment

i get a lot of tool call errors with gemma-4-26b-a4b, because the tokens don't seem to match up.

by DeepYogurt

1 subcomments

maybe a dumb question but what does the "it" stand for in the 31B-it vs 31B?

by kvntrnz

0 subcomment

Let's gooo keen to try it out

by ahwg1iuwh

0 subcomment

اىلا

by ahwgiuwh

0 subcomment

اىلا

by bertili

2 subcomments

Qwen: Hold my beer
https://news.ycombinator.com/item?id=47615002

by EdoardoIaga

0 subcomment

great!

by daveguy

0 subcomment

FYI, it took me a while to find the meaning of the "-it" in some models. That's how Google designates "instruction tuned". Come on Google. Define your acronyms.

by matt765

0 subcomment

I'll wait for the next iteration

by einpoklum

0 subcomment

D: Di Gi Charat does not like this nyo! Gemma is supposed to help Dejiko-chan nyo!
G: They offered a very compelling benefits package gemma!

by heraldgeezer

2 subcomments

Gemma vs Gemini?
I am only a casual AI chatbot user, I use what gives me the most and best free limits and versions.

by vigneshj

0 subcomment

Great one to have

by ahwgiuwh

0 subcomment

rjdntrtd

by bibimsz

0 subcomment

is it good? what's it good for?

by Agent01001

0 subcomment

looks cool

by techpulselab

0 subcomment

[dead]

by DanDeBugger

0 subcomment

[dead]

by devnotes77

0 subcomment

[dead]

by wei03288

0 subcomment

[dead]

by aplomb1026

0 subcomment

[dead]

by asim

0 subcomment

[dead]

by janalsncm

1 subcomments

I don’t think this should be dead @dang?

by davedyarrow

0 subcomment

[dead]

by mwizamwiinga

0 subcomment

curious how this scales with larger datasets. anyone tried it in production?

by kelsey98765431

0 subcomment

[dead]

by evanbabaallos

0 subcomment

[flagged]

by a7om_com

0 subcomment

[flagged]
