Hacker News

DiffusionGemma: 4x Faster Text Generation

301 points by meetpateltech

by vineyardmike

9 subcomments

I recently switched to OpenCode to try out many of the Non-US-Frontier-Labs models. My unexpected favorite model to use was Mercury (a diffusion model). Not because it was “smart” but because it was stupid fast. It was more of a pair-programming experience than the SOTA agentic experience of prompting and waiting. Honestly, it was also way more fun and brought back some of the pre-AI coding experience while still getting some benefits of AI. It felt less like a slot machine where you prompt, wait, and hope it went in the right direction. It even made me use the tiny models like Gemini Flash Lite and GPT Mini/Nano more too.
Anyways, so excited for an open-weight model and I hope it performs well. I’ll be testing this ASAP.

by samuelknight

2 subcomments

Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.
An LLM's decoder computes tokens one at a time because attention has to account for each previous token. Existing LLM decoders scale well when you have enough load to batch many inferences together. Diffusion is of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GB of weights back and forth from RAM. That's because consumer RAM like LPDDRx/GDDRx has lower bandwidth than HBM, and the requests are serial, so you can't batch compute over common weights. Diffusion can compute tokens in parallel, which relieves the memory bandwidth bottleneck.
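
A rough back-of-the-envelope sketch of the bandwidth argument above. The numbers used (a 100 GB/s memory bus, 2 GB of weights streamed per forward pass, 8 tokens finalized per pass) are illustrative assumptions, not measurements of any real device or of DiffusionGemma. The model also ignores compute limits and KV-cache traffic.

```python
def max_tokens_per_second(bandwidth_gb_s: float,
                          weights_read_gb: float,
                          tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode speed in the bandwidth-bound regime.

    Assumes every forward pass must stream `weights_read_gb` of weights
    from memory, and that `tokens_per_pass` tokens are finalized per pass
    on average (1 for plain autoregressive decoding).
    """
    if bandwidth_gb_s <= 0 or weights_read_gb <= 0 or tokens_per_pass <= 0:
        raise ValueError("all arguments must be positive")
    passes_per_second = bandwidth_gb_s / weights_read_gb
    return passes_per_second * tokens_per_pass


# Hypothetical edge device: 100 GB/s bus, 2 GB of weights read per pass.
autoregressive = max_tokens_per_second(100.0, 2.0, tokens_per_pass=1.0)
parallel = max_tokens_per_second(100.0, 2.0, tokens_per_pass=8.0)
print(f"one token per pass:    {autoregressive:.0f} tok/s")
print(f"eight tokens per pass: {parallel:.0f} tok/s")
```

Under these assumptions the ceiling scales linearly with the number of tokens finalized per weight read, which is the single-user case where batching across requests is not available.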

by SwellJoe

3 subcomments

Google keeps flexin'. It's surprising that Gemini isn't more competitive against Claude or OpenAI models for code and agentic use, because it's clear Google still has some of the best AI people in the business. But, I guess Google is focused on stuff that runs on phones and near-realtime use cases, rather than the big thinky LLMs.
All these efficiency improvements seem likely to be really important to the future of AI, though, as the money starts flowing the other direction. The days of subsidized tokens to try to lock people into specific ecosystems are coming to an end, and we're going to have to start paying what it actually costs.
The companies that figure out how to make it cost-effective to run really smart models are the ones that will win. DeepSeek costs an order of magnitude less than GPT 5.5 or Opus 4.8. It's worse than either, but not catastrophically worse. I'll happily pay ten times as much for the best coding model, because it saves enough human time to justify it, but not a hundred times as much, which is where things seem to be heading (GPT 5.5 Pro cost over 200 times as much as DeepSeek in some benchmarks I recently did, and ~30 times as much as Opus 4.8).

by simonw

3 subcomments

NVIDIA are hosting a free endpoint for this one, details at https://build.nvidia.com/google/diffusiongemma-26b-a4b-it - you have to create an account and (I think) verify a phone number too.
(I got it to draw a pelican: https://tools.simonwillison.net/markdown-svg-renderer#url=ht... )

by beklein

0 subcomment

A good visual explanation of how text diffusion models like DiffusionGemma work: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...

by bachmeier

1 subcomments

> DiffusionGemma reverses this inefficiency. Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential. It upgrades your model inference from a single, sequential typewriter to a massive printing press that stamps the entire block of text simultaneously.
> Operating as a 26B total Mixture of Experts (MoE) model that activates only 3.8B parameters during inference, DiffusionGemma fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
Okay, so Gemma 4 26B is a MoE model that's really fast on my 24 GB GPU using ollama. This sounds like speculative decoding but I don't think that works with MoE models? It's hard to keep up with all this when it's not your job to keep up with it.
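
For the VRAM figure in the quote, a small sketch of the weights-only arithmetic. It assumes all 26B parameters stay resident in memory (the usual situation for an MoE model unless experts are offloaded) and ignores KV cache, activations, and quantization metadata, so real usage will be higher. The bit widths are illustrative; the quote does not say which quantization the 18GB figure refers to.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    bytes_total = total_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# 26B total parameters, as stated in the quoted announcement.
for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(26, bits):.1f} GB")
```

Only the active 3.8B parameters are read per token, which helps speed, but the memory footprint is set by the total parameter count.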

by minimaxir

3 subcomments

A few days ago I was just thinking that Google never talked about their diffusion text generation model after demoing it at I/O a year ago. The rumor is that it was too expensive to run, but with the provided chart using the same 1x H100 hardware and comparing DiffusionGemma to regular Gemma, that shouldn't be the case. I'm curious what the downside of this speed is, aside from being slightly weaker than Gemma.

by SkitterKherpi

1 subcomments

It is cool, but local models, while okay, already feel noticeably worse than even the cheapest APIs, so I can't see myself sacrificing even a little bit of their quality for speed. I'm sure it's worth it for some use cases; curious to hear specific ones that people are already planning to deploy to production.

by schmorptron

1 subcomments

What would a diffusing reasoning model look like? Would it have a pre-defined-length [thinking] block that gets diffused over a long time, with the final output block then using what is in that thinking block as part of its input? And how do diffusion models decide the output length in the first place? Is it a pre-set parameter, or does it diffuse an [end] token into the middle somewhere?
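
A toy sketch of one possible answer to the length question: a fixed-length block starts fully masked, the most confident positions are filled in over several steps, and the output is cut at the first end token. This is only an illustration of the idea; whether DiffusionGemma works this way is not stated in this thread, and `toy_predict` is a hypothetical stand-in for a real model.

```python
MASK, EOS = "[MASK]", "[EOS]"


def toy_predict(block):
    """Hypothetical model stub: one (token, confidence) pair per position.

    It always proposes a fixed answer followed by EOS padding, with
    earlier positions more confident than later ones.
    """
    target = ["the", "answer", "is", "42"]
    predictions = []
    for i in range(len(block)):
        token = target[i] if i < len(target) else EOS
        predictions.append((token, 1.0 / (1 + i)))
    return predictions


def diffuse_block(block_len=8, tokens_per_step=2):
    """Unmask a fixed-length block a few positions at a time."""
    block = [MASK] * block_len
    steps = 0
    while MASK in block:
        predictions = toy_predict(block)
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        # Commit the most confident still-masked positions this step.
        masked.sort(key=lambda i: predictions[i][1], reverse=True)
        for i in masked[:tokens_per_step]:
            block[i] = predictions[i][0]
        steps += 1
    # The block length is pre-set; the visible length is set by the first EOS.
    end = block.index(EOS) if EOS in block else len(block)
    return block[:end], steps


text, steps = diffuse_block()
print(" ".join(text), f"({steps} steps)")
```

In this sketch both guesses in the question hold at once: the block length is a pre-set parameter, and the effective length comes from where the end token lands.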

by petercooper

0 subcomment

I'm not getting anywhere near the speeds advertised on my 3090 Ti, alas, but it's fun watching it "fill out" its answers. I did Simon's "SVG pelican on a bicycle" test on it and the result was quite minimalistic but fit the brief: https://gist.github.com/peterc/7672e74ec1437945e5fca5ce2c1c9... -- this was on the Q4 quant running on patched llama.cpp. I will be interested to see if Simon's looks much different.

by xnx

2 subcomments

Is the diffusion approach any use in Multi-Token Prediction (MTP) drafters? https://blog.google/innovation-and-ai/technology/developers-...

by incognito124

0 subcomment

I just *love* the commit message on GitHub: "Make TPUs go brr"

by roosgit

1 subcomments

Can LoRAs be used to increase the quality of these diffusion models? Nvidia mentions something about this https://huggingface.co/nvidia/Nemotron-Labs-Diffusion-8B#inf...

by LarsDu88

2 subcomments

Does anyone know the current intrinsic limitations of diffusion text models compared to autoregressive ones?
I ran this question by ChatGPT and Claude and they came up with limitations in GRPO RLVR, but I'm not sure.

by anotherpaul

0 subcomment

Maybe someone can explain: in image generation, some models are already using rectified flow, which was hailed as the next big thing. Are we going to see discrete rectified flow models next, or is that unlikely?

by zamalek

0 subcomment

Is anyone doing text diffusion in latent space instead of tokens?

by kkukshtel

2 subcomments

I think this is the future. The sort of left-field rumble that turns into a quake in 5 years.

by najarvg

2 subcomments

Do diffusion models support tool calls? If so is the tool call support on par with autoregressive models or worse? (edited spelling)

by bandrami

0 subcomment

I always thought that fundamentally diffusors were the cooler idea of the two

by chc4

0 subcomment

is it just me that thinks it's kinda weird that they conflate speed in tokens/second and latency, when i think of latency as time to first token? like it generates an entire paragraph of tokens faster, but wouldn't it still be slower if your reply is only 1 word, because it has to do the entire 256 tokens as a chunk?
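
A small sketch of the trade-off raised above, comparing wall-clock time for a short and a long reply. All timings (20 ms per autoregressive token, 16 denoising steps per 256-token block, 40 ms per step) are made-up assumptions for illustration, and the model assumes a whole block must be denoised even for a one-word reply, with no early exit.

```python
import math


def autoregressive_ms(n_tokens: int, ms_per_token: float) -> float:
    """Time to emit n_tokens one at a time."""
    return n_tokens * ms_per_token


def block_diffusion_ms(n_tokens: int, block_size: int,
                       steps_per_block: int, ms_per_step: float) -> float:
    """Time to emit n_tokens when whole blocks are denoised together."""
    blocks = math.ceil(n_tokens / block_size)
    return blocks * steps_per_block * ms_per_step


for n in (1, 256):
    ar = autoregressive_ms(n, ms_per_token=20.0)
    diff = block_diffusion_ms(n, block_size=256, steps_per_block=16, ms_per_step=40.0)
    print(f"{n:>3} tokens: autoregressive {ar:.0f} ms, block diffusion {diff:.0f} ms")
```

Under these assumptions the block approach wins clearly on a full paragraph and loses on a one-word reply, so tokens/second and time-to-answer are not interchangeable.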

by diimdeep

0 subcomment

I wish labs would do QAT and release these quants. At this point, looking at releases of bf16 without QAT feels like looking at half-baked bread: we can quantize it, but it is not the same as QAT. Or am I missing something here?

by nullc

0 subcomment

Has anyone evaluated any diffusion LLMs for error spotting?
E.g. run your normal autoregressive LLMs (with MTP or whatever you like), then run a single diffusion pass over the result, and observe any tokens that diffusion thinks are unlikely.
Then prompt the autoregressive LLM with some structured reasoning "<think>Is <diffusion unlikely part> an error? .."
Because the diffusion model is so structurally different perhaps it makes different errors such that this would provide gains even vs running distinct autoregressive LLMs which often make the same errors.
The same argument could apply for RWKV but it would be relatively expensive to apply it as a second pass on a big block of output, while it seems like a diffusion model would be cheaper.
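
A minimal sketch of the flagging step in the idea above. It assumes a second model has already produced one probability per token of the first model's output; how a diffusion model would expose those scores is not covered here, and the function names, threshold, and example values are hypothetical.

```python
def flag_unlikely_tokens(tokens, probabilities, threshold=0.05):
    """Return (index, token, probability) for tokens the scorer finds unlikely."""
    if len(tokens) != len(probabilities):
        raise ValueError("tokens and probabilities must have the same length")
    return [
        (i, tok, p)
        for i, (tok, p) in enumerate(zip(tokens, probabilities))
        if p < threshold
    ]


def build_review_prompt(flags):
    """Turn flagged tokens into a structured follow-up prompt."""
    questions = [f"Is '{tok}' at position {i} an error?" for i, tok, _ in flags]
    return "<think>" + " ".join(questions)


# Made-up scores for a line of generated code.
tokens = ["total", "=", "a", "-", "b"]
probabilities = [0.90, 0.95, 0.80, 0.02, 0.70]

flags = flag_unlikely_tokens(tokens, probabilities)
print(flags)
print(build_review_prompt(flags))
```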

by jauntywundrkind

0 subcomment

I'm curious how diffusion models do at tool calling, and what wins there are there.
The video demo of the svg sword is an interesting example of what is so interesting about diffusion models: it's not just putting one token after another to make edits to a file. It's skipping around, it's re-editing previous lines. I feel like forcing it to write tool calls is maybe not its best nature.
I feel like, instead of a monolithic edit-file tool call, the diffusion model would perhaps be better suited to posting a change stream: a series of edit ops across multiple files.
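
One way to picture the change-stream idea is as a list of small, ordered edit operations that can jump between files and revisit earlier lines. The `EditOp` type and `apply_ops` helper below are hypothetical names for illustration, not an existing tool-call format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditOp:
    path: str      # file the edit applies to
    line: int      # 0-based index of the line to replace
    new_text: str  # replacement text for that line


def apply_ops(files: dict[str, list[str]], ops: list[EditOp]) -> dict[str, list[str]]:
    """Apply edit ops in order and return new file contents (inputs untouched)."""
    result = {path: list(lines) for path, lines in files.items()}
    for op in ops:
        lines = result[op.path]
        if not 0 <= op.line < len(lines):
            raise IndexError(f"{op.path}: line {op.line} out of range")
        lines[op.line] = op.new_text
    return result


files = {"a.py": ["x = 1", "y = 2"], "b.py": ["print(x)"]}
stream = [
    EditOp("a.py", 1, "y = 3"),
    EditOp("b.py", 0, "print(x + y)"),
    EditOp("a.py", 1, "y = 4"),  # revisits a line edited earlier
]
updated = apply_ops(files, stream)
print(updated)
```

The stream can skip around and re-edit the same line, which mirrors the behavior described in the demo more closely than a single whole-file rewrite.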

by jlintc

0 subcomment

[flagged]

by hmate9

0 subcomment

I can’t help but feel like there’s something here that will matter for future LLMs.
The bidirectionality could be a big deal: being able to refine a sentence with both left and right context feels closer to how editing/thinking actually works than committing to each token forever.
Maybe the current models aren’t good enough yet, but the direction feels important.

by 2001zhaozhao

0 subcomment

[dead]

by rvz

1 subcomments

We need more local open weight models that are performant and just as good (or good enough) as the best frontier ones.
Then you will be able to achieve Jevons Paradox and enjoy the same “productivity gains” without paying the extortionate token prices of closed model providers, or at least get them as cheaply as possible.
And especially, no silent nerfing of the model.