by mike_hearn
28 subcomments
- A good chance to bring up something I've been flagging to colleagues for a while now: with LLM agents we are very quickly going to become even more CPU-bottlenecked on testing than we are today, and every team I know of was bottlenecked on CI speed even before LLMs. There's no point having an agent that can write code 100x faster than a human if every change takes an hour to test.
Maybe I've just been unlucky in the past, but in most projects I've worked on, a lot of developer time was wasted waiting for PRs to go green. Many runs end up bottlenecked on I/O or worker availability, so changes can sit in queues for hours, or they flake out and everything has to start again.
As they get better, coding agents are going to be assigned simple tickets that they turn into green PRs, with the model reacting to test failures and fixing them as it goes. This will make the CI bottleneck even worse.
It feels like there's a lot of low-hanging fruit in most projects' testing setups, but for some reason I've seen nearly no progress here for years. It feels like we kinda collectively got used to the idea that CI services are slow and expensive, then stopped trying to improve things. If anything, CI got a lot slower over time as people tried to make builds fully hermetic (so no inter-run caching) and moved them from dedicated on-prem hardware to expensive cloud VMs with slow I/O that hasn't got much faster over time.
Mercury is crazy fast, and in the few quick tests I did it produced good, correct code. How will we make test execution keep up with it?
by true_blue
5 subcomments
- I tried the playground and got a strange response. I asked for a regex pattern, and the model gave itself a little game plan, then it wrote the pattern and started to write tests for it. But it never stopped writing tests. It kept writing tests of increasing size until, I guess, it reached a context limit and the answer was canceled. Also, for each test it wrote, it added a comment about whether the test should pass or fail, but after about the 30th test it started getting those wrong too, saying a test should fail when it should actually pass if the pattern were correct. And after about the 120th test, the tests stopped making sense at all; they were just nonsense characters until the answer got cut off.
The pattern it made was also wrong, but I think the first issue is more interesting.
- In their tech report, they say this is based on:
> "Our methods extend [28] through careful modifications to the data and computation to scale up learning."
[28] is Lou et al. (2023), the "Score Entropy Discrete Diffusion" (SEDD) model (https://arxiv.org/abs/2310.16834).
I wrote the first (as far as I can tell) independent from-scratch reimplementation of SEDD:
https://github.com/mstarodub/dllm
My goal was making it as clean and readable as possible.
I also implemented the more complex denoising strategy they describe in the paper (but didn't themselves implement).
It runs on a single GPU in a few hours on a toy dataset.
- ICYMI, DeepMind also has a Gemini model that is diffusion-based[1]. I've tested it a bit and while (like with this model) the speed is indeed impressive, the quality of responses was much worse than other Gemini models in my testing.
[1] https://deepmind.google/models/gemini-diffusion/
- There's a ton of performance upside in most GPU-adjacent code right now.
However, is this what arXiv is for? It seems more like marketing than research. Please correct me if I'm wrong/naive on this topic.
- I'm using the free playground link, and it is in fact extremely fast. The "diffusion mode" toggle is also pretty neat as a visualization, although I'm not sure how accurate it is: it renders as line noise and then refines, while in reality those are presumably tokens drawn from an imprecise vector in some state space that become more precise until only a definite word remains, right?
by JimDabell
1 subcomment
- Pricing:
US$0.000001 per output token ($1/M tokens)
US$0.00000025 per input token ($0.25/M tokens)
https://platform.inceptionlabs.ai/docs#models
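For a rough sense of what that means per request, here's a quick back-of-the-envelope calculation at those rates (the token counts are just made-up example numbers):

```python
input_rate = 0.25 / 1_000_000   # $0.25 per million input tokens
output_rate = 1.00 / 1_000_000  # $1.00 per million output tokens

# hypothetical request: 4,000 prompt tokens in, 1,000 tokens generated
cost = 4_000 * input_rate + 1_000 * output_rate
print(f"${cost:.4f}")  # $0.0020
```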
- Are there any rules for what can be uploaded to arXiv?
This is a marketing page turned into a PDF. I guess who cares, but could someone upload, say, a Facebook Marketplace listing screenshotted into a PDF?
- I am personally very excited about this development. Recently I AI-coded a simple game for a game jam, and half the time was spent waiting for the AI agent to finish its work so I could test it. If instead of waiting 1-2 minutes for every prompt to be executed and implemented I could wait 10 seconds, that would be literally game-changing. I could test 5-10 different versions of the same idea in the time it currently takes to test one.
Of course this model isn't yet advanced enough for that to be feasible, but neither was Claude 3.0 just over a year ago. This will only get better over time, I'm sure. Exciting times ahead of us.
- I think the LLM dev community is underestimating these models. E.g. there is no LLM inference framework that supports them today.
Yes, the diffusion foundation models have higher cross-entropy. But diffusion LLMs can also be post-trained and aligned, which narrows the gap.
IMO, investing in post-training and data is easier than forcing GPU vendors to invest in DRAM to handle large batch sizes and forcing users to figure out how to batch their requests by 100-1000x. It is also purely in the hands of LLM providers.
- Damn, that is fast. But it's faster than I can read, so hopefully they can use that speed and turn it into better output quality, because otherwise I honestly don't see the advantage, in practical terms, over existing LLMs. It's like having a TV with a 200Hz refresh rate when 100Hz is just fine.
by ceroxylon
1 subcomment
- The output is very fast but many steps backwards in all of my personal benchmarks. Great tech, but not usable in production when over 60% of the output is hallucinations.
- Sounds all cool and interesting, however:
> By submitting User Submissions through the Services, you hereby do and shall grant Inception a worldwide, non-exclusive, perpetual, royalty-free, fully paid, sublicensable and transferable license to use, edit, modify, truncate, aggregate, reproduce, distribute, prepare derivative works of, display, perform, and otherwise fully exploit the User Submissions in connection with this site, the Services and our (and our successors’ and assigns’) businesses, including without limitation for promoting and redistributing part or all of this site or the Services (and derivative works thereof) in any media formats and through any media channels (including, without limitation, third party websites and feeds), and including after your termination of your account or the Services. For clarity, Inception may use User Submissions to train artificial intelligence models. (However, we will not train models using submissions from users accessing our Services via OpenRouter.)
- I've been looking at the code on their chat playground, https://chat.inceptionlabs.ai/, and they have a helper function `const convertOpenAIMessages = (convo) => { ... }`, which also contains `models: ['gpt-3.5-turbo']`. I also see `"openai": true` in the API response. Is it actually using OpenAI, or is it calling its own dLLM? Does anyone know?
Also: you can turn on "Diffusion Effect" in the top-right corner, but that just seems to be an animation gimmick, right?
by EigenLord
2 subcomments
- Diffusion is just the logically optimal behavior for searching massively parallel spaces without informed priors. We need to think beyond language modeling, however, and start to view this in terms of drug discovery etc. A good diffusion model + the laws of chemistry could be god-tier. I think language modeling has the AI community in its grip right now, and they aren't seeing the applications of the same techniques to real-world problems elsewhere.
- Is there a kind of nanoGPT for diffusion language models? I would love to understand them better.
by ianbicking
0 subcomments
- For something a little different than a coding task, I tried using it in my game: https://www.playintra.win/ (in settings you can select Mercury, the game uses OpenRouter)
At first it seemed pretty competent and of course very fast, but it seemed to really fall apart as the context got longer. The context in this case is a sequence of events and locations, and it needs to understand how those events are ordered and therefore what the current situation and environment are (though there's also lots of hints in the prompts to keep it focused on the present moment). It's challenging, but lots of smaller models can pull it off.
But this is also a first release and a new architecture. Maybe it just needs more time to bake (GPT-3.5 couldn't do these things either). Though I also imagine it might just perform _differently_ from other LLMs, not really on the same spectrum of performance, and require different prompting.
- If anyone else is curious about the claim "Copilot Arena, where the model currently ranks second on quality":
This seems to be the link; mind-blowing results if that is indeed the case: https://lmarena.ai/leaderboard/copilot
- Is the parameter count published? I'm by no means an expert, but the failure modes remind me of 1B-class Chinese models.
by Alifatisk
4 subcomments
- Love the UI in the playground, it reminds me of Qwen Chat.
We have reached a point where the bottleneck in genAI is not knowledge or accuracy, it is the context window and speed.
Luckily, Google (and Meta?) has pushed the limits of the context window to about 1 million tokens, which is incredible. But I feel like today's options are still stuck at about a ~128k-token window per chat, and after that it starts to forget.
Another issue is the time it takes for inference AND reasoning. dLLMs are an interesting approach to this. I know we have Groq's hardware as well.
I do wonder, can this be combined with Groq's hardware? Would the response be instant then?
How many tokens can each chat handle in the playground? I couldn't find much info about it.
Which model is it using for inference?
Also, is the training the same for dLLMs as for the standard autoregressive LLMs? Or are the weights and models completely different?
- Google has Gemini Diffusion in the works. I joined the beta. Roughly speaking, it "feels" a lot like 2.5 Flash in the style of its interaction and accuracy. But the walls of text appear almost instantaneously; you don't notice any scrolling.
by jonplackett
0 subcomments
- Wow, this thing is really quite smart.
I was expecting really crappy performance but just chatting to it, giving it some puzzles, it feels very smart and gets a lot of things right that a lot of other models don't.
- Company blog post: https://www.inceptionlabs.ai/introducing-mercury-our-general...
News coverage from February: https://techcrunch.com/2025/02/26/inception-emerges-from-ste...
by TechDebtDevin
0 subcomments
- Oddly fast, almost instantaneous.
- Tried it on some coding questions and it hallucinated a lot, but the appearance (i.e. if you’re not a domain expert) of the output is impressive.
- No open model/weights?
by irthomasthomas
1 subcomment
- I've used Mercury quite a bit in my commit message generator. I noticed it would always produce the exact same response if you ran it multiple times, and increasing the temperature didn't affect that. To get some variability I added a $(uuidgen) to the prompt; then I could run it again for a new response if I didn't like the first.
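For anyone who wants to replicate the nonce trick, here's a minimal sketch of the idea in Python; the prompt wording is just an illustrative placeholder, and the nonce plays the same role as $(uuidgen) in the shell version:

```python
import uuid

def build_prompt(diff_text: str) -> str:
    """Prepend a random nonce so otherwise-identical prompts can yield different completions."""
    nonce = uuid.uuid4()  # fresh value on every call, like $(uuidgen)
    return (
        f"(nonce: {nonce})\n"
        "Write a concise git commit message for the following diff:\n\n"
        f"{diff_text}"
    )

# Re-running with the same diff now sends a slightly different prompt,
# so a deterministic model can still be "re-rolled" for a new message.
print(build_prompt("diff --git a/foo.py b/foo.py\n..."))
```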
- This is cool. I think faster models can unlock entirely new usage paradigms, like how faster search enables incremental search.
- It certainly is fast, but I'm curious whether LLMs will ever figure out how bit shifts work...
e.g. from the playground: `static const uint64_t MERSENNE_PRIME = (1ULL << 127) - 1;`, which it insists, in follow-up questions, is the correct way to store a 128-bit integer.
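For reference, that initializer can't be right: 2^127 - 1 needs 127 bits, and in C, shifting a 64-bit `1ULL` by 127 is undefined behavior. A quick sanity check (Python, where integers are arbitrary precision):

```python
p = (1 << 127) - 1      # the Mersenne prime 2**127 - 1
print(p.bit_length())   # 127 -- needs 127 bits
print(p <= 2**64 - 1)   # False: it cannot fit in a uint64_t
# A correct C version would need e.g. GCC/Clang's unsigned __int128,
# or two 64-bit halves, since no standard 64-bit type can hold it.
```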
by ahmedhawas123
0 subcomments
- Reinforcement learning really helped Transformer-based LLMs evolve in terms of quality and reasoning, as we saw when DeepSeek launched. I am curious whether this is the equivalent of an early GPT-4o that has not yet reaped the benefits of the add-on technologies that helped improve quality.
by londons_explore
0 subcomments
- It's a little sad that the "diffusion effect" checkbox is just that - an effect.
It would be neat to show the user all the real intermediate steps.
by thelastbender12
1 subcomment
- The speed here is super impressive! I am curious: are there any qualitative ways in which modeling text with diffusion differs from autoregressive modeling? The kinds of problems it works better on, creativity, and so on.
- I wonder if diffusion LLMs solve the hallucination problem more effectively. In the same way that image models learned to create less absurd images, dLLMs can perhaps learn to create sensible responses more predictably.
by luckystarr
0 subcomments
- I'm kind of impressed by the speed of it. I told it to write an MQTT topic pattern matcher based on a trie and it spat out something reasonable on the first try. It had a few compilation issues though, but fair enough.
- I guess this makes specific language patterns cheaper and more artistic language patterns more expensive. This could be a good way to limit pirated and masqueraded materials submitted by students.
- Code output is verifiable in multiple ways. Combine that with this kind of speed (and far faster in future) and you can brute force your way to a killer app in a few minutes.
- We have used their LLM in our company and it's great!
From accuracy to speed of response generation, this model seems very promising!
- I strongly believe that this will be a really important technique in the near future. The cost savings this might create are mouth-watering.
by baalimago
1 subcomment
- I, for one, am willing to trade accuracy for speed. I'd rather have 10 iterations of poor replies that force me to ask the right question than 1 reply that takes 10 times as long and is _maybe_ good, since it tries to reason about my poor question.
- Can Mercury use tools? I haven't seen it described anywhere. How about streaming with tools?
by beef_rendang
0 subcomments
- What's the use case for diffusion-based text generation models (as opposed to image models)?
by awaymazdacx5
0 subcomments
- Having token embeddings with diffusion models, for 16x16 transformer encoding. Image is tokenized before transformers compile it. If decomposed virtualization modulates according to a diffusion model.
by loaderchips
0 subcomments
- I wonder how fast this would be when run on something like Groq.
by 202282020008
0 subcomments
- hello
- Holy shit that is fast. Try the playground. You need to get that visceral experience to truly appreciate what the future looks like.