Hacker News

VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

386 points by timhigins

by secretslol

12 subcomments

Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.

by rbbydotdev

4 subcomments

Looks like we are seeing small but mighty model breakthroughs, outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not sufficient to capture success in real developer workflows?

by deftio

14 subcomments

There is some base level of intelligence any model needs to be useful, even in narrow tasks.
Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge...
Small models need to have enough base knowledge to be able to be good enough -- even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model but there is some base level which is probably more than it would first seem.

by gslepak

2 subcomments

Note that these are Python-only results; the model will not do as well with other languages.
I'm glad to see more domain-focused SLMs. We need more of them! A programming-focused MoE should work well across many languages.

by NotSuspicious

1 subcomments

The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).

by noperator

2 subcomments

Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
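For anyone wondering what a harness-side workaround for weak structured output might look like, here is a minimal sketch. It is not noperator's actual approach, which isn't described above. It assumes the model wraps a JSON object in free-form reasoning text, and the function name `extract_first_json_object` is made up for illustration.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trailing text follows it.
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON here; try the next opening brace
        start = text.find("{", start + 1)
    return None


# Hypothetical model reply: reasoning text around the finding.
reply = 'Let me check the loop bounds... {"severity": "high", "line": 42} Done.'
finding = extract_first_json_object(reply)
```

If nothing parses, the harness can re-prompt or fall back to treating the reply as plain text.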

by darkoob12

1 subcomments

I still cannot trust evaluations and benchmarks. How can you prove that the test datasets are truly unseen examples?
I think the only way to prove that these models are truly as good as they claim is to wait and see if they are getting adopted in practice.

by aero2146

4 subcomments

I tried generating the classic pelican svg, but it failed horribly just showing me a rectangle and a black circle...

by mvitorino

0 subcomment

Really enjoying seeing these really capable SLMs. Note that on HF they state: "This model was not trained on tool-calling or agent-based programming data. We therefore do not recommend using it for tasks that involve function calling, API orchestration, or autonomous coding agents." - https://huggingface.co/WeiboAI/VibeThinker-3B So we can't just hook it up to a coding harness like pi.dev or something.

by achrono

4 subcomments

Beats Opus 4.5 on reasoning you say?
Prompt: If A goes to B who then goes to C, can A send something to C?
Response:
We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships.
Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists.
[Lots of other unnecessary commentary and "scenarios" that make even less sense]

by troglodytetrain

0 subcomment

Sounds like something that could be pretty useful as a 'validation' subagent. Provide it the details/context related to a larger LLM's run or turn in a harness and have it act as a gatekeeper. At this size and speed it looks like it could be economical to have it run every turn or even every tool call and inform the main agent about the result and success/failure.
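A rough sketch of that gatekeeper idea, assuming the small model is reachable through some `ask_validator` callable that takes a prompt and returns text. That callable, the `Verdict` type, and the PASS/FAIL reply convention are all hypothetical choices for illustration, not anything the model card prescribes.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    ok: bool
    reason: str


def gatekeep(step_summary: str, ask_validator: Callable[[str], str]) -> Verdict:
    """Ask a small validator model to pass or fail one agent step."""
    prompt = (
        "Reply with PASS or FAIL on the first line, "
        "then one sentence of reasoning.\n\n" + step_summary
    )
    reply = ask_validator(prompt).strip()
    first_line, _, rest = reply.partition("\n")
    # Fail closed: anything that does not clearly start with PASS is a failure.
    ok = first_line.strip().upper().startswith("PASS")
    return Verdict(ok=ok, reason=rest.strip() or first_line.strip())


# Stub standing in for a real call to a locally served model.
verdict = gatekeep("Tool call: wrote 3 files, tests green.", lambda p: "PASS\nLooks fine")
```

Failing closed on unparseable replies seems like the safer default for a model that is known to be shaky on structured output.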

by sorenjan

2 subcomments

How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, LiveCodeBench, and it seems to be problem descriptions with sample input and output, and then a solution with a single function or class.
Seems like a really good model to use in an IDE when you still want control over the code structure then.

by yousif_123123

2 subcomments

I really hope that in a couple of years I can have a laptop that runs a reasonably good coding agent locally, that I can run fast and do most of my programming with, without running my laptop hot. I could keep open code and use other models when needed, but really for most of my work, I'm already breaking it down so that I can review code changes eventually, and I just need something reasonably decent and fast and unlimited. I think it's coming.

by SwellJoe

1 subcomments

It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet).
https://swelljoe.com/post/will-it-mythos/

by androiddrew

4 subcomments

I have been thinking about how to use this. Since it doesn’t support tool calling, I have been considering a dual-model deployment, where a small tool-calling LLM drives the majority of the user experience, and VibeThinker is tapped for reasoning by the other LLM.
So who has suggestions on small models with excellent tool calling capabilities?
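One way the dual-model setup could be wired, as a sketch only: the reasoning model is exposed to the tool-calling driver as just another tool. The tool name `deep_reason`, the `ask_reasoner` callable, and the shape of the `tool_call` dict are assumptions for illustration and not tied to any particular serving framework.

```python
from typing import Callable, Dict

Tool = Callable[[dict], str]


def build_tools(ask_reasoner: Callable[[str], str]) -> Dict[str, Tool]:
    """Expose the reasoning model as one tool among others."""
    return {
        "deep_reason": lambda args: ask_reasoner(args["question"]),
        "echo": lambda args: args["text"],  # placeholder for a real tool
    }


def dispatch(tool_call: dict, tools: Dict[str, Tool]) -> str:
    """Run the tool the driver model asked for, or report an unknown tool."""
    name = tool_call.get("name")
    if name not in tools:
        return f"error: unknown tool {name!r}"
    return tools[name](tool_call.get("arguments", {}))


# Stub reasoner standing in for a call to the small reasoning model.
tools = build_tools(lambda question: "answer to " + question)
result = dispatch({"name": "deep_reason", "arguments": {"question": "2+2"}}, tools)
```

The driver model only has to emit a well-formed tool call; the reasoning model never sees the tool schema at all.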

by virajk_31

0 subcomment

An SLM trained for a single use case often beats an LLM. That's both its advantage and its limitation.

by brainless

1 subcomments

I recently came across this model and I would love to try it with my coding agent soon.
I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way.
I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s).
https://github.com/brainless/nocodo

by nolist_policy

0 subcomment

Notable:

  VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model.

Qwen2.5 is ancient by LLM standards.

by iamgopal

0 subcomment

Two models: one optimised for systems, reasoning, etc., the second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Will it work?

by delis-thumbs-7e

0 subcomment

I gave this a run on llama.cpp locally. My GPU is a GTX 1080, so I needed a quantized version even for such a small model and…
This. Is. Amazing. I am flabbergasted.
I am not into the whole GenAI thing and I have very little need for anything agentic, but Python, C++ and Maths are exactly what I mostly use these for, so this might actually become my main workhorse. This is so cool.
I even used it for stuff it is not built for, asking complex questions on history (Battle of Tours 732) and literature (Joyce’s Portrait of the Artist), and it was surprisingly good, even though it started to hallucinate names and details (such as claiming Joyce’s father was a priest). For 3B I expected it to mainly spout complete nonsense.

by nickalaso

0 subcomment

So I went ahead and quickly vibecoded a working harness with a barebones tool interface and some constraints on output (credit to noperator for the idea). github: https://github.com/NickalasLight/VibeHarness.git
It's meant for a Windows machine using ollama, but I'm sure anyone who wants to mess with it can point claude code at it to convert it for their own operating system and requirements. After install you can ask it to do something with "vibe 'create me a poem about cheese in cheese.txt'". Its workspace is by default the directory the CLI was located in when you called it.

by tracerbulletx

0 subcomment

Man just need something like this with tool calling.

by andai

1 subcomments

I tried actually talking to it. It reminded me of GPT-2.

by jpcompartir

0 subcomment

The absolute worst name for a model I've seen

by cold_harbor

1 subcomments

GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
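To make "relative to each other within a group" concrete, here is a minimal sketch of the group-normalized advantage as GRPO is commonly described: each sampled answer's reward is compared with the mean and standard deviation of its own group, so no learned value network is needed as a baseline. This shows the advantage step only, not the policy update, and it is not necessarily the exact variant VibeThinker uses. The function name and the `eps` term are illustrative.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std dev."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if a verifier accepts them.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Accepted answers end up with positive advantage and rejected ones with negative advantage. If the whole group scores the same, every advantage is zero and that prompt contributes no learning signal.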

by uberex

3 subcomments

What is the idiot's guide to running this one locally now?

by makethembroke

0 subcomment

I don't get how this beats Opus. It just hardcoded the tasks for the benchmark. It doesn't even respond normally.
A lot of randomness in it.
Please don't hype.

by unfirehose

0 subcomment

This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.

by diimdeep

0 subcomment

BF16 with no QAT quants == half-baked bread

by scotty79

0 subcomment

If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.

by anonyfox

0 subcomment

Wake me up when it does OCaml fine.

by 4gotunameagain

0 subcomment

What are the implications of local SOTA inference, given the insane datacenter "investing" ?
Surely it cannot be justified by training alone at this scale, especially since models nowadays are improved more and more by fine-tuning than by retraining from scratch.
Will a viable local model crash the US economy ?
More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to prolong the inevitable ? Probably not, but I wouldn't be surprised if that is the case.

by maxignol

0 subcomment

A 3B-param model on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.

by zkmon

2 subcomments

Does python coding depend on political facts of the world?
It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans?
There is a butterfly effect. Everything affects everything to some extent.

by kmchandy

0 subcomment

The paper makes a clear claim: "it provides an important and concrete proof: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models". And that's exciting.