- “Strong closed-weight coding agents like Devstral Small 2 are an important point of comparison.”
Devstral Small 2 is an open-weights model: https://huggingface.co/mistralai/Devstral-Small-2-24B-Instru...
- AFAIK gpt-oss-20b on high reasoning has a SWE-bench score of just over 60. It is smaller than all comparable models. Maybe I am missing something, but it is still state of the art all the way up to 50B parameters, compared against every model released since.
At least the https://huggingface.co/facebook/cwm team had the guts to compare against it directly (sort of; see TTS).
What does this model do that gpt-oss-20b does not? AFAIU the base model it was finetuned from is not reproducible, and if I flipped a single bit in gpt-oss-20b and told you how (instructions under MIT), that would satisfy the "fully open finetuning" they claim as an advantage. But that "open" fine-tuned gpt-oss-20b would probably still beat their model.
Am I missing something?
by nickandbro
0 subcomments
- Great work! Really respect AI2; they open-source everything: the model, the weights, the training pipeline, the inference stack, and the corpus.
- Claims in the article are incorrect. They conveniently ignore Meta's CWM models, which are open-source [1] and open-weight [2], score 65% on SWE-bench Verified (with TTS) and 54% pass@1, and are the same size (32B dense). So claims like "surpassing prior open-source state-of-the-art coding models of comparable sizes and context lengths", while conveniently leaving the previous OSS SOTA out of your eval tables, are ... sketchy.
[1] https://github.com/facebookresearch/cwm
[2] https://huggingface.co/facebook/cwm
by d4rkp4ttern
3 subcomments
- An interesting shift I’ve seen over the past few weeks is that we’re starting to refer to bare LLMs themselves as “agents”.
Used to be that agent = LLM + scaffold/harness/loop/whatever.
by hogehoge51
1 subcomment
- What's the practical benefit of fine-tune training on a local repo, vs putting a summary of the local information in the context? I.e., every team has its own style and preferences for coding patterns that could be generalized, but I imagine a large-scale model has seen them all, so they could be described in the context. Or are there domain-level patterns, never seen outside an org, that are difficult for a model to infer without fresh tuning?
by jauntywundrkind
1 subcomment
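The context-injection alternative the question above describes can be sketched in a few lines. Everything here is hypothetical (the function name, the section header, the character budget); it just illustrates prepending a team's conventions summary to the prompt instead of tuning them into the weights:

```python
def build_agent_prompt(base_prompt: str, conventions: str,
                       max_conv_chars: int = 4000) -> str:
    """Prepend a team's coding-convention summary to the system prompt
    instead of fine-tuning those conventions into the weights."""
    # Clip the conventions so they cannot crowd out the task itself.
    clipped = conventions[:max_conv_chars]
    return (
        f"{base_prompt}\n\n"
        "## Project conventions (follow these over general style)\n"
        f"{clipped}"
    )
```

The trade-off the commenter asks about then reduces to: anything expressible in a few thousand characters fits in context; patterns too diffuse or too private to summarize are the case for tuning.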
- Awesome stuff. Output speed looks crazy fast too.
I wonder if this will indeed start prompting more language-specific work.
Afaik training still requires not just looking at sample code but also being able to write loss functions and to pose problems the AI can work at. That seems hard.
One random thought: are there training setups that just delete some code from "good" projects and then make the AI make it work again?
by ripped_britches
0 subcomment
- One claim in the article is definitely very wrong, or at least needs to be narrowed. Claude's is the only closed agent harness, and there are about two dozen open ones. Many models may be closed, but when people say "agent" they are generally referring to the harness, not the underlying model.
by Imustaskforhelp
0 subcomment
- Hey, this looks great! Is it available on OpenRouter?
I wish AI2 would release a denser model than the 8B on OpenRouter for free; I was using the Devstral model for agentic purposes.
If we can get a good agentic ~32B model on OpenRouter for ~free, I feel like it will be very interesting to see how things go, imo.
Good luck, AI2! The premise of truly open-source models is really interesting, and I feel like it could bring more innovation to the space imo!
- Note that this is also a super interesting technique for specialising consumer facing apps like Lovable that need to generate code that matches your API very well.
It's also a great approach for building custom languages.
by mirekrusin
0 subcomment
- For low-cost tuning, wouldn't something like LoRA via e.g. unsloth on e.g. GLM-4.7-Flash be the way to go?
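Why LoRA is the cheap option: instead of updating a frozen weight matrix W directly, it trains a rank-r factorization, with the effective weight W + (alpha/r)·B·A. A toy numpy sketch of the arithmetic (this is the general LoRA formulation, not unsloth's actual API; dimensions are illustrative):

```python
import numpy as np

def apply_lora(W, A, B, alpha=16):
    """Effective weight after a LoRA update: W + (alpha/r) * B @ A."""
    r = A.shape[0]  # adapter rank
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 256, 256, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))  # B starts at zero, so training starts exactly from W

full_params = d_in * d_out        # parameters touched by full fine-tuning: 65,536
lora_params = r * (d_in + d_out)  # trainable adapter parameters: 4,096
```

Only A and B get gradients, so optimizer state and checkpoints shrink by the same ratio, which is what makes single-GPU tuning of 20B+ models practical.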
- it's great to see this kind of progress in reproducible weights, but color me confused: this claims to be better and smaller than Devstral-Small-2-24B, while clocking in at 32B (larger) and scoring worse?
- So this "open" system still requires you to use Claude to actually use it?
by asyncadventure
1 subcomment