Cursor's benchmark finds that Cursor's model (Composer 2.5) is basically as good as Opus 4.8 max and GPT-5.5 xhigh, but at a fraction of the price.
Artificial Analysis' testing shows Composer 2.5 to be pretty far behind: https://artificialanalysis.ai/agents/coding-agents. You look at the DeepSWE benchmark (which is probably the hardest to game at this point) and GPT-5.5 xhigh gets a 64, Opus 4.8 max gets 56, and Cursor 2.5 gets 16.
I don't doubt that Cursor works well for some people. It's beating DeepSeek v4 Pro in the DeepSWE benchmark and that's a very capable model. But I'm skeptical of the claims that it's a competitor for Opus 4.8 and GPT-5.5. It just seems convenient that their model does so well on their own benchmark while third party benchmarks have it far behind. Maybe it's a really great benchmark and a better measure than third party ones - I'd love for a cheap model to do as well as the expensive ones.
Putting that aside, I spend all day every day implementing very, very hard things right on the edge of what agents are (barely, sometimes) capable of, and I have had to keep Opus on max for things that need 'real validation' for a while now. And that has felt like 'the only way' to get Opus to perform even close to 5.5 xhigh. I'm only using Opus at all because GPT-5.5 in the subscriptions only has a small (400k, but 258k effective) context window.
The difference is that 5.5 xhigh is extremely fast in most practical cases, both efficiently implementing _overall_, and responding very quickly with great adaptive thinking if you ask it something that it doesn't have to think about. Opus 4.8 Max will needlessly chew on everything and can take hours to implement even simple things, so I can mostly only use it for planning/review.
Fable is much much better at adaptive thinking / responding quickly (although probably still worse than 5.5 xhigh), and... I think folks have said enough elsewhere about its strengths and weaknesses. Sadly still not a reliable implementor for my hard tasks though (that's still GPT's domain) – it tends to leave big, dangerous holes hiding inside implementations unless babied.
Although, the benchies are always tricksy ... On DeepSWE, GPT-5.5 beats Opus-4.8, by a fair margin, but on FrontierCode, the situation is the other way around.
The only benchmark you can trust is your actual workload!
Fable is using less tokens to achive that same tasks compared to sonet and opus. If so that is a good thing. It feels like we for a while there was spitting out tokens to get a better result. If the model themselves are getting better without generating more tokens that feels like a real win.
Q1: Why is number of steps relevant in this graph? What does it tell us?
Q2: and why have they flipped the horizontal graph so that 0 is to the right and not at origo? Is that some kind of new smart thing? can't say i have seen it before
The other models however are reasonably where I’d expect them to be from experience piloting all of them. Fable is outclassing everything at most things at 10x the cost, but sometimes it isn’t a choice between cheap and expensive, but expensive and possible; I’ll need to learn where that boundary is just as it was the case with other models.
And given that you can only use Composer with a Cursor monthly subscription, cost comparisons are pointless since an equivalently priced OpenAI subscription gets you just as much usage of the better model.
Can we get a count of people that have had Claude read irrelevant documents or perform unnecessary web searches even when told not to from the beginning?
I'm starting to wonder if this increased token usage is inadvertently bleeding into how Anthropic actually trains their model, especially leading up to IPO. As older models are deprecated and users are forced onto newer models, if the default is less efficient and more token expensive that directly results in higher "profit" for Anthropic in terms of the consumption their users have to tolerate - lest they jump to a competitor.