I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.
Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute force.
Anyhow, I noticed more of a difference trying Opus 4.5 compared to Sonnet 4.5 than I'd noticed from, for example, the last couple of Sonnet bumps. Objectively, at 1.66x Sonnet's price instead of the old 5x, it's much more often practical to reach for than past Opus models. Anthropic's basic monthly thing also covers a fair amount of futzing with it in CC.
At the other extreme, another surprise of this family is that Haiku 4.5 with reasoning on is usable: better than Sonnet with thinking off according to some benchmarks, and in any case subjectively decent for point edits, single-page thingies, and small tools.
The contrary is easy for anyone to verify for themselves. It's nowhere near 100%, or even 50%, for few-minute tasks even with the best models in real-world situations.
What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.
This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.
The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?
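Concretely, the checkpoint loop I have in mind looks something like the sketch below. It's Python pseudocode more than a real harness: `run_agent_for` is a hypothetical stand-in for whatever agent/CLI you actually drive, and the 30-minute budget is just the number from above. The point is the structure: bounded work, human review, early abort.

```python
# Rough sketch of the checkpoint loop described above, not a real harness.
# run_agent_for() is a hypothetical placeholder for your actual agent tooling.

CHECKPOINT_SECONDS = 30 * 60  # let the AI work for 30 minutes per round


def run_agent_for(task: str, budget_seconds: int) -> str:
    """Hypothetical: run the agent until the budget is spent and return a
    reviewable summary (diff, test output, the agent's own notes)."""
    return f"(placeholder) worked on {task!r} for {budget_seconds}s"


def checkpointed_run(task: str, max_rounds: int = 8) -> None:
    for round_no in range(1, max_rounds + 1):
        summary = run_agent_for(task, CHECKPOINT_SECONDS)
        print(f"--- checkpoint {round_no} ---\n{summary}")
        verdict = input("continue / revise / abort? ").strip().lower()
        if verdict == "abort":
            return                                  # catch drift before it compounds
        if verdict == "revise":
            task = input("updated instructions: ")  # steer the next round
```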
Solutions are quite easy to verify with differential testing and produce a number for direct comparison.
Less code is usually better, and you generally can't "cheat" by adding more cruft, so it nullifies the additive bias. Good optimization requires significant understanding of the underlying structures. Everything has performance tradeoffs, so it requires systemic thinking rather than just stringing independent pieces together.
So far I've found that Gemini 3 Pro was the best at reasoning about tricky SIMD code, but the results with most models were pretty underwhelming.
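For anyone curious what I mean by "produce a number", here's a minimal sketch of that kind of harness, with a toy pair of pure-Python functions standing in for the reference and the optimized/SIMD candidate. Only the structure matters; the real thing would call into native code.

```python
import random
import time


def reference(xs):
    # Trivially correct baseline that the candidate is checked against.
    return [x * x for x in xs]


def candidate(xs):
    # Stand-in for the optimized (e.g. SIMD) implementation under test.
    return [x * x for x in xs]


def differential_test(trials=1000, n=256):
    # Correctness: the candidate must agree with the reference on random inputs.
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        assert candidate(xs) == reference(xs), "mismatch on random input"


def timed(fn, xs):
    start = time.perf_counter()
    fn(xs)
    return time.perf_counter() - start


def speedup(n=1_000_000, reps=5):
    # Performance: one number you can compare directly across attempts/models.
    xs = [random.random() for _ in range(n)]
    t_ref = min(timed(reference, xs) for _ in range(reps))
    t_cand = min(timed(candidate, xs) for _ in range(reps))
    return t_ref / t_cand


if __name__ == "__main__":
    differential_test()
    print(f"speedup: {speedup():.2f}x")
```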
I don't think I have a 50% success rate at month-long tasks.
Anything that exceeds one day is pretty hard.
The Conclusions section is not for making a sales pitch for your article. It is for summarizing any new knowledge the article brings out.
Extrapolating any exponential growth is always dangerous, but over, say, 3 years at this pace we'd go from 2 hours to 70, or about 8 days' work.
Quite scary. But what does cost do over the same timeline? Does it scale with computational complexity? Is it worse, since, IIRC, transformers' computational cost is quadratic in context length? Or better, thanks to some kind of economies of scale?
I glanced through the article but couldn't find any info on this.
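To make the extrapolation above concrete, here's a quick back-of-the-envelope that takes the 2-hour and ~70-hour figures at face value and recovers the implied doubling time (it says nothing about cost):

```python
import math

# Back-of-the-envelope for the extrapolation above: 2 h -> ~70 h over 3 years.
start_hours, end_hours, months = 2, 70, 36

doublings = math.log2(end_hours / start_hours)   # ~5.1 doublings in 3 years
doubling_time = months / doublings               # ~7 months per doubling
print(f"{doublings:.1f} doublings, one every ~{doubling_time:.1f} months")
print(f"{end_hours} h is about {end_hours / 8:.1f} working days at 8 h/day")
```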
It matches my personal feeling from using progressively better models over time.
An infinitely unscrupulous model provider could double this five-hour result by cutting your output tokens/second in half!
This isn't only a question of gaming the metric: the very strong current small-fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this - they will succeed or fail much faster just because they are much faster.
How about something like total output token count as the "long term horizon" metric instead?
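To put some purely made-up numbers on the gaming concern: the same task needs the same output tokens either way, but the measured wall-clock "horizon" doubles when serving speed is halved.

```python
# Purely illustrative numbers: a long task needing a fixed amount of output.
output_tokens = 200_000

for tok_per_sec in (100, 50):   # provider quietly halves throughput
    hours = output_tokens / tok_per_sec / 3600
    print(f"{tok_per_sec:>3} tok/s -> {hours:.2f} h wall-clock, "
          f"{output_tokens} output tokens either way")
```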
And rarely is software one and done; after a few rounds like this, the software architecture would have become schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more tightly limiting what the AI is trying to do as they happen. The success of one "long task" is not necessarily a good thing!
If you fail to break up the task into agent-sized chunks, you're the problem.
If true, how much of this is a result of:
1. Genuine technical advancement
or:
2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?
In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will disappear when the Ponzi scheme eventually collapses?