I think it would easily have taken me 4+ hours to do that. It ran in 15 minutes while I played Kirby Air Riders and worked on the first try.
Afterward, I sort of had to reflect on the fact that I learned essentially nothing about building vector search. I wanted the feature more than I wanted to know how to build the feature. It kept me learning the thing I cared about rather than doing a side quest.
It's pretty amazing to watch tools-in-a-loop crunch away for >4 hours to solve a generally difficult problem through sheer brute force.
Anyhow, I noticed more of a difference trying Opus 4.5 compared to Sonnet 4.5 than I'd noticed from, for example, the last couple of Sonnet bumps. Objectively, at 1.66x Sonnet's price instead of the old 5x, it's much more often practical to reach for than past Opus models. Anthropic's basic monthly thing also covers a fair amount of futzing with it in CC.
At the other extreme, another surprise of this family is that Haiku 4.5 with reasoning on is usable: better than Sonnet with thinking off according to some benchmarks, and in any case subjectively decent for point edits, single-page thingies, and small tools.
The contrary is easy for anyone to verify for themselves. It's nowhere near 100%, or even 50%, for few-minute tasks even with the best models in real-world situations.
What's interesting is the 50% vs 80% reliability gap. At 50% success rate on a 4-hour task, you're essentially gambling. If it fails, you've potentially wasted the 4 hours plus the time debugging why it failed.
This is why I think the current "agent" paradigm needs human checkpoints at regular intervals. Let the AI work for 30 minutes, then review progress. Repeat. This way you catch drift early before it compounds.
The other thing missing from these benchmarks: recovery ability. When the AI gets stuck on hour 3 of a 4-hour task, can it recognize the problem and backtrack? Or does it confidently continue down the wrong path?
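Concretely, the checkpoint loop I have in mind looks something like the sketch below. It's Python pseudocode more than a real harness: `run_agent_for` is a hypothetical stand-in for whatever agent/CLI you actually drive, and the 30-minute budget is just the number from above. The point is the structure: bounded work, human review, early abort.

```python
# Rough sketch of the checkpoint loop described above, not a real harness.
# run_agent_for() is a hypothetical placeholder for your actual agent tooling.

CHECKPOINT_SECONDS = 30 * 60  # let the AI work for 30 minutes per round


def run_agent_for(task: str, budget_seconds: int) -> str:
    """Hypothetical: run the agent until the budget is spent and return a
    reviewable summary (diff, test output, the agent's own notes)."""
    return f"(placeholder) worked on {task!r} for {budget_seconds}s"


def checkpointed_run(task: str, max_rounds: int = 8) -> None:
    for round_no in range(1, max_rounds + 1):
        summary = run_agent_for(task, CHECKPOINT_SECONDS)
        print(f"--- checkpoint {round_no} ---\n{summary}")
        verdict = input("continue / revise / abort? ").strip().lower()
        if verdict == "abort":
            return                                  # catch drift before it compounds
        if verdict == "revise":
            task = input("updated instructions: ")  # steer the next round
```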
Solutions are quite easy to verify with differential testing and produce a number for direct comparison.
Less code is usually better, and you generally can't "cheat" by adding more cruft, so it nullifies the additive bias. Good optimization requires significant understanding of the underlying structures. Everything has performance tradeoffs, so it requires systemic thinking rather than just stringing independent pieces together.
So far I've found that Gemini 3 Pro was the best at reasoning about tricky SIMD code, but the results with most models were pretty underwhelming.
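For anyone curious what I mean by "produce a number", here's a minimal sketch of that kind of harness, with a toy pair of pure-Python functions standing in for the reference and the optimized/SIMD candidate. Only the structure matters; the real thing would call into native code.

```python
import random
import time


def reference(xs):
    # Trivially correct baseline that the candidate is checked against.
    return [x * x for x in xs]


def candidate(xs):
    # Stand-in for the optimized (e.g. SIMD) implementation under test.
    return [x * x for x in xs]


def differential_test(trials=1000, n=256):
    # Correctness: the candidate must agree with the reference on random inputs.
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        assert candidate(xs) == reference(xs), "mismatch on random input"


def timed(fn, xs):
    start = time.perf_counter()
    fn(xs)
    return time.perf_counter() - start


def speedup(n=1_000_000, reps=5):
    # Performance: one number you can compare directly across attempts/models.
    xs = [random.random() for _ in range(n)]
    t_ref = min(timed(reference, xs) for _ in range(reps))
    t_cand = min(timed(candidate, xs) for _ in range(reps))
    return t_ref / t_cand


if __name__ == "__main__":
    differential_test()
    print(f"speedup: {speedup():.2f}x")
```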
I don't think I have a 50% success rate at month-long tasks.
Anything that exceeds one day is pretty hard.
The Conclusions section is not for making a sales pitch for your article. It is for summarizing any new knowledge the article brings out.
Extrapolating any exponential growth is always dangerous, but over, say, 3 years at this pace we'd go from 2 hours to 70, or about 8 days' work.
Quite scary. But what does cost do over the same timeline? Does it scale with computational complexity? Is it worse, since, IIRC, transformers' computational cost is quadratic in context length? Or better, thanks to some kind of economies of scale?
I glanced through the article but couldn't find any info on this.
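To make the extrapolation above concrete, here's a quick back-of-the-envelope that takes the 2-hour and ~70-hour figures at face value and recovers the implied doubling time (it says nothing about cost):

```python
import math

# Back-of-the-envelope for the extrapolation above: 2 h -> ~70 h over 3 years.
start_hours, end_hours, months = 2, 70, 36

doublings = math.log2(end_hours / start_hours)   # ~5.1 doublings in 3 years
doubling_time = months / doublings               # ~7 months per doubling
print(f"{doublings:.1f} doublings, one every ~{doubling_time:.1f} months")
print(f"{end_hours} h is about {end_hours / 8:.1f} working days at 8 h/day")
```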
It matches my personal feeling from using progressively better models over time.
An infinitely unscrupulous model provider could double this five-hour result by cutting your output tokens/second in half!
This isn't only a question of gaming the metric: the very strong current small-fast models (4.5 Haiku, Gemini 3 Flash) have no hope of being measured fairly against this - they will succeed or fail much faster just because they are much faster.
How about something like total output token count as the "long term horizon" metric instead?
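To put some purely made-up numbers on the gaming concern: the same task needs the same output tokens either way, but the measured wall-clock "horizon" doubles when serving speed is halved.

```python
# Purely illustrative numbers: a long task needing a fixed amount of output.
output_tokens = 200_000

for tok_per_sec in (100, 50):   # provider quietly halves throughput
    hours = output_tokens / tok_per_sec / 3600
    print(f"{tok_per_sec:>3} tok/s -> {hours:.2f} h wall-clock, "
          f"{output_tokens} output tokens either way")
```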
And rarely is software one and done; after a few rounds like this, the software architecture would have become schizophrenic. Combating this tendency usually requires throwing away a lot of the work from these "long tasks" and more tightly limiting what the AI is trying to do as they happen. The success of one "long task" is not necessarily a good thing!
If you fail to break up the task into agent-sized chunks, you're the problem.
If true, how much of this is a result of:
1. Genuine technical advancement
or:
2. Shoveling trillions of dollars into compute resources in order to service incoming LLM requests in a way that is completely unrealistic over the long term?
In other words… are we talking about genuine, sustainable innovation that we get to take with us moving forward and benefit from? Or are we talking about an “improvement” that is more akin to a mirage that will disappear when the Ponzi scheme eventually collapses?