GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of running the Artificial Analysis suite: GLM-5.1 completes it at less than half the cost of GPT-5 and is dramatically more capable than both GPT-5 and o3.
So while their analysis is interesting, it points toward the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing), while the lagging 6-12 months of distillation and refinement continue to bring the cost of comparable capabilities down to much more reasonable levels.
Frontier models get hyped for their maximum task horizon, but that's also where they're 10-30x more expensive per hour than their optimal range. You're paying a massive premium for the hardest tasks and still failing half the time.
Honestly the practical takeaway is pretty boring: just break your work into smaller chunks. Not because the models can't handle longer tasks, but because the economics at shorter task lengths are just way better. The labs are racing to push the horizon out; the smart move for anyone actually paying the bills is to stay near the sweet spot and orchestrate from there.
So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.
What's rising exponentially is the price of the most ambitious thing cutting edge agents can do.
But to answer whether the cost of AI agents is rising in general, you would take a fixed set of problems, and for each of them, ask "once it's solvable, how does the price change?"
For that latter question, there isn't a lot of data in these charts because there aren't enough curves for models of the same family over time, but it does look like there are a number of points where newer models solve the same problems at lower prices. Look at GPT5 vs. the older GPT models--the curve for GPT5 is shifted left.
Difference is that the current prices have a lot of subsidies from OPM
Once the narrative changes to something more realistic, I can see prices increasing across the board. I mean, forget $200/month for Codex Pro; expect $1000/month or something similar.
So it's a race between new hardware supply and paradigm shifts hitting the market vs. the tide going out in the financial markets.
Happy to run it on your repos for a free report: hi@repogauge.org
This way, AI work is like a slot machine: will this work or not? Either way the casino gets paid, and the casino always wins.
Nevertheless, if the idea or product is very good (addressing a high-pain market need) and not that difficult to build, it can enable non-coders to "gamble" for the outcome with AI for $.
Sadly, from my experience hiring devs, hiring people is also a gamble...
If they can do a task that takes 1 unit of computation for 1 dollar, they will cost 100 dollars for a 10-unit task and $10,000 for a 100-unit task.
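Those three price points are consistent with cost growing as the square of task size. A minimal sketch of that reading (the quadratic form is an inference from the numbers above, not something METR claims):

```python
def task_cost_dollars(units, base_rate=1.0):
    """Quadratic cost model fit to the comment's numbers:
    1 unit -> $1, 10 units -> $100, 100 units -> $10,000."""
    return base_rate * units ** 2

for units in (1, 10, 100):
    print(f"{units:>3}-unit task -> ${task_cost_dollars(units):,.0f}")
```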
Project costs from Claude Code bear this out in the real world.
- Smaller chunks make review much easier and more effective at finding bugs, as we've known since long before LLMs.
- Greater certainty provides a better development experience. I've heard people talk about how LLM development can be tiring. One way that happens, I think, is the win-or-lose drama of feeding in huge tasks with a substantial chance of failure. I think if you're succeeding 95% of the time instead of 70%, and the 5% are easier to deal with (smaller chunks to debug), it's a better experience.
- Everything is harder about real-world tasks because they aren't clean verifiable-reward benchmarks. Developers have context that models don't, so it's common that a problem traces to a detail not in the spec where the model guessed wrong. For real-world tasks, "failures" are also sometimes "that UI is bad" or "that way of coding it is hard to maintain." And it's possible to have problems the dev simply doesn't notice. The benchmarks' fully computer-checkable outcomes are 'easy mode' compared to the real world.
- Fixing agents' messes becomes more work as task sizes increase. (Like the certainty point, but about cost in hours rather than the experience.) Again, if the model has spat out 1000 lines and stumped itself debugging a failure, it'll take you some time to figure out: more time than debugging a 250-line patch, and the larger patch is more likely to have bugs. And if a bug makes it out to peer review, you can add communication and context-switching costs (point out the bug, fix, re-review) on top of that.
- Bugs that reach prod are really expensive. More of a problem when a prod bug can lose you customers than on, say, most hobby projects. Ord's post gestures at it: there are "cases where failure is much worse than not having tried at all." That magnifies how important it is that the review be good, and how much of a problem bugs that sneak through are, which points towards doing smaller chunks.
How significant each factor is depends on details: how easy the task is to verify, how well-specified it is (and more generally how much it's in the models' wheelhouse, and how much in mine), how bad a bug would be (fun thing? internal tool? user facing? can lose data?).
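The success-rate point can be made concrete with a toy expected-cost calculation (the 70% vs. 95% rates are from the discussion above; the assumption that debug time scales linearly with patch size, and the per-line rate, are mine):

```python
def expected_debug_minutes(n_chunks, total_lines, p_fail, minutes_per_line=0.2):
    """Toy model of expected time spent debugging failed chunks.

    The task is split into n_chunks equal chunks, each failing
    independently with probability p_fail, and debugging a failed
    chunk takes time proportional to its size.  The minutes_per_line
    rate is an illustrative guess, not a measured value.
    """
    lines_per_chunk = total_lines / n_chunks
    return n_chunks * p_fail * lines_per_chunk * minutes_per_line

# One 1000-line patch at 70% success vs. four 250-line chunks at 95% each
big = expected_debug_minutes(1, 1000, 0.30)
small = expected_debug_minutes(4, 1000, 0.05)
print(f"monolithic: ~{big:.0f} min, chunked: ~{small:.0f} min")
```

Under these assumptions the chunked approach cuts expected rework several-fold, before even counting the review and prod-bug costs listed above.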
I think the dynamics above apply across a range of model strengths, but that doesn't mean the changes from, say, Sonnet 3.7 to Opus 4.5 didn't mean anything; the machine getting better at getting the info it needs and checking itself still helps at shorter task lengths. Harness improvements can help too, e.g. keeping models out of the 'too much context, model got silly' zone (which may be less severe than it once was, but I suspect will remain a thing), building better context, and cleaning up code rather than just spitting results out.
Besides taking more of your time up front, involving yourself more also tends to drift towards you making more of the lower-level decisions about how the code will look, which I find double-edged. You have better broad context, and you know what you find maintainable. But the implementer, model or another person, is closer to the code, which helps it make some mid-to-low-level decisions well.
Plan modes and Spec-Kit type things can help with the balance of getting involved but letting the model do its thing. I've liked asking the LLM to ask a lot of questions and surface doubts. A colleague messed with Spec-Kit so it would pick one change on its fine-grained to-do list at a time, which is a neat hack I'd like to try sometime.
The first model to outcompete its competitors while using less compute would be purchased more than anything else.
It's true that at a given point in time the cost to achieve a certain task follows an exponential curve against the time taken by a human. But... so what?