Yeah, it's a well-known problem. Every AI company is working on ways to deal with it, one way or another: clever data center design, clever hardware and software engineering, clever algorithmic improvements, and/or clever "agentic recursive LLM" workflows. Anything that actually works is treated like a priceless trade secret, and nothing that could put competitors at a disadvantage will get published any time soon.
There are academics who have been working on it too, most notably Tri Dao and Albert Gu, the key people behind FlashAttention and SSMs like Mamba. There are also lots of ideas out there for compressing the KV cache. No idea if any of them work. I also saw this recently on HN: https://news.ycombinator.com/item?id=46886265 . No idea if it works but the authors are credible. Agentic recursive LLMs look most promising to me right now. See https://arxiv.org/abs/2512.24601 for an intro to them.
Disagree with this: IMO the primary reason these still need to exist is for when the agent messes up (e.g. reads a file that is way too large, like a bundle file), or when a grep command in a large codebase hits far too many files and overloads the context.
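The cheapest mitigation I've found is to clamp tool output before it ever reaches the context. Here's a minimal sketch; the names (`MAX_TOOL_TOKENS`, `clamp_tool_output`) and the 4-chars-per-token heuristic are made-up assumptions for illustration, not anything from the article:

```python
# Minimal sketch: guard oversized tool output (huge files, noisy grep)
# before it lands in the agent's context window.

MAX_TOOL_TOKENS = 4_000   # illustrative per-result budget
CHARS_PER_TOKEN = 4       # crude heuristic; use a real tokenizer if you have one

def clamp_tool_output(text: str, max_tokens: int = MAX_TOOL_TOKENS) -> str:
    """Truncate tool output that would blow the context budget."""
    max_chars = max_tokens * CHARS_PER_TOKEN
    if len(text) <= max_chars:
        return text
    # Keep the head and tail, which is usually where the useful signal is.
    head, tail = text[: max_chars // 2], text[-(max_chars // 2):]
    return f"{head}\n... [truncated {len(text) - max_chars} chars] ...\n{tail}"
```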
Otherwise, lots of interesting stuff in this article! Having a precise calculator was very useful for thinking about how much we should be putting into an agent loop to reach a cost optimum (and not just a performance optimum) for our tasks, which is a question that's been pretty underserved.
You'd think cost would be an easy talking point for getting people to care, but people's starting points are so heterogeneous that it's tough to show them they can take control of this measurement themselves.
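A back-of-the-envelope version of that calculator is easy to build yourself. All the prices and token counts below are placeholders, not real rates from any provider; the point is just that the cost optimum falls out of three numbers you control (iterations, fresh tokens per step, output tokens per step):

```python
# Toy cost model for an agent loop where the growing prefix is served from cache.
# All prices are placeholders; plug in your provider's actual rates.

PRICE_INPUT  = 3.00 / 1_000_000   # $/token, uncached input (placeholder)
PRICE_CACHED = 0.30 / 1_000_000   # $/token, cached input (placeholder)
PRICE_OUTPUT = 15.00 / 1_000_000  # $/token, output (placeholder)

def loop_cost(iterations: int, system_tokens: int,
              new_tokens_per_step: int, output_tokens_per_step: int) -> float:
    """Total cost of an agent loop with the prior context hitting the cache."""
    total = 0.0
    prefix = system_tokens
    for _ in range(iterations):
        total += prefix * PRICE_CACHED                  # previously seen context
        total += new_tokens_per_step * PRICE_INPUT      # fresh tool output / user input
        total += output_tokens_per_step * PRICE_OUTPUT  # model reply
        prefix += new_tokens_per_step + output_tokens_per_step
    return total

for n in (5, 10, 20, 40):
    print(n, round(loop_cost(n, system_tokens=2_000,
                             new_tokens_per_step=1_500,
                             output_tokens_per_step=400), 4))
```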
I say the latter because the article is a snapshot in time: without some recurring measurement of this, aspects of it may change radically depending on the black-box implementations of the integrations they depend on (or even the pricing strategies).
LLMs will eventually have to cross this hurdle before they become our replacements.
Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will go down. In essence, you don't really reprocess the input tokens at inference time. If you own the hardware, it's quite trivial to infer one output token at a time at no additional cost: if you have 50k input tokens and you generate one output token, it's not like you have to "re-infer" the 50k input tokens before you output the second token.
To put it in simple terms, generating the millionth output token takes roughly the same incremental work as generating the first one; you never go back and reprocess the prefix.
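You can see the mechanism concretely with the open-source Hugging Face transformers library and GPT-2 (obviously not what the hosted APIs run, just the same KV-cache idea in miniature): prefill the prompt once, then feed one token at a time and never touch the prompt again.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("a long prompt goes here ...", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: process the whole prompt once and keep the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

    generated = [next_id]
    for _ in range(20):
        # Decode: feed only the single new token; the prompt is never reprocessed.
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```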
This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example by implementing a custom logit_bias mechanism client-side), so I can infer one output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
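For the client-side logit_bias part, the decode loop above just needs its argmax swapped for something like this (the function name and defaults are made up for illustration):

```python
import torch

def pick_token(logits: torch.Tensor,
               logit_bias: dict[int, float] | None = None,
               temperature: float = 0.8) -> int:
    """Client-side token choice: apply a custom bias to the last-position
    logits (shape: vocab), then sample instead of always taking the argmax."""
    logits = logits.clone()
    if logit_bias:
        for token_id, bias in logit_bias.items():
            logits[token_id] += bias
    probs = torch.softmax(logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

You pass the last-position logits from the decode loop into it and feed the chosen id back in as the next single-token input; the KV cache makes each of those steps cheap.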
So bottom line: cached input tokens are naturally close to free (unless you hold them for a long period of time), and the price of cached input on APIs is probably due to the lack of any negotiation over which inputs you actually want cached. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.