FRESH

Hacker News

Effective harnesses for long-running agents

112 points by diwank

by roughly

4 subcomments

One of the things that makes it very difficult to have reasonable conversations about what you can do with LLMs is the effort-to-outcome curve is basically exponential - with almost no effort, you can get 70% of the way there. This looks amazing, and so people (mostly executives) look at this and think, “this changes everything!”
The problem is the remaining 30% - the next 10-20% starts to require things like multi-agent judge setups, external memory, context management, and that gets you to something that’s probably working but you sure shouldn’t ship to production. As to the last 10% - I’ve seen agentic workflows with hundreds of different agents, multiple models, and fantastically complex evaluation frameworks to try to reduce the error rates past the ~10% mark. By a certain point, the amount of infrastructure and LLM calls are running into several hundred dollars per run, and you’re still not getting guaranteed reliable output.
If you know what you’re doing and you know where to fit the LLMs (they’re genuinely the best system we’ve ever devised for interpreting and categorizing unstructured human input), they can be immensely useful, but they sing a siren song of simplicity that will lure you to your doom if you believe it.

by _boffin_

4 subcomments

0 subcomment

by rancar2

0 subcomment

Having done this for Gemini CLI to get it to behave well several months ago to have a non-coding LLM CLI without costs, I can attest that these tips work well across CLIs.

by daxfohl

0 subcomment

IME a dedicated testing / QA agent sounds nice but doesn't work, for same reasons as AI / human interaction. The more you try to diverge from the original dev agent's approach, the less and less chance there is that the dev agent will get to where you want it to be. Far more frequently it'll get stuck in a loop between two options that are both not what you want.
So adding a QA agent, while it sounds logical, just ends up being even more of this. Rather than converging on a solution, they just get all out of whack. Until that is solved, far better just to have your dev agent be smart about doing its own QA.
The only way I could see the QA agent idea working now is if it had the power to roll back the entire change, reset the dev agent, update the task with some hints of things not to overlook, and trigger the dev process from scratch. But that seems pretty inefficient, and IDK if it would work any better.

by awayto

0 subcomment

> Run pwd to see the directory you’re working in. You’ll only be able to edit files in this directory.
If you're using the agent to produce any kind of code that has access to manipulate the filesystem, may as well have it understand its own abilities as having the entirety of CRUD, not just updates. I could easily see the agent talking itself into working around "only be able to edit" with its other knowledge that it can just write a script to do whatever it wants. This also reinforces to devs that they basically shouldn't trust the agent when it comes to the filesystem.
As for pwd for existing projects, I start each session running tree local to the part of the project filesystem I want to have worked on.

by ford

0 subcomment

I think we take git for granted as software engineers Software engineering has decades of experience with proposing changes, merging them, staging them, deploying them, and rolling them back, and collaborating with other code-writers (engineers and agents).
I'm very interested in what this will look like for outputs from other job functions. And if we'll end up with a similar framework that makes non-deterministic, often-wrong LLMs easier to work with.

by CurleighBraces

1 subcomments

I wonder how good these agents would be using something like cucumber and behaviour driven development tools?

by johndhi

2 subcomments

Why do we need long running agents? Most of my experienced value with LLMs has been like 1 to 10 turn chats. Should they just ban longer chats to solve these issues?

by dangoodmanUT

0 subcomment

> … the model is less likely to inappropriately change or overwrite JSON files compared to Markdown files.
Very interesting.

by slurrpurr

0 subcomment