If I start out with a "spec" that tells AI what I want, it can create working software for me. Seems great. But say some weeks, months, or even years later I realize I need to change my spec a bit. I would like to give the new spec to the AI and have it produce an improved version of "my" software. But there seems to be no way to evaluate how much, and where, the solution has changed or improved because of the changed spec. Because AI's outputs are nondeterministic, the new solution might be totally different from the previous one. So AI would not seem to support "iterative development" in this sense, does it?
My question then really is, why can't there be an LLM that would always give the exact same output for the exact same input? I could then still explore multiple answers by changing my input incrementally. It just seems to me that a small change in inputs/specs should only produce a small change in outputs. Does any current LLM support this way of working?
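FWIW, exact-repeatability is largely available today: greedy decoding (or a fixed sampling seed) on a local model gives the same output for the same input, modulo hardware and library versions; hosted APIs expose temperature=0 and sometimes a seed, but only as best-effort. A minimal sketch with a local model (the model name is just an example):

```python
# Greedy decoding with a local model is repeatable: the same prompt yields the
# same output on the same hardware/software stack. Model name is only an example.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any local causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False => greedy decoding: no randomness in token selection,
# so repeated runs produce identical text.
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The catch is the second assumption: even with fully deterministic decoding, a small change to the prompt can still flip the output to something very different, so determinism alone doesn't give you "small spec change, small code change."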
Why always start with an LLM to solve problems? Using an LLM adds a judgment call, and (at least for now) those judgment calls are not reliable. For something like the motivating example in this article, "is this PR approved", it seems straightforward to get the deterministic right answer using the GitHub API without muddying the waters with an LLM.
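For example, something along these lines against the GitHub reviews endpoint (repo/PR values and the token env var are placeholders):

```python
# Deterministic path: ask the GitHub REST API directly whether a PR has an approval.
import os
import requests

OWNER, REPO, PR_NUMBER = "acme", "widgets", 1234  # placeholders

resp = requests.get(
    f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}/reviews",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    timeout=10,
)
resp.raise_for_status()

# Simple version: approved if any submitted review has state APPROVED.
# (Dismissed reviews and branch-protection rules are ignored here.)
approved = any(review["state"] == "APPROVED" for review in resp.json())
print("approved" if approved else "not approved")
```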
The key insight from production: LLMs excel at the "what should I do next given this unexpected state" decisions, but they're terrible at the mechanical execution. An agent that encounters a CAPTCHA, an OAuth redirect, or an anti-bot challenge needs judgment to adapt. But once it knows what to do, you want deterministic execution.
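Concretely, one way to draw that line (all names here are my own sketch, not any particular framework): the model only picks the next action from a fixed menu, and ordinary code executes it.

```python
# Sketch of the split: the LLM names the next action from a fixed menu;
# deterministic code owns execution. `llm` is any callable prompt -> text.
import json

def retry_with_backoff(state):
    return {"action": "retry", "delays_s": [1, 5, 30], **state}

def escalate_to_human(state):
    return {"action": "escalate", "queue": "agent-escalations", **state}

ACTIONS = {
    "retry_with_backoff": retry_with_backoff,
    "escalate_to_human": escalate_to_human,
}

def next_action(llm, state: dict) -> str:
    """Judgment call: the model picks exactly one allowed action name."""
    prompt = (
        "Unexpected state below. Reply with exactly one of: "
        + ", ".join(ACTIONS)
        + "\n\n" + json.dumps(state, indent=2)
    )
    choice = llm(prompt).strip()
    if choice not in ACTIONS:
        raise ValueError(f"model proposed unknown action {choice!r}")
    return choice

def handle(llm, state: dict):
    """Mechanical execution: plain, testable code runs the chosen action."""
    return ACTIONS[next_action(llm, state)](state)
```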
The evals discussion is critical. We found that unit-test-style evals don't capture the real failure modes: agents fail at composition, not at individual steps. Testing "does it correctly identify a PR link" misses "does it correctly handle the 47th message in a channel where someone pasted a broken link in a code block". Trajectory-level evals against real edge cases matter more than step-level correctness.
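A trajectory-level eval can be as simple as replaying whole recorded scenarios and grading only the end state. Rough sketch (the `run_agent` callable and the fixture format are made up for illustration):

```python
# Replay a whole recorded scenario -- full channel history, messy edge cases
# and all -- through the agent, and grade only the outcome.
import json
import pathlib

def eval_trajectory(run_agent, fixture_path: str) -> bool:
    case = json.loads(pathlib.Path(fixture_path).read_text())
    # `messages` is the full transcript the agent has to survive end to end.
    final_state = run_agent(case["messages"])
    # Grade the outcome, not the intermediate steps.
    return final_state.get("pr_identified") == case["expected"]["pr_identified"]

def run_suite(run_agent, fixture_dir: str = "evals/trajectories") -> float:
    fixtures = sorted(pathlib.Path(fixture_dir).glob("*.json"))
    passed = sum(eval_trajectory(run_agent, str(p)) for p in fixtures)
    return passed / max(len(fixtures), 1)
```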
You get the benefit of AI CodeGen along with the determinism of conventional logic.
In mapping out the problems that need to be solved with internal workflows, it’s wise to clarify upfront where probabilistic judgment is helpful or required and where it isn’t. If the process is fixed and requires determinism, why not just write scripts (code-gen’ed, of course)?
So we gave the Tasklet agent a filesystem, a shell, a code runtime, a general-purpose triggering system, etc., so that it could build the automation system it needed.