This is emphatically not fundamental to LLMs! Yes, the next token is selected randomly; but "randomly" could mean "chosen using an RNG with a fixed seed." Indeed, many APIs used to support a "temperature" parameter that, when set to 0, would result in fully deterministic output. These parameters were slowly removed or made non-functional, though, and the reason has never been entirely clear to me. My current guess is that it is some combination of A) 99% of users don't care, B) perfect determinism would require not just a seeded RNG, but also fixing a bunch of data races that are currently benign, and C) deterministic output might be exploitable in undesirable ways, or lead to bad PR somehow.
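To make the "RNG with a fixed seed" point concrete, here is a minimal sketch of token sampling over a toy distribution. The logits and the function names `sample_token` and `generate` are made up for illustration, and this ignores the data-race and batching issues mentioned in (B), which a real inference stack would also have to deal with.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick an index from `logits`; temperature 0 means greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(x - top) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(seed, steps=10):
    """Sample `steps` tokens from a fixed toy distribution using a seeded RNG."""
    rng = random.Random(seed)
    logits = [2.0, 1.5, 0.3, -1.0]  # hypothetical scores, not from any real model
    return [sample_token(logits, 0.8, rng) for _ in range(steps)]
```

Calling `generate(42)` twice returns the same list both times, even though each token is still "selected randomly".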
1. Same input = same output. This can be called determinism, and it's technically rather trivial to achieve in the lifetime of a single model snapshot - it's just a matter of business need, because you pay extra for worse batching. It's harder if you need to extend the guarantee into the future, as you need to keep the snapshot and inference method the same. It's also a relatively niche thing, only required for build reproducibility, supply chain security, this kind of stuff.
2. Zero error rate with arbitrary inputs and outputs. This is not determinism and it's also NOT achievable in any model at all because the domain LLMs (and humans!) operate in is fundamentally ambiguous. If you want to enforce the formal rules, verify your inputs and outputs formally! Trying to solve it purely with intelligence (human or machine) is a fool's errand. You can keep the error rate low enough, but you can't guarantee the absence of errors due to the nature of intelligence.
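As a rough illustration of "verify your inputs and outputs formally", the sketch below treats model output as untrusted text and checks it against explicit rules before anything downstream uses it. The order format, the field names, and `validate_order` are all hypothetical; the point is only that the check is ordinary deterministic code.

```python
import json


def _is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def validate_order(raw: str) -> dict:
    """Parse model output and enforce the rules; raise ValueError on any violation."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("top level must be an object")
    items = data.get("items")
    if not isinstance(items, list) or not items:
        raise ValueError("items must be a non-empty list")
    expected_total = 0
    for item in items:
        if not isinstance(item, dict):
            raise ValueError("each item must be an object")
        qty, price = item.get("qty"), item.get("price_cents")
        if not _is_positive_int(qty) or not _is_positive_int(price):
            raise ValueError("qty and price_cents must be positive integers")
        expected_total += qty * price
    if data.get("total_cents") != expected_total:
        raise ValueError("total_cents does not match the line items")
    return data
```

This does not make the model any less error-prone. It just means an error becomes a rejected output instead of a silent one.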
I'm also reminded of the old software called Formulize, which could take in a set of arbitrary data and find a function that described it. http://nutonian.wikidot.com/
Obviously this won't work if your tools are not deterministic, but reproducible builds is a well-trodden discipline.
I'm finding code falls into two categories: code that produces known results, and code that produces results that are not known. An example of the first is a table with a pagination component, backed by a query that loads the first 30 rows ordered by date descending on page 1 and the next 30 rows on page 2. We know what the code is supposed to output, and we know what the result looks like. On the other hand, there is code that does statistical analysis on those 30 rows of data. This is different because we don't know what the result is.
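For the "known result" case, the pagination rule can be written down and checked mechanically. A minimal sketch using an in-memory SQLite database follows; the `events` table, its columns, and the helper names are assumptions for illustration, not anyone's actual schema.

```python
import sqlite3
from datetime import date, timedelta

PAGE_SIZE = 30


def make_demo_db(n=75):
    """Create an in-memory table with n rows, one per consecutive day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL)"
    )
    start = date(2024, 1, 1)
    rows = [(i, (start + timedelta(days=i)).isoformat()) for i in range(1, n + 1)]
    conn.executemany("INSERT INTO events (id, created_at) VALUES (?, ?)", rows)
    return conn


def fetch_page(conn, page, page_size=PAGE_SIZE):
    """Return one page of (id, created_at) rows, newest first; pages start at 1."""
    if page < 1:
        raise ValueError("page must be >= 1")
    offset = (page - 1) * page_size
    cur = conn.execute(
        "SELECT id, created_at FROM events "
        "ORDER BY created_at DESC, id DESC LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return cur.fetchall()
```

Because the expected rows for each page are known in advance, a check can compare what was rendered against `fetch_page(conn, 1)` and `fetch_page(conn, 2)` directly.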
The known-result code is easy to use an LLM with. I have a skill that iterates in an OODA-style loop: observe, act, and validate. In the validate step it takes screenshots and, even without being told to, queries the database from the CLI and compares the rendered row data to the database data. More surprisingly, it also makes sure that all the components are responsive and render beautifully on mobile. I'm orders of magnitude past linting here, which is solved with Biome.
The statistical analysis is different. The only way I can be sure of the result is by painstakingly writing the code by hand. The LLM will always produce specious lies: it will fabricate and show me what I want to see, not the truth. That's because until the code is written by hand, there is no ground truth. In this case, there is no code checking code.
So can't you just save the conversation transcript and replay it with the tools? Seems a lot more efficient than regenerating the whole thing. And, also, no risk of branching when a tool reply is slightly different. (Of course, errors can occur on subsequent runs.)
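One way to picture the save-and-replay idea is a small recorder that stores each tool result keyed by the tool name and arguments, and serves the stored result on later runs. This is only a sketch: `ToolRecorder` and `word_count` are hypothetical names, and it assumes tool arguments and results are JSON-serializable.

```python
import hashlib
import json


class ToolRecorder:
    """Run tools live the first time, then replay recorded results."""

    def __init__(self, transcript=None):
        self.transcript = dict(transcript or {})
        self.live_calls = 0  # how many calls actually hit the real tool

    @staticmethod
    def _key(name, args):
        # Stable key: same tool and same arguments always hash the same way.
        payload = json.dumps({"tool": name, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, name, func, **args):
        key = self._key(name, args)
        if key not in self.transcript:
            self.transcript[key] = func(**args)
            self.live_calls += 1
        return self.transcript[key]


def word_count(text):
    """Stand-in for a real tool."""
    return len(text.split())


recorder = ToolRecorder()
recorder.call("word_count", word_count, text="save and replay")  # live call
replayed = ToolRecorder(recorder.transcript)
replayed.call("word_count", word_count, text="save and replay")  # served from transcript
```

A replayed run never sees a "slightly different" tool reply, so it cannot branch. A call with arguments that were never recorded still falls through to the live tool.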
I think co-recursion between prompts and code is crucial, but I also think that the ephemeral nature of code in Recursive Language Models is impeding deployment-time learning (https://github.com/zby/llm-do/blob/main/kb/notes/deploy-time...).
I'm glad to see others talking about it. One day we'll look back on this era the same way folks look back at the time before we validated inputs.
https://www.stevenathompson.com/effective-vibe-coding-best-p...
LLMs really cause diminished reasoning, or in terms that LLM people might understand: Your minds have been quantized!
It goes on for ages just to reach the point of "write the tests first".
As the state travels across the graph, I keep a trace of the steps that were executed. That means that when an error happens, the agent has a lot more information than it normally would: it can see which decision points the code has already passed through, cross-reference that with the declared workflow, and quickly find where it screwed up.
The idea of workflow engines has been around for a long time, but they feel too awkward to use when you're writing code by hand. Writing conditional logic directly in the code keeps you in your flow, while having to jump out and declare it in config somewhere breaks it. Coding agents completely change the dynamic though, because they don't have that problem. If the LLM is writing the code, then I can just focus on ensuring the code meets the contract, while the agent can deal with the implementation details.
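A stripped-down sketch of the trace idea, assuming a workflow declared as named nodes where each node returns the name of the next one. `Node`, `WorkflowError`, `run_workflow`, and the two example steps are invented for illustration and are not the engine described above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    name: str
    # Takes the shared state, may mutate it, returns the next node name or None.
    run: Callable[[dict], Optional[str]]


class WorkflowError(Exception):
    """Carries the failing node and the path taken to reach it."""

    def __init__(self, node: str, trace: List[str], cause: Exception):
        super().__init__(f"{node} failed after {' -> '.join(trace)}: {cause}")
        self.node = node
        self.trace = trace
        self.cause = cause


def run_workflow(nodes: Dict[str, Node], start: str, state: dict) -> List[str]:
    """Walk the graph from `start`, recording every node visited."""
    trace: List[str] = []
    current: Optional[str] = start
    while current is not None:
        trace.append(current)
        try:
            current = nodes[current].run(state)
        except Exception as exc:
            raise WorkflowError(current, list(trace), exc) from exc
    return trace


def check_amount(state: dict) -> Optional[str]:
    if state["amount"] <= 0:
        raise ValueError("amount must be positive")
    return "charge"


def charge(state: dict) -> Optional[str]:
    state["charged"] = True
    return None


NODES = {
    "check_amount": Node("check_amount", check_amount),
    "charge": Node("charge", charge),
}
```

When a step raises, the `WorkflowError` message includes the path taken so far, which is the extra context an agent can compare against the declared graph.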