The phases were basically:
- try out having the LLM do "a lot"
- now even more
- now run multiple agents
- back to single agents but have the agents build tools
- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)
The reasons:
1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable
2. In the event of an outage, you can always fall back to the tool that a human can run
3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.
Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"
E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.
By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.
But singron's JWT point is the real limit. Backpressure doesn't remove the judgment, it moves it. The type still has to be written by someone who understands the actual invariant, and a guard type that compiles while only checking "string is non-empty" gives you the feeling of a gate with none of the guarantee. The compiler enforces what you encoded, not what you meant.
So this reads less like formal verification and more like forcing the human judgment to land up front, in the type definitions, instead of hoping it survives in a prompt. That's still a real win: a constraint in a type is reviewable and permanent, while a prompt rule decays the moment the context window moves past it. Worth being honest, though, that the hard part doesn't go away. It just changes address.
We could leverage existing FV tools for a given programming language by using an LLM to generate a translator that maps the language to the FV tool's input format. This would essentially require a finite number of "abstract interpretation" functions—one for each language construct. While the total number of constructs might be large (e.g., around 500), each function would be independent. A function would only need to reason about the abstract semantics of a single construct, assuming the others adhere to their respective semantics. We could then distribute these LLM-generated functions among a group of experts (e.g., 100 reviewers). Thanks to the modularity of the functions, reviewers could evaluate their assigned subset in parallel without bottlenecks. The end result would be a working FV tool for the target language.
In a way, you push some of the reasoning into deterministic tooling which is great both for reliability and performance.
At least on what I've been working on where I ended up creating type systems over SQL to solve some of the annoying issues I was having with agents reasoning over complex data infrastructure.
You write a type signature for a function that amounts to "take a Foo x and return a Bar y with a proof of does_what_i_wanted(x,y)." Voila, no more agents doing something else because it won't compile if they don't do what I wanted.
It's great to build faster without the frustration of having no confidence in what I build. But it sure makes the gap between toys in Lean and using this in a Real Project in some other language that much more frustrating.