FRESH

Hacker News

Home

Formal Verification Gates for AI Coding Loops

137 points by pyrex41

by alexpotato

2 subcomments

I work in DevOps at a firm that has been very enthusiastic about using LLMs (in the good sense).
The phases were basically:
- try out having the LLM do "a lot"
- now even more
- now run multiple agents
- back to single agents but have the agents build tools
- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)
The reasons:
1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable
2. In the event of an outage, you can always fall back to the tool that a human can run
3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.
Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"

by singron

1 subcomments

These guard types are great and I've heavily used them in the past. But why codegen them?
E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.
By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.

by solomonb

3 subcomments

This is really cool, but why wouldn't you just use a more richly typed target language and skip this process? You could use Liquid Haskell (for refinement types) or Lean (for full dependent types) and be able to put these invariants directly in your program rather then in a sidecar.

by max_unbearable

1 subcomments

This matches my experience running coding agents daily: the agent is reliable at producing the shape of a thing and unreliable at holding an invariant across a long loop. Moving the rule into a type the compiler won't violate works because it relocates the check to the one place the loop can't quietly skip.
But singron's JWT point is the real limit. Backpressure doesn't remove the judgment, it moves it. The type still has to be written by someone who understands the actual invariant, and a guard type that compiles while only checking "string is non-empty" gives you the feeling of a gate with none of the guarantee. The compiler enforces what you encoded, not what you meant.
So this reads less like formal verification and more like forcing the human judgment to land up front, in the type definitions, instead of hoping it survives in a prompt. That's still a real win: a constraint in a type is reviewable and permanent, while a prompt rule decays the moment the context window moves past it. Worth being honest, though, that the hard part doesn't go away. It just changes address.

by MetaverseClub

0 subcomment

My naive thoughts on Formal Verification (FV) and LLM:
We could leverage existing FV tools for a given programming language by using an LLM to generate a translator that maps the language to the FV tool's input format. This would essentially require a finite number of "abstract interpretation" functions—one for each language construct. While the total number of constructs might be large (e.g., around 500), each function would be independent. A function would only need to reason about the abstract semantics of a single construct, assuming the others adhere to their respective semantics. We could then distribute these LLM-generated functions among a group of experts (e.g., 100 reviewers). Thanks to the modularity of the functions, reviewers could evaluate their assigned subset in parallel without bottlenecks. The end result would be a working FV tool for the target language.

by jcurbo

0 subcomment

I've had similar thoughts lately while using agents for coding. It seems like if I give the agents external tools and sources of truth they perform really well. In my case it was a lot of standards, extensive use of Rust language features (e.g. traits), formal methods tools, large-scale architecture and specification documents (accessed via a special tool I also wrote), and using the beads_rust tool for tracking session state and giving the agent a list of issues to work. I hadn't thought of this as 'backpressure' but I really like that framing. It gives the agents guardrails and context and external validity that it can't make up.

by appstorelottery

0 subcomment

I think it's all about keeping state in the determinant space. I've come across the same issue, the key was to not rely on LLM performing workflow - the runtime needs to enforce.

by cpard

0 subcomment

What I've learned so far is that the tooling mentioned in this article can have dual purpose. Sure you can use it to gate the agent but you can also use them as tools for the agents to understand what they are working with at a more abstract level and with less tokens required.
In a way, you push some of the reasoning into deterministic tooling which is great both for reliability and performance.
At least on what I've been working on where I ended up creating type systems over SQL to solve some of the annoying issues I was having with agents reasoning over complex data infrastructure.

by eximius

0 subcomment

So, capabilities/type systems. Building code architecture guardrails steep enough the AI won't jump the fence/take shortcuts.

by ianhorn

0 subcomment

I've been experimenting with this a lot lately in Lean because it's equally capable as a theorem prover and as a programming language. It's resolving a lot of the frustration I feel with LLM coding.
You write a type signature for a function that amounts to "take a Foo x and return a Bar y with a proof of does_what_i_wanted(x,y)." Voila, no more agents doing something else because it won't compile if they don't do what I wanted.
It's great to build faster without the frustration of having no confidence in what I build. But it sure makes the gap between toys in Lean and using this in a Real Project in some other language that much more frustrating.

by genxy

0 subcomment

Great article. Backpressure is a rate control signal, not a correctness feedback control signal. Backpressure resists the flow of something down a pipe, what you need is an error control signal that tells upstream what error downstream is detecting (your FV oracle).

by vrm

1 subcomments

One question I have here: I think this type of thing would be trivial to do in Rust with constructors, private fields, and newtypes. What am I getting on top of it?

by pyrex41

5 subcomments

Author here. The TL;DR: move rules from prompts into types the compiler refuses to violate, then bounce the AI coding loop off those refusals. The repo is github.com/pyrex41/Shen-Backpressure. Builds a lot on Geoff Huntley's backpressure idea -- none of this is rocket science, just an effort to apply sound programming principles in a world of LLM coding agents.

by KaiShips

0 subcomment

[flagged]

by donbventures

0 subcomment

[flagged]

by levelhabits

0 subcomment

[flagged]

by zane_shu

0 subcomment

[flagged]