So, even if the proof is correct, you need to determine if the theorem is what you want. Making that determination requires expertise. Since you cannot "run the theorem", you cannot vibe-code your way through it. E.g., there is no equivalent of "web app seems to be working!" You have to actually understand what the theorems are saying in a deep way.
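A toy way to see this (a hypothetical Lean/Mathlib example of mine, not from the article): the proof below is airtight, but the theorem is not what anyone wants from a sort, because the spec forgot to demand that the output be a permutation of the input.

```lean
import Mathlib

-- Hypothetical spec: only requires the output to be sorted,
-- forgetting to require it be a permutation of the input.
def SortSpec (f : List Nat → List Nat) : Prop :=
  ∀ xs, List.Sorted (· ≤ ·) (f xs)

-- A correct proof of the wrong theorem: the "sort" that throws
-- the input away and returns [] satisfies the spec.
theorem bogus_sort_meets_spec : SortSpec (fun _ => []) :=
  fun _ => List.sorted_nil
```

Nothing about the proof tips you off; you only catch this by reading the statement of SortSpec and noticing what it fails to say.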
I can appreciate what he's getting at, but my utopian vision for the future is that we won't need to reinvent the wheel like this each time we want verified software! E.g., for high-consequence systems, the hard part of compiler correctness is already handled by the efforts behind CompCert, and SystemVerilog assertions for the design guarantees of processors are becoming more commonplace.
On the semantic gap between the original software and its representation in the ITP, program extraction, as in Rocq, probably deserves some discussion: the software is written natively in the ITP, and you have to prove the extraction itself sound. For example, the MetaRocq project did this for Rocq.
For the "how far down the stack" problem, there are some efforts like https://deepspec.org/, but it's inherently a difficult problem and often gets less love than lab-environment projects.
We have been building GFQL, an open-source graph query language that can run at the compute tier, separate from whatever existing storage system (Splunk, SQL, CSV, ...), making graphs easy for interactive notebooks/dashboards/analytics/pipelines/agents without needing graph DBs, even at billion scale. Part of the trick is that it is a bit unusual inside too: we are making Cypher-style queries run as columnar operations (purely vectorized for both CPU and GPU modes via a multi-target dataframe abstraction layer), but that takes a few innovations in the interpreter algorithm, so we worry about subtle semantic bugs. Most of our bugs can surface with small (graph, query) combos, so this sounded perfect for Alloy's small-scope hypothesis!
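To make the columnar idea concrete, here's a minimal hypothetical sketch in pandas (not GFQL's actual internals): a one-hop Cypher-style pattern expressed purely as vectorized filters and joins.

```python
import pandas as pd

# Hypothetical sketch: the Cypher pattern
#   MATCH (a)-[:KNOWS]->(b) WHERE a.age > 30 RETURN b
# as pure columnar dataframe operations.
nodes = pd.DataFrame({"id": [1, 2, 3], "age": [25, 40, 35]})
edges = pd.DataFrame({"src": [1, 2, 3], "dst": [2, 3, 1], "type": "KNOWS"})

starts = nodes[nodes["age"] > 30]               # vectorized node filter
hop = edges[edges["type"] == "KNOWS"].merge(    # join = edge traversal
    starts, left_on="src", right_on="id")
result = nodes.merge(hop[["dst"]], left_on="id", right_on="dst")
print(result[["id", "age"]])
```

The same plan runs on a GPU dataframe backend by swapping the library underneath, which is exactly where subtle semantic divergences can creep in.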
So we are running a natural experiment on which does more bug finding:
- asking Claude Code to analyze new code or recently found issues and using that for unit test amplification
- asking Claude Code to use the Alloy model checker to prove out our space of (query, graph) for key methods
- asking Claude Code to port a large Cypher conformance test suite
So far, Claude Code as a test amplifier is doing so much better than the others that it's our default preference. The real gem is when we make it do a "5 whys" root-cause analysis on why some bug got through and then have it do test amplification around several categories of potential error.
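For flavor, here's a hypothetical sketch (toy names, not GFQL's real interpreter or our actual suite) of what the amplified tests look like after a 5-whys on, say, a duplicate-edge bug: fan the one failure out across adjacent error categories.

```python
import pytest

# Toy one-hop stand-in for the interpreter under test.
def one_hop(edges, start):
    return {d for (s, d) in edges if s == start}

# Amplified cases: the escaped bug's category plus its neighbors.
CASES = [
    ("duplicate edges",   [("a", "b"), ("a", "b")], {"b"}),
    ("self loop",         [("a", "a")],             {"a"}),
    ("no outgoing edges", [("b", "a")],             set()),
    ("fan-out",           [("a", "b"), ("a", "c")], {"b", "c"}),
]

@pytest.mark.parametrize("label,edges,expected", CASES)
def test_one_hop_error_categories(label, edges, expected):
    assert one_hop(edges, "a") == expected, label
```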
We found the conformance suite to be "class imbalanced", concentrated on a couple of features we are only adding next, so the jury is still out on that one. And finally... Alloy hasn't really found anything. We suspect we need to change how we use Alloy to better reflect the kinds of bugs we actually find, but zero was disappointing.
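For anyone unfamiliar, the small-scope idea is roughly this (a hypothetical Python brute force; Alloy expresses it relationally and hands it to a SAT solver): exhaustively check a semantic property over every tiny instance, here that two one-hops compose into a two-hop on all directed graphs with at most 3 nodes.

```python
import itertools

# Toy semantics, not GFQL's API.
def one_hop(edges, frontier):
    return {d for (s, d) in edges if s in frontier}

def two_hop(edges, frontier):
    return {d2 for (s, d) in edges if s in frontier
               for (s2, d2) in edges if s2 == d}

nodes = range(3)
all_pairs = list(itertools.product(nodes, nodes))
# Enumerate every edge set over 3 nodes (2^9 graphs) and check.
for k in range(len(all_pairs) + 1):
    for chosen in itertools.combinations(all_pairs, k):
        edges = set(chosen)
        for start in nodes:
            assert two_hop(edges, {start}) == \
                   one_hop(edges, one_hop(edges, {start}))
print("property holds on all graphs with <= 3 nodes")
```

The hypothesis is that if a property fails at all, it fails on some instance this small; our experience so far suggests the hard part is writing properties that match the bugs we actually ship.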
If anyone is into this kind of thing, happy to discuss/collab!
If we can use AI to automatically implement a formal spec, then that formal specification language has just become a programming language.
The question is, of course, what portion of programs are much easier than the worst case (to the point of making verification tractable). That is not known, but the results are not necessarily encouraging. Programs that have been deductively verified in proof assistants were not only very small, and not only written with proofs in mind, but were also restricted to simple and less efficient algorithms to make the proofs doable. They tend to require between 10x and 1000x lines of proof per line of code.
(an old blog post of mine links to some important results: https://pron.github.io/posts/correctness-and-complexity)
There is a belief that programs that people write and that more-or-less work should be tractably provable, since that's what allowed writing them in the first place (i.e., the authors must have had some vague proof of correctness in mind when writing them). I don't think this argument holds (https://pron.github.io/posts/people-dont-write-programs), and we use formal methods precisely when we want to close the gap between working more-or-less and definitely always working.
But even if verifying some program is tractable, it could still take a long time between iterations. Since it could plausibly take an LLM a month or more to prove even a correct program, there's no point in stopping it earlier. So a "practical" approach could be: write a program, try proving it for a month, and if you haven't finished in a month, try again. That, of course, could mean waiting, say, six months before deciding whether the effort is likely to ultimately succeed. Nevertheless, I expect there will be cases where writing the program and the proof would have taken a team of humans 800 years, and an LLM could do it in 80.