FRESH

Hacker News

Home

Formal Methods and the Future of Programming

90 points by eatonphil

by Animats

0 subcomment

I used to do proof of correctness work, decades ago.[1] We had more proof automation than many of the later systems. The easy stuff was solved by the first SAT solver, the Oppen-Nelson simplifier. The harder stuff used the Boyer-Moore prover, which uses heuristics and previous lemmas to guide the theorem prover. The Boyer-Moore prover had to be helped along by suggesting lemmas, which it would try to prove and which would then be used for later proofs. That was the tough human job.
Later systems seemed to have less automation. What tended to go wrong with formal methods was that the people doing them were too into the formalism. I was working on this for a commercial project that wanted bug-free code, not as an academic. The academic projects tended to get too clever. They had the mathematician's bias of wanting a terse notation and not much case analysis. That's not a good solution. You want lots of automated grinding and don't want to need much insight. The clever people kept inventing new logics - modal logic, temporal logic, etc, - to avoid verbosity in paper and pencil proof work. That wasn't really all that helpful. The basic truth of this business is that most of the theorems are rather banal.
As the Jane Street people point out, there's a big advantage in having control of the language. You want the verification statements integrated into the programming language. In many systems, they're embedded in comments, in a different syntax than the programming language, or even in completely separate files. This adds unnecessary work. It's good to see them pushing this forward.
We were doing this too early, over 40 years ago. It took about 45 minutes of compute back then to build up number theory from a cold start with the Boyer-Moore prover. Now it takes less than a second.
[1] https://archive.org/details/manualzilla-id-5928072/page/n3/m...

by spenrose

2 subcomments

As usual, this paean to deductive reasoning (“formal methods”) leaves out its fundamental limit: how closely do the postulates and definitions fit the domain they purport to map? (“In theory, there is no difference between theory and practice. In practice ...”) My guess is that Jane Street maintains large code bodies where the mapping is 1:1, because the purpose of the code is to implement a deterministic algorithm. Many other coders work in such areas. But millions of us don’t: most UIs, most exploratory work, etc.
There is a movement parallel to formal methods to define acceptance criteria at high resolution but not logico-mathematically, which at least grapples with the mapping problem but can’t resolve it where the map isn’t the territory, which is most places. Has Google’s results page, with its extremely evolved internal optimization frameworks really hit an optimum? Could that prototype you whipped up to capture a hazy idea have better illustrated it? These questions are best answered by looking outside the system to what the system serves.

by jdw64

1 subcomments

In other words, because GEN AI a lot of code, the idea is to shift human value toward verification. Sometimes I think about what programming really is. In fact, learning programming itself is a huge challenge for a non English speaker like me. I have to rely on machine translation to understand English documents that have no translation. The materials in my language are about 5 to 6 years behind.
Now, since it's impossible to code review the tens of thousands of lines of code that AI produces, I see discussions about establishing an absolute rule like mathematical proofs. Reading this reminds me of Rust's borrow checker. In fact, after writing in Rust a few times, it often leads to bad practices where people use tricks to avoid the borrow checker.
Actually, when mathematical rigor goes too far, humans tend to find ways around it. An undereducated programmer like me is especially prone to that.
Looking back at this kind of attempt, it will probably result in writing code only for specific formalized answers. If it becomes that standardized, I'm not sure it will be able to respond to human needs.
I think these defensive programming attempts are fine, but I want to do offensive programming (I coined that term). You take risks, but you fix things quickly and ship. Believing that over time, it will become good enough. Of course, for established industries where accuracy matters and the scope of work is well defined, like Jane Street, the approach in this article is correct. In other words, because there is enough data to adequately model the market's demands
But for social losers like me trying to make money, constantly moving from place to place looking for gold mines, these kinds of methodologies seem like a luxury.Established businesses with mature modeling need highly educated and specialized personnel to continuously optimize. But I know that realistically, I can't keep up with that demand. So I look for places where modeling is unstructured, but I'm not sure if I can use this approach even then.

by Syzygies

0 subcomment

I nearly adopted OCaml for my current math research, in no small part because of Jane Street contributions. I instead chose Lean 4, as LLMs became able to help code in Lean. I give up a factor of two in performance (a gap likely to close) for what reads to me as the clearest expression of each algorithm. Lean is a beautiful language that makes OCaml read like Old English.
Lean is designed to be optionally verified; proof is its primary (but not only) application. It seems an impedance mismatch for Jane Street to explore this direction without also migrating to Lean 4.

by brap

3 subcomments

Whenever I read about formal specs it always seems to me like “write the same tests just in a different way”, or worse, “write the same implementation but in a different way”.
I guess doing things twice can help catch errors, but I fail to see what’s so special about formal specs if they can suffer from the exact same bugs as the tests/implementation.
I guess the root of the problem is if you want to formally prove that a program does something, you have to be very specific (heh) about what that something is, at which point you are basically just writing tests/implementation all over again.
I have been looking into this on and off for years now, and I keep feeling like I’m missing something, but I don’t know what it is.
Can anyone enlighten me?

by xvilka

1 subcomments

seL4 despite being formally verified contained gnarly bugs, like [1]. Thus, it's not a silver bullet as some people think. Yes, it improves the quality, but only one of the aspects of hardening software, like ASAN, SAST, fuzzing, strict languages, and so on.
[1] https://github.com/seL4/seL4/issues/1199

by awinter-py

0 subcomment

See previous kleppmann post https://martin.kleppmann.com/2025/12/08/ai-formal-verificati..., and yes, obviously anything that you can put in the typesystem or the linter, you should weigh doing so.
Hopefully we get more ergonomic ways to do this? Like of the tools listed in the post, dafny + iris are the closest to being industrial I think. And amzn S3 has a history of TLA use in-house I think. But we probably haven't seen the typescript in this space yet, a zero cost abstraction that drops into existing tools, and people genuinely prefer it to the old way.
(And custom linters are also still pretty bad to write. Like golangci-lint is a painful codebase, haven't tried semgrep but the rules engine seemed intimidating. I've yet to use an AST API that I liked)

by mark4

0 subcomment

Relevant keynote I wrote about
Formal methods won, now what? https://muratbuffalo.blogspot.com/2026/04/bugbash26-keynote....

by reinitctxoffset

0 subcomment

I caught the religion on using types in conjunction with LLMs about eighteen months ago, but I only really got serious about `lean4` like six to eight months ago and now I wouldn't even consider using AI assist in software work with a `CIC` proof substrate that has practical C/C++ (and therefore, everything) FFI.
We've banned everything from JSON to Python, rewritten `nix` to have a compiler, and almost everything we write is not only property tested and multiply fuzzed to within an inch of it's life, but we have proofs in `lean4` that at a minimum drive property tests via `.olean` linkage, and when we have the bandwidth we prove exhaustiveness over the domain and property test that.
We skip the whole C++/Rust thing because all of the fast stuff is generated from `lean4` and so it doesn't really matter (C++ has advantages when bugs in it are actually bugs in `lean4` code, but you could go either way).
This is a big departure, and I certainly don't blame anyone who is skeptical: "ban JSON and Python wtf?!?!", but we've done millions of lines (checked) of this stuff and AI + formal systems is a dramatically bigger leap than no AI -> AI and Python. The latter in our experience is not monotone in progress, the former is almost always monotone in progress.
And you can do wild shit, this is a formal proof of the polyhedral tensor calculus that is modeled by things like ISL and CuTe, and using that we can get swizzling and tiling using `mdspan` in C++23 on device and prove it's right (up to some L'Hopital arugment about the coverage which this example doesn't demonstrate well: https://github.com/b7r6/mdspan-cute
That in turn, well, it goes real fast. On the first try.
https://straylight.software/assets/lambda-hierarchy.svg

by yawaramin

1 subcomments

Quis custodiet ipsos custodes?

by eddiepete

2 subcomments

I wonder how formal methods can help us move faster with GenAI.
Is it that they can help write formulas faster? That they can help ensure formulas match the system they're modeling faster? If the problem you think formal methods will help with is sloppy code, isn't the verification code going to be sloppy as well, unless some (not sloppy) intelligence carefully confirms that the specification matches the target system, which was the labor that previously made formal methods too expensive? I guess I don't understand how the argument works if code was previously less sloppy and verification was too expensive, and now code is more sloppy, and there's more of it, but somehow the sloppy intelligence will make verification move fast enough to make it worthwhile.
Unless we have some non-sloppy intelligence that's less of a bottleneck on verification than humans, how are we in a better place?
Maybe it's that investing that huge amount of labor of verification by human experts is now worth it because so much code will be produced that uses the verification systems that the investment will now pay off. But that requires creating pretty general verification systems, such as type system verification or something (which is what they seem to be aimed at), rather than individually verifying software systems like the micro-kernel example.
In other words, maybe the play is to invest in reusable verification systems that can be run tons of times on new code and systems. If so, it's surprising that this wasn't always the strategy.

by bobkb

0 subcomment

I have been testing formal verification methods with multiple products. It will be great to also understand more about what’s tried and how it was done. For example attempting to verify the spec is what I have been trying to implement.

by pjmlp

0 subcomment

I am already seeing the uptake on Spec-Driven Development as the Rational Unified Process revenge.