We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:
- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it
- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis
- Get better AI-based review: Greptile and Bugbot and half a dozen others
- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.
None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.
But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.
>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.
That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck deep in finding out what to build. Some developers like doing that (likely for free) but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.
If you are building POCs (and everyone understands it's a POC), then AI is actually better at getting those built, as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.
Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.
Obviously that could only work in a high trust environment; that's why open source suffers so much with AI submissions.
Claude and friends represent an increase in coders, without any corresponding increase in code reviewers. It's a break in the traditional model of reviewing as much code as you submit, and it all falls on human engineers, typically the most senior.
Well, that model kinda sucked anyways. Humans are fallible and Ironies of Automation lays bare the failure modes. We all know the signs: 50 comments on a 5-line PR, a lonely "LGTM" on the 5000-line PR. This is not responsible software engineering or design; it is, as the author puts it, a big green "I'm accountable" button with no force behind it.
It's probably time for all of us on HN to pick up a book or course on TLA+ and elevate the state of software verification. Even if Claude ends up writing TLA+ specs too, at least that will be a smaller, simpler code base to review?
Verifying that they all work can be done in many ways, most of them high-touch - but to me the most effective way is to build a test suite.
And the best way to get a test suite while building is Test Driven Development (TDD), with the key trait that you witnessed the tests fail before making them pass, giving you proof they actually prove something about your code. It's a high-leverage way to ensure details are documented and codified in a way that requires "zero tokens at rest". If a test fails, something has been un-built; something has regressed. Conversely, if all tests pass, your agent burned zero tokens learning that.
The industry will keep inventing other solutions but we have this already, so if you’re in the know, you should use it.
If you’re wondering how to get started you (or your agent) can crib ideas from what I’ve done & open sourced: https://codeleash.dev/docs/tdd-guard/
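The "witness the test fail first" loop is easy to show in miniature (the function and its spec here are hypothetical, just to illustrate the rhythm):

```python
# Minimal red/green TDD sketch. The point: run the test before the
# implementation exists, watch it fail, then make it pass.

# Step 1 (red): write the test first.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Running test_slugify() now raises NameError: slugify is not defined.
# The test fails for the right reason, proving it exercises something real.

# Step 2 (green): write the minimal implementation.
import re

def slugify(title: str) -> str:
    """Lowercase, strip punctuation, join words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# Step 3: the exact same test now passes, unchanged.
test_slugify()
```

That observed fail-then-pass transition is what gives the suite its evidentiary value when an agent later changes the code.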
I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.
Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased as trunk-based development to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping too much. :-)
We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.
I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.
In the end, this structure took me from ~73% first-pass to over 90%.
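A toy version of that stage-boundary triage might look like this (the issue kinds and the auto-fixable set are made up for illustration; the real classification came from observed error patterns):

```python
# Sketch of stage-boundary triage: split findings into "easy" issues the
# pipeline fixes automatically and "hard" ones escalated to a human.
from dataclasses import dataclass

@dataclass
class Issue:
    kind: str    # e.g. "lint", "missing-import", "wrong-behavior"
    detail: str

# Mechanical, safe-to-auto-fix categories (assumed for this sketch).
AUTO_FIXABLE = {"lint", "formatting", "missing-import"}

def triage(issues: list[Issue]) -> tuple[list[Issue], list[Issue]]:
    """Return (easy, hard): auto-fix bucket and escalate-to-human bucket."""
    easy = [i for i in issues if i.kind in AUTO_FIXABLE]
    hard = [i for i in issues if i.kind not in AUTO_FIXABLE]
    return easy, hard

issues = [
    Issue("lint", "unused variable `tmp`"),
    Issue("wrong-behavior", "off-by-one in pagination"),
]
easy, hard = triage(issues)
# easy  -> fixed automatically at the stage boundary
# hard  -> escalated to the operator
```

The win is that each stage boundary is a checkpoint, so cheap problems never reach the expensive human queue.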
As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.
I prompted it to cite and footnote all sources, avoid plagiarism and AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!") but various online checkers did not detect AI writing or plagiarism. I spot checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a Uni I probably could).
Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.
But how do I know Gemini didn't hallucinate the same things Claude did?
Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.
When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy as possible to verify. Split your PR into palatable chunks. Document and comment to aid verification. Add tests that are easy for the reviewer to read, test and tweak. Etc.
When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.
This is going to be way harder now vs. when we used to write the code ourselves. In the contracting space, the problem now is that you may have a client who vibe coded an app and is very out of touch about the costs involved in having a developer approve it. It's going to be a hard sell when the client builds the entire thing themselves and you are a mere peasant doing QA review.
With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.
I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.
Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.
Software development is a highly complex task and verification becomes not just validation of the output but also verification that the work is solving the problem desired, not just the problem specified.
I'm empathetic to that scenario, but this was a problem with software development to begin with. I would much rather be in a situation of reducing friction to verification than reducing friction to discovery.
Cognitive load might be the same but now we get a potential boost in productivity for the same cost.
Hand-crafted, scalable code will be a very rare phenomenon.
There will be a clear distinction between the two.
The recent Devstral 2 (Mistral) is pretty precise and concise in its changes.