We'll need to figure out the techniques and strategies that let us merge AI code sight unseen. Some ideas that have already started floating around:
- Include the spec for the change in your PR and only bother reviewing that, on the assumption that the AI faithfully executed it
- Lean harder on your deterministic verification: unit tests, full stack tests, linters, formatters, static analysis
- Get better AI-based review: Greptile and Bugbot and half a dozen others
- Lean into your observability tooling so that AIs can fix your production bugs so fast they don't even matter.
None of these seem fully sufficient right now, but it's such a new problem that I suspect we'll be figuring this out for the next few years at least. Maybe one of these becomes the silver bullet or maybe it's just a bunch of lead bullets.
But anyone who's able to ship AI code without human review (and without their codebase collapsing) will run circles around the rest.
>And six months later you discover you’ve built exactly what the spec said — and nothing the customer actually wanted.
That's not a developer problem, it's a PM/business problem. Your PM or equivalent should be neck deep in finding out what to build. Some developers like doing that (likely for free) but they can't spend as much time on it as a PM because they have other responsibilities, so they are likely not as good at it.
If you are building POCs (and everyone understands it's a POC), then AI is actually better at getting those built, as long as you clean it up afterwards. Having something to interact with is still way better than passively staring at designs or mockup slides.
Developers being able to spend less time on code that is helpful but likely to be thrown away is a good thing IMO.
Obviously that could only work in a high trust environment; that's why open source suffers so much with AI submissions.
Claude and friends represent an increase in coders, without any corresponding increase in code reviewers. It's a break in the traditional model of reviewing as much code as you submit, and it all falls on human engineers, typically the most senior.
Well, that model kinda sucked anyways. Humans are fallible and Ironies of Automation lays bare the failure modes. We all know the signs: 50 comments on a 5-line PR, a lonely "LGTM" on the 5000-line PR. This is not responsible software engineering or design; it is, as the author puts it, a big green "I'm accountable" button with no force behind it.
It's probably time for all of us on HN to pick up a book or course on TLA+ and elevate the state of software verification. Even if Claude ends up writing TLA+ specs too, at least that will be a smaller, simpler code base to review?
Verifying that they all work can be done in many ways, most of them high-touch - but to me the most effective way is to build a test suite.
And the best way to get a test suite while building is Test Driven Development (TDD), with the key trait that you witnessed the tests fail before making them pass, giving you proof they actually prove something about your code. It's a high-leverage way to ensure details are documented and codified in a way that requires "zero tokens at rest". If a test fails, something has been un-built; something has regressed. Conversely, if all tests pass, your agent burned zero tokens learning that.
The industry will keep inventing other solutions but we have this already, so if you’re in the know, you should use it.
If you’re wondering how to get started you (or your agent) can crib ideas from what I’ve done & open sourced: https://codeleash.dev/docs/tdd-guard/
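The "witness the test fail first" loop is easy to show in miniature (the function and its spec here are hypothetical, just to illustrate the rhythm):

```python
# Minimal red/green TDD sketch. The point: run the test before the
# implementation exists, watch it fail, then make it pass.

# Step 1 (red): write the test first.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"

# Running test_slugify() now raises NameError: slugify is not defined.
# The test fails for the right reason, proving it exercises something real.

# Step 2 (green): write the minimal implementation.
import re

def slugify(title: str) -> str:
    """Lowercase, strip punctuation, join words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# Step 3: the exact same test now passes, unchanged.
test_slugify()
```

That observed fail-then-pass transition is what gives the suite its evidentiary value when an agent later changes the code.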
I currently work in a software field that has a large numerical component and verifying that the system is implemented correctly and stable takes much longer than actually implementing it. It should have been like that when I used to work in a more software-y role, but people were much more cavalier then and it bit that company in the butt often. This isn't new, but it is being amplified.
Perhaps this will finally force the pendulum to swing back towards continuous integration (the practice now aliased as trunk-based development to disambiguate it from the build server). If we're really lucky, it may even swing the pendulum back to favoring working software over comprehensive documentation, but maybe that's hoping too much. :-)
We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.
I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.
In the end, this structure took me from ~73% first-pass to over 90%.
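A toy version of that stage-boundary triage might look like this (the issue kinds and the auto-fixable set are made up for illustration; the real classification came from observed error patterns):

```python
# Sketch of stage-boundary triage: split findings into "easy" issues the
# pipeline fixes automatically and "hard" ones escalated to a human.
from dataclasses import dataclass

@dataclass
class Issue:
    kind: str    # e.g. "lint", "missing-import", "wrong-behavior"
    detail: str

# Mechanical, safe-to-auto-fix categories (assumed for this sketch).
AUTO_FIXABLE = {"lint", "formatting", "missing-import"}

def triage(issues: list[Issue]) -> tuple[list[Issue], list[Issue]]:
    """Return (easy, hard): auto-fix bucket and escalate-to-human bucket."""
    easy = [i for i in issues if i.kind in AUTO_FIXABLE]
    hard = [i for i in issues if i.kind not in AUTO_FIXABLE]
    return easy, hard

issues = [
    Issue("lint", "unused variable `tmp`"),
    Issue("wrong-behavior", "off-by-one in pagination"),
]
easy, hard = triage(issues)
# easy  -> fixed automatically at the stage boundary
# hard  -> escalated to the operator
```

The win is that each stage boundary is a checkpoint, so cheap problems never reach the expensive human queue.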
As an experiment, I had Claude Cowork write a history book. I chose as subject a biography of Paolo Sarpi, a Venetian thinker most active in the early 17th century. I chose the subject because I know something about him but am far from expert, because many of the sources are in Italian, in which I am a beginner, and because many of the sources are behind paywalls, which does not mean the AIs haven't been trained on them.
I prompted it to cite and footnote all sources, avoid plagiarism and AI-style writing. After 5 hours, it was finished (amusingly, it generated JavaScript and emitted a DOCX). And then I read the book. There was still a lingering jauntiness and breathlessness ("Paolo Sarpi was a pivotal figure in European history!") but various online checkers did not detect AI writing or plagiarism. I spot checked the footnotes and dates. But clearly this was a huge job, especially since I couldn't see behind the paywalls (if I worked for a Uni I probably could).
Finally, I used Gemini Deep Research to confirm the historical facts and that all the cited sources exist. Gemini thought it was all good.
But how do I know Gemini didn't hallucinate the same things Claude did?
Definitely an incredible research tool. If I were actually writing such a book, this would be a big start. But verification would still be a huge effort.
When you submit a PR, verifiability should be top of mind. Use those magic AI tools to make the PR as easy as possible to verify. Split your PR into palatable chunks. Document and comment to aid verification. Add tests that are easy for the reviewer to read, test and tweak. Etc.
When he was first hired, I asked him to refactor a core part of the system to improve code quality (get rid of previous LLM slop). He submitted a 2000+ line PR within a day or so. He's getting frustrated because I haven't reviewed it and he has other 2000+ line PRs waiting on review. I asked him some questions about how this part of the system was invoked and how it returned data to the rest of the system, and he couldn't answer. At that point I tried to explain why I am reluctant to let him commit his refactor of a core part of the system when he can't even explain the basic functionality of that component.
This is going to be way harder now vs. when we used to write the code ourselves. In the contracting space, the problem now is that you may have a client who vibe coded an app and is very out of touch about the costs involved in having a developer approve it. It's going to be a hard sell when the client builds the entire thing themselves and you are a mere peasant doing QA review.
With better models and harnesses (e.g. Claude Code), I can now trust the AI more than I would trust a junior developer in the past.
I still review Claude's plans before it begins, and I try out its code after it finishes. I do catch errors on both ends, which is why I haven't taken myself out of the loop yet. But we're getting there.
Most of the time, the way I "verify" the code is behavioral: does it do what it's supposed to do? Have I tried sufficient edge cases during QA to pressure-test it? Do we have good test coverage to prevent regressions and check critical calculations? That's about as far as I ever took human code verification. If anything, I have more confidence in my codebases now.
Software development is a highly complex task and verification becomes not just validation of the output but also verification that the work is solving the problem desired, not just the problem specified.
I'm empathetic to that scenario, but this was a problem with software development to begin with. I would much rather be in a situation of reducing friction to verification than reducing friction to discovery.
Cognitive load might be the same but now we get a potential boost in productivity for the same cost.
Hand-crafted, scalable code will be a very rare phenomenon.
There will be a clear distinction between the two.
The recent Devstral 2 (Mistral) is pretty precise and concise in its changes.