It's even worse than that: non-junior devs are doing it as well.
From there, I include explicit steps for how to test, including manual testing, and unit test/E2E test commands. If it's something visual, I try to include at least a screenshot, or sometimes even a brief screen capture demonstrating the feature.
Really go out of your way to make the reviewer's life easier. One benefit of doing all of this is that in most cases, the reviewer won't need to reach out to ask simple questions. This also helps to enable more asynchronous workflows, or distributed teams in different time zones.
None of this is covered by code generation, nor by juniors submitting random PRs. Those are symptoms of juniors (not only) missing fundamentals. When we forget what the job actually is, we create misalignment with junior engineers and end up with weird ideas like "spec-driven development"
If anything, coding agents are a wake-up call that clarify what engineering profession is really about
Testing only “proves” correctness for the specific state, environment, configuration, and inputs the code was tested with. In practice that only tests a tiny portion of possible circumstances, and omits all kinds of edge and non-edge cases.
Is anyone else seeing this in their orgs? I'm not...
That's why I refuse to take part in it. But I'm an old-world craftsman by now, and I understand nobody wants to pay for working, well-thought-out code any more. They don't want a Chesterfield; they want plywood and glue.
Therefore you must verify it works as intended in the real world. This means not shipping code and hoping for the best, but checking that it actually does the right thing in production. And on top of that, you have to verify that it hasn't caused a regression in something else in production.
You could try to do that with tests, but tests aren't always feasible. Therefore it's important to design fail-safes into your code that ALERT YOU to unexpected or erroneous conditions. It needs to do more than just log an error to some logging system you never check - you must actually be notified of it, and you should consider it a flaw in your work, like a defective pair of Nikes on an assembly line. Some kind of plumbing must exist to take these error logs (or metrics, traces, whatever) and send it to you. Otherwise you end up producing a defective product, but never know it, because there's nothing in place to tell you its flaws.
Every single day I run into somebody's broken webapp or mobile app. Not only do the authors have no idea (either because they aren't notified of the errors, or don't care about them), there is no way for me to even e-mail the devs to tell them. I try to go through customer support, a chat agent, anything, and even they don't have a way to send in bug reports. They've insulated themselves from the knowledge of their own failures.
Strong disagree here, your job is to deliver solutions that help the business solve a problem. In _most_ cases that means delivering code that you should be able to confidently prove satisfies the requirements like the OP mentioned, but I think this is an important nitpick distinction I didn't understand until later on in my career.
I vibe code a lot of stuff for myself, mostly for viewing data, when I don’t really need to care how it works. I’m coming around to the idea that outside of some specific circumstances where everyone has agreed they don’t need to care about or understand the code, team vibe coding is a bad practice.
If I’m paying an engineer, it’s for their work, unless explicitly agreed otherwise.
I think vibe coding is soon going to be seen the same way as “research” where you engage an offshore team (common e.g. in consulting) to give you a rundown on some topic and get back the first five google search results. Everyone knows how to do that, if it’s what they wanted they wouldn’t be hiring someone to do it.
Manual and automatic testing are still both required, but you must explicitly ensure that security considerations are included in those tests.
The LLM doesn't care. Caring is YOUR job.
The root cause is the second problem; short of formal verification you can never exhaustively prove that your code works. You can demonstrate and automate that demonstration for a sensible subset of inputs and states and hope for the state of the world approximately staying that way (spoiler: it won't). This is why 100% test coverage in most cases is something bad. This is why sensible is the key operative attitude, which LLM suck at right now.
The root cause of that one is the third problem; your job is to solve a business problem. If your code is not helping the business problem, it actually is not working in the literal sense of the work. It is an artifact that does a thing, but it is not doing work. And since you're downstream of all the self-contradicting, ever changing requirements in a biased framing of a chaotic world, you can never prove or demonstrate that your code solves a business problem and that is the end state.
Agents love to cheat. That's an issue I don't see a horizon for change.
Here's Opus 4.5 trying to cheat its way out of properly implementing compatibility and cross-platform, despite the clear requirements:
https://gist.github.com/alganet/8531b935f53d842db98157e1b8c0...
> Should popen handles work with fgets/fread/fwrite? PHP supports this. Option A: Create a minimal pipe_io_stream device / Option B: Store FILE* in io_private with a flag / Option C: Only support pclose, require explicit stream wrapper for reads.
If I asked for compatibility, why give me options that won't fully achieve it?
It actually tried to "break check" my knowledge about the interpreter (test me if I knew enough to catch it), and proposed shortcuts all the way through the chat.
I don't want to have to pepper my chats with variations on "don't cheat". I mean, I can do it, but it seems like boilerplate.
I wish I had some similar testing-related chats to share. Agents do that all the time.
This is the major blocker right now for AI-assisted automated verification, and one of the reasons why this isn't well developed beyond general directions (give it screenshots, make it run the command, etc).
If someone's not even interested and excited to see their code work, they are in the wrong profession.
I know Simon follows this "Issue First" style of work in his projects, with a strong requirement for passing tests to be included.
It's been a best practice for a long time. I really enjoyed this when I read it ~10 years ago, and it still stands the test of time:
https://rfc.zeromq.org/spec/42/#24-development-process
The rationale was articulated clearly in:
https://hintjens.gitbooks.io/social-architecture/content/cha...
If you have time, do yourself a favour and read the whole lot. And then liberally copy parts of C4 into your own process. I have advocated for many components of it, in many contexts, at $employer, and will continue to do so.
A colleague was working on an important subsystem and would ask Djikstra for a review when he thought it was ready. Djikstra would have to stop what he was doing, analyze the code, and would find a grievous error or edge case. He would point it out to the colleague who would then get back to work. The colleague would submit his code for review again and this could carry on enough times that Djikstra got annoyed.
Djikstra proposed a solution. His colleague would have to submit with his code some form of proof or argument as to why it was correct and ready to merge. That way Djikstra could save time by only having to review the argument and not all of the code.
There’s a way of looking at LLM output as Djikstra’s colleague. It puts a lot of burden on the human using this tool to review all of the code. I like Doctorow’s mental model of a reverse centaur. The LLM cannot reason and so won’t provide you with a sound argument. It can probably tell you what it did and summarize the code changes it made… but it can’t decide to merge those changes. It needs a human, the bottom half of the centaur, to do the last bit of work here. Because that’s all we’re doing when we let these tools do most of the work for us: we’re here to take the blame.
And all it takes is an implementation of what we’re trying to build already, every open source library ever, all of SO, a GW of power from a methane power plant, an Olympic pool of water and all of your time reviewing the code it generates.
At the end of the day it’s on you to prove why your changes and contributions should be merged. That’s a lot of work! But there’s no shortcuts. Luckily you can reason while the LLMs struggle with that so use it while you can when choosing to use such tools.
I would go a step further: we need to deliver code that belongs. This means following existing patterns and conventions in the codebase. Without explicit instruction, LLMs are really bad at this, and it's one of the things that make it incredibly obvious to reviews that a given piece of code has been generated by AI.
[1] I.e., it should work
That may seem pedantic but that’s a huge difference. Code is a means to an end. If no-code suddenly became better than code through some miracle, that would be your job.
This also means that if one day AI stops making mistakes, tossing AI requests over the wall may be a legitimate modus operandi.
Overall, this hits the nail on the head about not delivering broken code and providing automated tests. Thanks for putting your thoughts on paper.
I’m experimenting with how to get these into a PR, and the “gh” CLI tool is helpful.
Does anyone have a recipe to get a coding agent to record video of webflows?
Remember, code does 2 things: 1. Tell the machine what to do 2. Tell the next developer what you were trying to do
In fact, if any bugs were found by the official "last step" QA Department, we (as a software development department) were dinged. If QA found bugs, they could stop the entire product release, so you did not want to be responsible for that.
This resulted in each software development department setting up their own, internal "QC team" of testers. If they found bugs, then individual programmers (or teams) would get dinged, but the main department would not.
Our software got a lot of testing.
If you, the development engineer, haven't demonstrated the product to work as expected, and preferably this testing is independently confirmed by a product test group, then you can't claim to be delivering a functional product.
I would add though, that management, specifically marketing management setting unreasonable demands and deadlines, are a bigger threat to testing than LLMs.
Of course the damage done by LLM generated code not being tested, is additive to the damage management is doing.
So this isn't any kind of apologism, the two sources are both making the problem worse.
Outside in testing is great but I typically do automated outside in testing and only manual at the end. The loop process of testing needs to be repeatable and fast, manual is too slow
I'm noticing something else very similar but involving not necessarily junior roles with long messages, when they use these AI writing assistants that resume stuff, creates follow-ups, etc. Putting this additional burden in whoever needs to read it. It makes me think of a quote that says: "I would have written a shorter letter, but I didn't have the time."
It's not my job, really. And given by the state of IT these days it's barely anyone's it seems.
Seriously, this cannot be emphasized enough. Before LLMS when we were writing tests completely manually, manual testing made sense to me as the second step. However after playing around a lot with coding agents and LLMs, I fully agree this has flipped. Test it manually first! When you generate the tests it is extremely wise to ensure that the tests fail without the new code, and pass with it. You definitely need to review the test though, because it's remarkably easy to have the agent put something in there that makes it not a good test.
Just a couple days ago for example, Claude made a test pass by skipping authentication and leaving a brief comment informing that the authentication made the test flaky. It even threw a quick variable in there that enabled running or disabling flaky tests, and flaky tests were disabled by default! Had I not been doing a good review, I definitely would have missed it because it was cleverly subtle. I've also seen it test the wrong endpoint!
As I figure out my manual testing, I'll write out the steps that I took in my PR.
I've found that writing it out as I go does two things: 1) It makes it easier to have a detailed PR and 2) it acts as a form of rubber-ducking. As I'm updating my PR I'll realize steps I've missed in my testing.
Something that also helped out with my manual testing skill was working in a place that had ZERO automated testing. Every PR required a detailed testing plan that you did and that your reviewer could re-create.
That’s the thing. People exposing such rude behavior usually are not, or haven’t been in a looong time…
As for the local testing part not being performed, this is a slippery slope I’m fighting everyday: more and more cloud based services and platforms are used to deploy software to run with specific shenanigans and running it locally requires some kind of deep craft and understanding. Vendor lock-in is coming back in style (e.g. Databricks)
A bit clunky, but I think that can be scaled from individual lines of code to features or entire systems, whatever you are responsible for delivering, and encompasses all the processes that go around figuring what code is to be actually written and making sure it does what it's supposed to.
Trust and accountability are absolutely a critical aspect of software engineering and the code we deliver. Somehow that is missed in all the discussions around AI-based coding.
The whole phenomenon of AI "workslop" is not a problem with AI, it's a problem with lack of accountability. Ironically, blaming workslop on AI rather than organizational dysfunction is yet another instance of shirking accountability!
My approach to coding agents is to prepare a spec at the start, as complete as possible, and develop a beefy battery of tests as we make progress. Yesterday there was a story "I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in hours". They had 9000+ tests. That was the secret juice.
So the future of AI coding as I see it ... it will be better than pre-2020, we will learn to spec and plan good tests, and the tests are actually our contract the code does what is supposed to do. You can throw away the code and keep the specs and tests and regenerate any time.
Otherwise you’ll end up in situations where it passes all test cases yet fails for something unexpected in the real world, and you don’t know why, because you don’t even know what’s going on under the hood.
Your job is to the deliver code up to specifications.
Not even checking the happy flow at least is of course gross negligence. But so is spending too much time on edge cases that no one will run into or person asking doesn’t want to pay for covering.
Kind of depressing how it has become such a trope of blaming juniors for every ill or bad habit. In all likelihood the reader of this comment has a number of terrible habits, working on teams with terrible habits, and juniors play zero part in it.
And, I mean, on that theme developers have been doing this for as long as we've had large teams. I've worked at a large number of teams where there was the fundamental principal that QA / UA holds responsibility. That they are responsible for tests, and they are responsible for bad code making it through to the product / solution. Developers -- grizzled, excellent-CV devs -- would toss over garbage code and call it a day.
How to prove it has been subject to some debate for the past century, the answer is that it's context dependent to what degree you will or even can prove the program and exposed identifiers correct. Programming is a communication problem as well as a math problem, often an engineering problem too. Only the math portion can be proved, the a small by critical amount engineering portion tested.
Communication is the most important for velocity it's the difference between hand rolling machine code and sshing into a computer halfway across the world having every tool you expect. If you don't trust that webdevs know what they are doing then you can be the most amazing dev in the world you but your actual ability to contribute will be hampered. The same is true of vibe coding, if people aren't on the same page as to what is and isn't acceptable velocity starts to slow down.
Languages have not caught up to AI tools, since AI operates well above the function level, what level would be appropriate to be named and signed off on? pull request and link to the chat as a commit? (what is wrong with that that could be fixed at the name level)
Honest communication is the most important. Amazon telling investors that they use TLA+ is just signaling that they "for realz take uptime very seriously guize", "we know distributed systems" and engineering culture. The honest reality is that they could prove all their code and not IMprove their uptime one lick, because most of what they run isn't their code. It's a communication breakdown if effort gets spent on that outside a research department.
My takeaway from your blog post yesterday was that with a robust enough testing system the LLM can do the entire thing while I do Christmas with the family.
(Before all the AI fans come in here. I'm not criticizing AI.)
I'd go further: what's valuable is code review. So review the AI agent's code yourself first, ensuring not only that it's proven to work, but also that it's good quality (across various dimensions but most importantly in maintainability in future). If you're already overwhelmed by that thousand-line patch, try to create a hundred-line patch that accomplishes the same task.
I expect code review tools to also rapidly change, as lines of code written per person dramatically increase. Any good new tools already?
It often takes discipline to think and completely map out solutions before you build. This is where experience and knowing common patterns can also help.
When you have the experience of having manually written or read a lot of code it helps at the very least quickly understand what the LLMs are writing and reason about it later even if not at the beginning.
That is part of it, yes, but there are many others, such as ensuring that the new code is easy to understand and maintain by humans, makes the right tradeoffs, is reasonably efficient and secure, doesn't introduce a lot of technical debt, and so on.
These are things that LLMs often don't get right, and junior engineers need guidance with and mentoring from more experienced engineers to properly learn. Otherwise software that "works" today, will be much more difficult to make "work" tomorrow.
That's really not a great development for us. If our main point is now reduced to accountability over the result with barely any involvement in the implementation - that's very little moat and doesn't command a high salary. Either we provide real value or we don't ...and from that essay I think it's not totally clear what the value is - it seems like every QA, junior SWE or even product manager can now do the job of prompting and checking the output.
I'd buttress this statement with a nuance. Automated tests typically run in their entirety, usually by a well-known command like cargo test or at least by the CI tools. Manual tests are often skipped because the test seems to be far away from the code being changed.
My all-time favorite team had a rule that your code didn't exist if it didn't have automated tests to "defend" it. If it didn't, it was OK, or at least not surprising, for someone else to break or refactor it out of existence (not maliciously, of course).
call me the worst junior dev in the industry, but pre-coding agents, closing tickets was more important than upholding absolute quality. not everybody is dealing with a billion concurrent users with multi-geo deployments. most of the time, a few screenshots or test output for manual validation is enough to go ahead. when pressed with time and without the prerequisites in the infra side, doing the absolute best development and testing is a luxury only for daydreamers.
automated testing can be a double-edged sword. pre-LLM, even test coverage was a number that somehow needed to go up after each PR. this only resulted in shady tactics of pointless test cases that slowly bring up the metric. today it can be very dangerous if both code and its test suite are vibe coded. especially when it can give the appearance of that 90%+ code coverage.
on the other hand, some manual testing to make sure the core functionality works is the bare minimum one does before pushing out code. at least i would like to believe it is.
The submitter should understand how it works and be able to 'own' and review modifications to it. That's cognitive work submitters ipso facto don't do by offloading the understanding to an LLM. That's the actual hard work reviewers and future programmers have to do instead.
I prefer to make this probabilistic: use testing to reduce the probability that your code isn't correct, for the situations in which it is expected to be deployed. In this sense, coding and testing is much like doing experimental physics: we never really prove a theory or disprove it, we just invalidate clearly wrong ones.
LLMs do make mistakes. They do a sloppy job at times.
But give it a year. Two years. five years. It seems unreasonable to assume they will hit a plateau that will prevent them from being able to build, test, and ship code better than any human on earth.
I say this because it's already happened.
It was thought impossible for a computer to reach the point of being able to beat a grandmaster at chess.
There was too much "art," experience, and nuance to the game that a computer could ever fully grasp or understand. Sure there was the "math" of it all, but it lacked the human intuition that many thought were essential to winning and could only be achieved through a lifetime of practice.
Many years following Deep Blue vs. Garry Kasparov, the best players in the world laugh at the idea of even getting close to beating Stockfish or any other even mediocre game engine.
I say all of this as a 15-year developer. This happens over and over again throughout history. Something comes along to disrupt an industry or profession and people scream about how dangerous or bad it is, but it never matters in the end. Technology is undefeated.
Claude, etc, works best with good tests that verify the system works. And so the code becomes in some ways the tests rather than the code that does the thing. If you're responsible for the thing, then 90% of your responsibility moves to verifying behavior and giving agents feedback.
How does this work? When expectations about the program's state vs its observable behavior diverge?
Depending on exactly what the author meant here, I disagree. Our first and default tool should be some form of lightweight automated testing. It's explicit (serves a form of spec and docs how to use the software), it's repeatable (manual testing is done once and it's result is invalidated moments later), and it's cost per minute of effort is more or less the same (most companies have the engineers do the tests, they are expensive).
Yes. There will be exceptions and exceptional cases. This author is not talking about exceptions and neither am I. They're not an interesting addition to this conversation.
How is this a issue, its genuinely common sense.
And...code that has been 100% reviewed, even if it was fully LLM generated.
I remember when I was working at a startup and a new engineer merged his code and it totally broke the service. I asked him if he ran his code locally first and he stared at me speechless.
Running the code locally is the easiest way to eliminate a whole series of silly bugs.
Like mentioned in the article adding a test and then reverting your change to make sure the test fails is really important, especially with LLMs writing tests. They are great at making things look like they work but completely don’t.
The point is to hire people who can own code and codebase. “Someone will review” is dead end.
This only happens because the software industry has fallen into the Religion of Speed. I see it constantly: justified corner-cutting, rushing shit out the door, and always loading up another feature/project/whatever with absolutely zero self-awareness. AI is just an amplifier for bad behavior that was already causing chaos.
What's not being said here but should be: discipline matters. It's part of being a professional and always precedes someone who can ship code that "just works."
[1] https://ia.net/*
If you are dumping AI slop on your team to sort through, you are creating drag on the entire team's efforts toward those positive outcomes.
As someone getting dumped upon, you probably should make the decision (in line with the objective to producing positive outcomes) to not waste your time weeding through that stuff.
Review everything else, make it clear that the mess is not reviewable, and communicate that upward if needed.
The title doesn't go far enough - slop (AI or otherwise) can work and pass all the tests, and still be slop.
along with
- the job was better titled as “Analyst Programmer” - you need both.
And
- you can make a changeset, but you have to also sell the change
Devs already know this. Tell this to Managers, CEOs and non-engineers who believed billions worth of marketing BS. Cos devs don't have voice most of the time. They set the timelines. The want to push this end results to their team/company. So that is the constraints devs are working with. So to them, NOT to us Simon. WE KNOW! :)
1) Amen 2) I wonder if this is isolated to junior dev only? Perhaps it seems like that because junior devs do more AI assisted coding than seniors?
Edit: I'm an idiot ignore me.
Your job is to solve customer problems. Their problems may only be solvable with code that is proven to work, but it is equally likely (I dare say even more likely) that their problem isn't best solved with code at all, or even solved with code that doesn't work properly but works well enough.
Just a wild thought, nothing serious.
We already delegate accountability to non-humans all the time: - CI systems block merges - monitoring systems page people - test suites gate different things
In practice accountability is enforced by systems, not humans.. humans are defintiely "blamed" after the fact, but the day-to-day control loop is automated.
As agents get better at running code, inspecting ui state, correlating logs, screenshots, etc they're starting to operationally be "accountable" and preventing bad changes from shipping and producing evidence when something goes wrong .
At some point humans role shifts from "i personally verify this works" to "i trust this verification system and am accountable for configuring it correctly".
Thats still responsibility, but kind of different from whats described here. Taken to a logical extreme, the arguement here would suggest that CI shouldn't replace manual release checklists
First of all, no it’s not. Your job is to help the company succeed. If you write code that works but doesn’t help the company succeed, you failed. People do this all the time. Resume padding, for example.
Sometimes it’s better for the business to have two sloppy PRs than a single perfect one. You should be able to deliver that way when the situation demands.
Second, no one is out there proving anything. Like formal software correctness proofs? Yeah nobody does that. We use a variety of techniques like testing and code review to try to avoid shipping bugs, but there’s always a trade off between quality and speed/cost. You’re never actually 100% certain software works. You can buy more nines but they get expensive. We find bugs in 20+ year old software.
I guess to me, it's either the case that LLMs are just another tool, in which case the already existing teachings of best practice should cover them (and therefore the tone and some content of this article is unnecessary) or they're something totally new, in which case maybe some of the already existing teachings apply, but maybe not because it's so different that the old incentives can't reasonably take hold. Maybe we should focus a little bit more attention on that.
The article mentions rudeness, shifting burdens, wasting people's time, dereliction. Really loaded stuff and not a framing that I find necessary. The average person is just trying to get by, not topple a social contract. For that, look upwards.