At least in my experience, they are good at imitating a "visually" similar style, but they introduce a lot of hidden coupling that is easy to miss, since they don't understand the concepts they're imitating.
They think "Clean Code" means splitting code into tiny functions rather than writing cohesive ones. The Uncle Bob style of "Clean Code" is horrifying.
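A tiny sketch of the difference, with hypothetical function names (none of this is from a real codebase):

```python
# Over-split "tiny functions" style: each helper is trivial, but a reader
# must chase four definitions to understand one small computation.
def _strip_items(items):
    return [i.strip() for i in items]

def _drop_empty(items):
    return [i for i in items if i]

def _lowercase(items):
    return [i.lower() for i in items]

def normalize_tags_fragmented(items):
    return _lowercase(_drop_empty(_strip_items(items)))

# Cohesive alternative: the whole transformation is visible in one place.
def normalize_tags(items):
    """Trim whitespace, drop blanks, and lowercase each tag."""
    return [i.strip().lower() for i in items if i.strip()]
```

Both return the same result; the second is simply one coherent unit instead of a scavenger hunt.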
They're also very trigger-happy about adding methods to interfaces (or contracts) that leak implementation details, or that exist only for testing, which means the tests exercise the implementation rather than the behavior.
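A hypothetical illustration of both problems at once (the `Cache` interface and method names are invented for this example):

```python
from abc import ABC, abstractmethod

class Cache(ABC):
    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def put(self, key, value): ...

    # Leaky: an "eviction order" only makes sense for one backing strategy,
    # and it exists mainly so tests can poke at internals.
    @abstractmethod
    def _eviction_order(self): ...

class LRUCache(Cache):
    def __init__(self, capacity=2):
        self._capacity = capacity
        self._data = {}

    def get(self, key):
        value = self._data.pop(key, None)
        if value is not None:
            self._data[key] = value  # re-insert to mark as most recent
        return value

    def put(self, key, value):
        if len(self._data) >= self._capacity and key not in self._data:
            oldest = next(iter(self._data))
            del self._data[oldest]
        self._data[key] = value

    def _eviction_order(self):
        return list(self._data)
```

A behavioral test checks that the least-recently-used entry is gone after overflow; an implementation test like `assert cache._eviction_order() == ["b"]` breaks the moment the internal structure changes, even if behavior is identical.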
1 - Surprising success when an agent can build on top of established patterns & abstractions
2 - A deep hole of "make it work" when an LLM digs a hole it can't get out of and fails to anticipate edge cases or discover hidden behavior.
The same things that make it easier for humans to contribute code make it easier for LLMs to contribute code.
From https://github.com/feldera/feldera/blob/main/CLAUDE.md:
- Adhere to rules in "Code Complete" by Steve McConnell.
- Adhere to rules in "The Art of Readable Code" by Dustin Boswell & Trevor Foucher.
- Adhere to rules in "Bugs in Writing: A Guide to Debugging Your Prose" by Lyn Dupre.
- Adhere to rules in "The Elements of Style, Fourth Edition" by William Strunk Jr. & E. B. White.
For example, mentioning The Elements of Style and Bugs in Writing has certainly helped our review LLM make some great suggestions on English documentation PRs in the past.
(BTW: I was not around then, so if I'm guessing wrong here please correct me!)
Over time compilers have gotten better, and we're now at the point where we trust them enough that we don't need to review the assembly or machine code for cleanliness, optimization, etc. In fact, we've moved at least one abstraction layer up.
Are there mission-critical inner loops in systems these days that DO need hand-written C or Assembly? Sure. Does that matter for 99% of software projects? Negative.
I'm extrapolating that AI-generated code will follow the same path.
You can't really run the experiment, because doing so would require isolating a bunch of software engineers and carefully measuring them through parallel test careers. I suppose you could measure it, but it would be expensive, time-consuming, and prone to massive experimental issues.
Although now you can sort of run the experiment with an LLM: clean code vs. unclean code. Redefine clean code to mean this other thing, rerun everything from a blank slate, and then give it identical inputs. Evaluate on tokens used, time spent, propensity for unit tests to fail, and rework.
The history of science and technology is people coming up with simple but wrong untestable theories which topple over once someone invents a thingamajig that allows tests to be run.
Instead, approach it as instructing the agent to write "perfect" code (whatever that means in the context of your patterns and practices, language, etc.).
How should exceptions be handled? How should parameters be named? How should telemetry and logging be added? How should new modules be added? What are the exact steps?
Do not let the agent randomly pick from your existing codebase unless it is already highly consistent; tell it exactly what "perfect" looks like.
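For instance, the itemized spec might look something like the snippet below. Every rule, path, and name here is illustrative, not from any real project:

```markdown
## Error handling
- Wrap external I/O in a typed error; never catch-and-ignore.
- Exceptions carry context (operation, input id), not stack-trace prose.

## Naming
- Parameters are full words (`timeout_seconds`, not `ts`).
- Booleans read as predicates (`is_ready`, `has_pending_writes`).

## Telemetry
- Every new module logs entry/exit at DEBUG and failures at ERROR.

## Adding a module (exact steps)
1. Create `src/<area>/<module>.py` with a module docstring.
2. Add unit tests in `tests/<area>/test_<module>.py` before wiring it in.
3. Register it in the area's `__init__.py` exports.
```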
No more than the exact order in which items are laid out in main memory matters now. That used to be a pretty significant consideration in software engineering until the early 1990s; with effectively 'unlimited' memory it is almost completely irrelevant.
Similarly, generating code, refactoring, and implementing large changes are now easy enough that you can just rewrite stuff later. If you are not happy with how something is designed, a two-sentence prompt fixes it in a million-line codebase in thirty minutes.
Other than spending more time on design, I also usually ask the agent to spawn a few subagents to review an implementation from different perspectives like readability, simplicity, maintainability, modularity etc, then aggregate and analyze their proposals and prioritize. It's not a silver bullet and many times there are no objective right answers, but it works surprisingly well.
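The fan-out/aggregate pattern described above can be sketched roughly like this. `spawn_review` is a hypothetical stand-in for whatever agent-spawning mechanism you use; a real setup would prompt an actual subagent:

```python
PERSPECTIVES = ["readability", "simplicity", "maintainability", "modularity"]

def spawn_review(perspective: str, diff: str) -> list[str]:
    # Stub: in practice this would ask a subagent to review `diff`
    # strictly from one perspective and return its proposals.
    return [f"[{perspective}] example proposal"]

def aggregate_reviews(diff: str) -> list[str]:
    proposals = []
    for perspective in PERSPECTIVES:
        proposals.extend(spawn_review(perspective, diff))
    # Deduplicate while keeping a stable priority order; a real setup
    # might instead score proposals by how many reviewers raised them.
    seen, prioritized = set(), []
    for p in proposals:
        if p not in seen:
            seen.add(p)
            prioritized.append(p)
    return prioritized
```

The aggregation step is where the value is: overlapping proposals from independent perspectives are usually the ones worth acting on first.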
DRY is a principle that comes up frequently. But is repetition really that bad when LLMs can trivially edit all instances of a pattern and keep them in sync? LLMs, by contrast, cannot untangle a leaky abstraction - the typical result of hastily applying DRY. So "clean code" in an era of LLMs might mean more explicit and repetitive, less abstract.
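A hypothetical before/after of that trade-off (the report functions are invented for illustration):

```python
# Hasty DRY: two reports shared three lines, so they were "deduplicated",
# and the shared helper sprouted flags for every caller's quirks.
def render_report(title, rows, kind, currency="$", show_total=True):
    lines = [title.upper()]
    for name, amount in rows:
        if kind == "invoice":
            lines.append(f"{name}: {currency}{amount:.2f}")
        else:
            lines.append(f"{name} - {amount}")
    if show_total and kind == "invoice":
        lines.append(f"TOTAL: {currency}{sum(a for _, a in rows):.2f}")
    return "\n".join(lines)

# Explicit repetition: two small functions an LLM can trivially edit in sync.
def render_invoice(title, rows):
    lines = [title.upper()]
    lines += [f"{name}: ${amount:.2f}" for name, amount in rows]
    lines.append(f"TOTAL: ${sum(a for _, a in rows):.2f}")
    return "\n".join(lines)

def render_inventory(title, rows):
    lines = [title.upper()]
    lines += [f"{name} - {amount}" for name, amount in rows]
    return "\n".join(lines)
```

The repetitive version is longer but every branch of behavior is local; the "DRY" version forces every future change through the `kind`/flag matrix.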
I'm not a fan of Clean Code[1], but the only tip I can give is: don't instruct the LLM to write code in the style of Clean Code by Robert Martin. Itemize all the things you view as clean code, and put that in CLAUDE.md or wherever. You'll have better luck that way.
[1] I'm also not that anti-Uncle-Bob as some are.
I threw in the towel last night and switched to codex, which has actually been following instructions.
Couple that with a self-correcting loop (design->code->PR review->QA review in playwright MCP->back to code etc), orchestrated by a swarm coordinator agent, and the quality increases even further.
When the training data spans code with varying styles, it's going to take effort to get this technology performing in a standardized way, especially when what's possible changes every three months.