Since it sticks pretty close to the spec and since TLA+ is about modifying state, the code it generates is pretty ugly, but ugly-and-correct code beats beautiful code that's not verified.
It's not perfect; something that naively adheres to a spec is rarely optimized. I've had to go in and swap pieces out for Tokio or Mio, or optimize a loop, because the generated code was too slow to be useful, and sometimes the code is just too ugly for me to put up with and I rewrite it. But the time that takes is generally considerably lower than if I were doing the whole translation myself.
The reason I started doing this: the stuff I've been experimenting with lately is lock-free data structures, and apparently what I'm doing is novel enough that Codex doesn't really generate what I want. It will still use locks and lock files, and when I complain it gives the traditional "You're absolutely right" and then proceeds to do everything with locks anyway.
In a sense, this is close to the ideal case that I actually wanted: I can focus on the high-level mathy logic while I let my metaphorical AI intern deal with the minutiae of actually writing the code. Not that I don't derive any enjoyment from writing Rust or something, but the code is mostly an implementation detail to me. This way, I'm kind of doing what I'm supposed to be doing, which is "formally specify first, write code second".
That said, we're not talking about vibe coding here, but properly reviewed code, right? So the human still goes "no, this is wrong, delete these tests and implement for these criteria"?
(A thing I think is under-explored is how much LLMs change where the value of tests lies. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: almost all the value I got from them came while writing the code. If I'd deleted the unit tests before merging, I'd still have gotten 90% of the value out of them. Now, the AI doesn't need unit tests as scaffolding as much as I do, _but_ having them in there makes future agentic interactions safer, because they act as reified context.)
What I notice is that Claude stumbles more on code that is illogical, unclear, or has bad variable names. For example, if a variable is named "iteration_count" but actually contains a sum, that will "fool" the AI.
So keeping the code tidy gives the AI clearer hints about what's going on, which gives better results. But I guess that's equally true for humans.
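For illustration, a hypothetical TypeScript sketch of the kind of misleading name meant here (the identifiers are made up):

const orders = [{ total: 19.99 }, { total: 5.0 }];

// Misleading: the name says "count", but the value is a running sum.
// A model (or a human) reading later code will assume it counts iterations.
let iterationCount = 0;
for (const order of orders) {
  iterationCount += order.total; // actually accumulating money, not iterations
}

// Clearer: the name matches what the variable holds.
let orderTotalSum = 0;
for (const order of orders) {
  orderTotalSum += order.total;
}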
Obviously with AI maybe those issues I have go away. But I really don’t like letting the AI modify tests without meticulously manually reviewing those changes, because in my experience the AI cares more about getting the tests passing than it does about ensuring semantic correctness. For as long as tests are manually maintained I will continue keeping them as few as necessary while maintaining what I view as an acceptable amount of coverage.
Surely they know 100% code coverage is not a magic bullet, because the code flow and the behavior can differ depending on the input. Just because you found a few examples that happen to hit every line of code doesn't mean you hit every possible combination. You are living in a fool's paradise, which is not a surprise, because only fools believe in LLMs. What you are really looking for is a formal proof of the codebase, which of course no one does because the costs would be astronomical (and LLMs are useless for it, which is not at all unique, because they are useless for everything software related, but they are particularly unusable for this).
AI will revolutionize software development if and when it does a far better job of producing correct code than humans.
Some of the advice is a bit more extreme than mine, e.g. I haven't found value in 100% code coverage, but 90% is fine. Other points miss nuance, e.g. we have to work hard to prevent the AI from subverting the type checks, because by default it works around type errors by using getattr/cast/type: ignore/Any everywhere.
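The getattr/cast/type: ignore/Any escape hatches above are Python; as a rough, hypothetical TypeScript analogue, the kind of type-check subversion worth linting against looks something like this:

declare const response: unknown;
declare const maybeId: string | undefined;

// Escape hatches that silence the type checker instead of satisfying it:
const user = response as any;   // "as any" erases the type entirely
console.log(user.nmae);         // typo, but no error because user is any
// @ts-ignore -- suppresses whatever error the next line produces
const port: number = "8080";
const id: string = maybeId!;    // non-null assertion papers over undefined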
One thing I'm hoping AI coders get better at is using static analysis and verification tools. My experiments here have been lukewarm to bad: adding an Alloy model checker for some parts of GFQL (GPU graph query language) took a lot of prodding and found no bugs. But straight-up asking Codex to do test amplification on our unit test suite, based on our code and past bugs, works great. Likewise, it's easy to have it port conformance tests from standards and help make our docs executable to prevent drift.
A new area we are starting to look at is automatic bug patches based on production logs. This is practical for the areas we set up for vibe coding, which in turn are the areas we care about most and work on most heavily. We never trusted automated dependency update bots, but this kind of thing gets much more trustworthy and reviewable. Another thing we are eyeing is new 'teleport' modes so we can shift PRs to remote async development, which previously we didn't think was worth supporting.
At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript (see the sketch after this list).
2. Fast auto-formatting and linting. An AI code review is not a substitute for a deterministic, sub-100ms result that guarantees consistency. The auto-formatter is set up as a post-tool-use Claude hook.
3. Side-effect-free imports and construction. You should be able to load all the code files and construct an instance of every class in your app without a single network connection being opened. This is harder than it sounds, and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks that override functions on existing types or globals. These effectively inject lies into the type checker.
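A minimal sketch of how items 1, 3, and 4 can fit together in TypeScript; the names (Http, UserStore, fakeHttp) are hypothetical, and neverthrow's ok/err/Result is the only real API used:

import { ok, err, Result } from "neverthrow";

// Item 3: construction is side-effect free -- the class takes its dependency
// as an interface and never opens a connection on its own.
interface Http {
  get(url: string): Promise<Result<string, Error>>;
}

// Item 1: no thrown errors -- failures come back as a typed Result.
class UserStore {
  constructor(private readonly http: Http) {}

  async userName(id: string): Promise<Result<string, Error>> {
    const body = await this.http.get("/users/" + id);
    return body.andThen((name) =>
      name.length > 0 ? ok(name) : err(new Error("empty response"))
    );
  }
}

// Item 4: no mocking framework -- the test hands in a plain fake that still
// has to satisfy the Http interface, so the type checker stays honest.
const fakeHttp: Http = { get: async () => ok("Ada") };

new UserStore(fakeHttp)
  .userName("42")
  .then((r) => console.log(r.isOk() && r.value === "Ada")); // true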
Shout-out to tsgo, which has dramatically lowered our type-checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
Now I can leave an agent running, come back an hour or two later, and it's written almost perfect, typed, extremely well tested code.
if err != nil {
return fmt.Errorf(...)
}
no matter what kind of glue vibe coders snorted that day.

The disruption comes from the economics of cognitive labor: synthetic assistants are making feasible things that were previously unbearably cognitively costly, which is why we used to invest all that energy into the code parts manually.
I've made this to leverage that:
If 100% code coverage is a good thing, you can't tell me anyone (including parallel AI bots) is going to do this correctly and completely for a given use case in 60 seconds.
I don't mind it being fast, but to sell it as 60-seconds fast while trying to give the appearance that you support high-quality and correct code isn't possible.
I'd also really love to see a study of how the effort it takes, on average, to write (by carefully shepherding an agent or otherwise) bullet-proof tests and other guardrails for LLM-generated code compares to the effort of writing the code by hand.
How are LLMs going to stay on top of new design concepts, new languages, really anything new?
Can LLMs be trained to operate "fluently" with regards to a genuinely new concept?
I think LLMs are good for writing certain types of "bad code", e.g. if you're learning a new language or trying to quickly create a prototype.
However to me it seems like a security risk to try to write "good code" with an LLM.
Including using more rigidly typed languages, making sure things are covered with tests, using code analysis tools to spot anti-patterns, addressing all the warnings, etc. That was always a good idea, but now we have even fewer excuses to skip all that.
>CEO of an AI company
Many such cases
I have an over-developed, unhealthy interest in the utility of types for LLM generated code.
When an llm is predicting the next token to generate, my current level of understanding tells me it makes sense for the llm's attention mechanism to use the surrounding type signatures (in the case of an explicitly typed language) or the compiler error messages (where a language leans on implicit typing) to better predict that next token.
However, that does not seem to be the behaviour i observe. What i see is more akin to tokens in the type signature position in a piece of code often being generated without any seeming relationship to the instructions being written. It's common to generate code that the compiler rejects.
That problem is easily hidden and worked around - just wrap your llm invocation in a loop, feed in the compiler errors each time and you now have an "agent" that can stochastic gradient descent its way to a solution.
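For concreteness, that loop is roughly the following (TypeScript, with callLLM and compile as hypothetical stand-ins rather than any particular SDK):

// Hypothetical helpers: callLLM asks the model for code, compile returns the
// list of compiler error messages (an empty list means it type checks).
declare function callLLM(prompt: string): Promise<string>;
declare function compile(source: string): Promise<string[]>;

async function generateUntilItCompiles(task: string, maxAttempts = 5): Promise<string> {
  let prompt = task;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const source = await callLLM(prompt);
    const errors = await compile(source);
    if (errors.length === 0) return source; // it compiles -- which is all we actually know
    // Feed the compiler output back in and try again.
    prompt = task + "\n\nPrevious attempt:\n" + source + "\n\nCompiler errors:\n" + errors.join("\n");
  }
  throw new Error("still does not compile after " + maxAttempts + " attempts");
}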
Given this, you could say: well, what does it matter? Even if an LLM doesn't meaningfully "understand" the relationship between types and instructions, there's already a feedback loop and therefore a solution available - so why do we even need to care whether an llm treats types as a tool to accurately model the valid solution space?
Well i can't help thinking this is really the crux of software development. Either you're writing code to solve a defined problem (valuable), or you're doing something else that may mimic that to some degree but is not accurate (bugs).
All that said, pragmatically speaking, software with bugs is often still valuable.
TL;DR i'm currently thinking humans should always define the type signatures and test cases; these are too important to let an LLM "mid" its way through.
"Ship AI features and tools in minutes, not weeks. Give Logic a spec, get a production API—typed, tested, versioned, and ready to deploy."
Especially the part about TypeScript. My experience is that LLMs such as Claude Code work really well with vanilla JavaScript. Once you switch to TypeScript, you're tapping into a different language training set which is much smaller than the JS training set and which adheres to different conventions and principles.
The part about good test coverage makes sense, though I don't know if 100% coverage is the specific goal to aim for. You can have 100% coverage in terms of lines of code but still not test the relevant parameters that cause issues.
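A tiny hypothetical TypeScript example of that gap:

// A single test gives this function 100% line coverage...
function perItemCost(total: number, itemCount: number): number {
  return total / itemCount;
}

console.assert(perItemCost(10, 2) === 5); // passes, and every line is covered

// ...but the parameter that actually causes issues was never exercised:
// perItemCost(10, 0) silently returns Infinity instead of failing.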
My definition of good code is more about architecture; modularity, separation of concerns, minimal interfaces, choosing good abstractions and layering them appropriately, clearly separating trust boundaries with appropriate validation... Once the LLM sees certain things, it lets you tap into a "world class software engineer" training set.
A lot of the points mentioned in the article differentiate a junior developer from a mid-level developer... If you want the LLM to output 10x-software-engineer quality, the patterns are different and more nuanced... It goes beyond just having good test coverage.
Other than that, sure, good advice. If at all possible you should have watch -n 2 run_tests or tests running on a file watcher on a screen while coding.
In my experience LLMs like to add assertions and tests for impossible states, which is quite irritating, so I'd rather not do the agentic vibe thing anyway.
Seems actively harmful, and the AI hype died out faster than I thought it would.
> Agents will happily be the Roomba that rolls over dog poop and drags it all over your house
There it is, folks!
Replies vary from silence to "I checked all the code" or "AI code is better than human code" or even "AI was not used at all", even when it is obvious it was 100% AI.