Since it sticks pretty close to the spec and since TLA+ is about modifying state, the code it generates is pretty ugly, but ugly-and-correct code beats beautiful code that's not verified.
It's not perfect; something that naively adheres to a spec is rarely optimized. I've had to go in and swap pieces out for Tokio or Mio, or optimize a loop, because the generated code was too slow to be useful, and sometimes the code is just too ugly for me to put up with and I rewrite it. But the time that takes is generally considerably lower than if I were doing the whole translation myself.
The reason I started doing this: the stuff I've been experimenting with lately is lock-free data structures, and apparently what I'm doing is novel enough that Codex doesn't really generate what I want. It will still use locks and lock files, and when I complain it gives the traditional "You're absolutely right" and then proceeds to do everything with locks anyway.
In a sense, this is close to the ideal case that I actually wanted: I can focus on the high-level mathy logic while I let my metaphorical AI intern deal with the minutiae of actually writing the code. Not that I don't derive any enjoyment from writing Rust or something, but the code is mostly an implementation detail to me. This way, I'm kind of doing what I'm supposed to be doing, which is "formally specify first, write code second".
That said, we're not talking about vibe coding here, but properly reviewed code, right? So the human still goes "no, this is wrong, delete these tests and implement for these criteria"?
(A thing I think is under-explored is how much LLMs change where the value of tests lies. Back in the artisan hand-crafted code days, unit tests were mostly useful as scaffolding: almost all the value I got from them came while writing the code. If I'd deleted the unit tests before merging, I'd still have gotten 90% of the value out of them. Now, the AI doesn't need unit tests as scaffolding as much as I do, _but_ having them in there makes future agentic interactions safer, because they act as reified context.)
What I notice is that Claude stumbles more on code that is illogical, unclear, or has bad variable names. For example, if a variable is named "iteration_count" but actually contains a sum, that will "fool" the AI.
So keeping the code tidy gives the AI clearer hints about what's going on, which gives better results. But I guess that's equally true for humans.
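For illustration, a hypothetical TypeScript sketch of the kind of misleading name meant here (the identifiers are made up):

const orders = [{ total: 19.99 }, { total: 5.0 }];

// Misleading: the name says "count", but the value is a running sum.
// A model (or a human) reading later code will assume it counts iterations.
let iterationCount = 0;
for (const order of orders) {
  iterationCount += order.total; // actually accumulating money, not iterations
}

// Clearer: the name matches what the variable holds.
let orderTotalSum = 0;
for (const order of orders) {
  orderTotalSum += order.total;
}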
Obviously with AI maybe those issues I have go away. But I really don’t like letting the AI modify tests without meticulously manually reviewing those changes, because in my experience the AI cares more about getting the tests passing than it does about ensuring semantic correctness. For as long as tests are manually maintained I will continue keeping them as few as necessary while maintaining what I view as an acceptable amount of coverage.
Surely they know 100% code coverage is not a magic bullet, because the code flow and the behavior can differ depending on the input. Just because you found a few examples that happen to hit every line of code doesn't mean you hit every possible combination. You are living in a fool's paradise, which is not a surprise, because only fools believe in LLMs. What you are really looking for is a formal proof of the codebase, which of course no one does because the costs would be astronomical (and LLMs are useless for it, which is not at all unique, because they are useless for everything software related, but they are particularly unusable for this).
AI will revolutionize software development if and when it does a far better job of producing correct code than humans.
Some of the advice is a bit more extreme than mine, e.g. I haven't found value in 100% code coverage, but 90% is fine. Other points miss nuance, e.g. we have to work hard to prevent the AI from subverting the type checks, because by default it works around type errors by using getattr/cast/type: ignore/Any everywhere.
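The getattr/cast/type: ignore/Any escape hatches above are Python; as a rough, hypothetical TypeScript analogue, the kind of type-check subversion worth linting against looks something like this:

declare const response: unknown;
declare const maybeId: string | undefined;

// Escape hatches that silence the type checker instead of satisfying it:
const user = response as any;   // "as any" erases the type entirely
console.log(user.nmae);         // typo, but no error because user is any
// @ts-ignore -- suppresses whatever error the next line produces
const port: number = "8080";
const id: string = maybeId!;    // non-null assertion papers over undefined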
One thing I'm hoping AI coders get better at is using static analysis and verification tools. My experiments here have been lukewarm to bad: adding an Alloy model checker for some parts of GFQL (GPU graph query language) took a lot of prodding and found no bugs. But straight-up asking Codex to do test amplification on our unit test suite, based on our code and past bugs, works great. Likewise, it's easy to have it port conformance tests from standards and help make our docs executable to prevent drift.
A new area we are starting to look at is automatic bug patches based on production logs. This is practical for the areas we set up for vibe coding, which in turn are the areas we care about most and work on most heavily. We never trusted automated dependency update bots, but this kind of thing gets much more trustworthy and reviewable. Another thing we are eyeing is new 'teleport' modes so we can shift PRs to remote async development, which previously we didn't think was worth supporting.
At Qlty, we are going so far as to rewrite hundreds of thousands of lines of code to ensure full test coverage and end-to-end type checking (including database-generated types).
I’ll add a few more:
1. Zero thrown errors. These effectively disable the type checker and act as goto statements. We use neverthrow for Rust-like Result types in TypeScript (see the sketch after this list).
2. Fast auto-formatting and linting. An AI code review is not a substitute for a deterministic, sub-100ms result that guarantees consistency. The auto-formatter is set up as a post-tool-use Claude hook.
3. Side-effect-free imports and construction. You should be able to load all the code files and construct an instance of every class in your app without a single network connection being opened. This is harder than it sounds, and without it you run into all sorts of trouble with the rest.
4. Zero mocks and shared global state. By mocks, I mean mocking frameworks that override functions on existing types or globals. These effectively inject lies into the type checker.
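A minimal sketch of how items 1, 3, and 4 can fit together in TypeScript; the names (Http, UserStore, fakeHttp) are hypothetical, and neverthrow's ok/err/Result is the only real API used:

import { ok, err, Result } from "neverthrow";

// Item 3: construction is side-effect free -- the class takes its dependency
// as an interface and never opens a connection on its own.
interface Http {
  get(url: string): Promise<Result<string, Error>>;
}

// Item 1: no thrown errors -- failures come back as a typed Result.
class UserStore {
  constructor(private readonly http: Http) {}

  async userName(id: string): Promise<Result<string, Error>> {
    const body = await this.http.get("/users/" + id);
    return body.andThen((name) =>
      name.length > 0 ? ok(name) : err(new Error("empty response"))
    );
  }
}

// Item 4: no mocking framework -- the test hands in a plain fake that still
// has to satisfy the Http interface, so the type checker stays honest.
const fakeHttp: Http = { get: async () => ok("Ada") };

new UserStore(fakeHttp)
  .userName("42")
  .then((r) => console.log(r.isOk() && r.value === "Ada")); // true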
Shout-out to tsgo, which has dramatically lowered our type-checking latency. As the tok/sec of models keeps going up, all the time is going to get bottlenecked on tool calls (read: type checking and tests).
With this approach we now have near 100% coverage with a test suite that runs in under 1,000ms.
Now I can leave an agent running, come back an hour or two later, and it's written almost perfect, typed, extremely well tested code.
if err != nil {
return fmt.Errorf(...)
}
no matter what kind of glue vibe coders snorted that day.

The disruption comes from the economics of cognitive labor: synthetic assistants are making feasible things that were previously unbearably cognitively costly, which is why we used to invest all that energy into the code parts manually.
I've made this to leverage that:
If 100% code coverage is a good thing, you can't tell me anyone (including parallel AI bots) is going to do this correctly and completely for a given use case in 60 seconds.
I don't mind it being fast, but to sell it as 60-seconds fast while trying to give the appearance that you support high-quality and correct code isn't possible.
I'd also really love to see a study of how the effort it takes, on average, to write (by carefully shepherding an agent or otherwise) bullet-proof tests and other guardrails for LLM-generated code compares to the effort of writing the code by hand.
How are LLMs going to stay on top of new design concepts, new languages, really anything new?
Can LLMs be trained to operate "fluently" with regards to a genuinely new concept?
I think LLMs are good for writing certain types of "bad code", e.g. if you're learning a new language or trying to quickly create a prototype.
However to me it seems like a security risk to try to write "good code" with an LLM.
Including using more rigidly typed languages, making sure things are covered with tests, using code analysis tools to spot anti-patterns, addressing all the warnings, etc. That was always a good idea, but now we have even fewer excuses to skip all that.
>CEO of an AI company
Many such cases
I have an over-developed, unhealthy interest in the utility of types for LLM generated code.
When an llm is predicting the next token to generate, my current level of understanding tells me it makes sense for the llm's attention mechanism to use the surrounding type signatures (in the case of an explicitly typed language) or the compiler error messages (where a language leans on implicit typing) to better predict that next token.
However, that does not seem to be the behaviour i observe. What i see is more akin to tokens in the type signature position in a piece of code often being generated without any seeming relationship to the instructions being written. It's common to generate code that the compiler rejects.
That problem is easily hidden and worked around - just wrap your llm invocation in a loop, feed in the compiler errors each time and you now have an "agent" that can stochastic gradient descent its way to a solution.
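For concreteness, that loop is roughly the following (TypeScript, with callLLM and compile as hypothetical stand-ins rather than any particular SDK):

// Hypothetical helpers: callLLM asks the model for code, compile returns the
// list of compiler error messages (an empty list means it type checks).
declare function callLLM(prompt: string): Promise<string>;
declare function compile(source: string): Promise<string[]>;

async function generateUntilItCompiles(task: string, maxAttempts = 5): Promise<string> {
  let prompt = task;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const source = await callLLM(prompt);
    const errors = await compile(source);
    if (errors.length === 0) return source; // it compiles -- which is all we actually know
    // Feed the compiler output back in and try again.
    prompt = task + "\n\nPrevious attempt:\n" + source + "\n\nCompiler errors:\n" + errors.join("\n");
  }
  throw new Error("still does not compile after " + maxAttempts + " attempts");
}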
Given this, you could say: well, what does it matter? Even if an LLM doesn't meaningfully "understand" the relationship between types and instructions, there's already a feedback loop and therefore a solution available - so why do we even need to care whether an llm treats types as a tool to accurately model the valid solution space?
Well i can't help thinking this is really the crux of software development. Either you're writing code to solve a defined problem (valuable), or you're doing something else that may mimic that to some degree but is not accurate (bugs).
All that said, pragmatically speaking, software with bugs is often still valuable.
TL;DR i'm currently thinking humans should always define the type signatures and test cases; these are too important to let an LLM "mid" its way through.
"Ship AI features and tools in minutes, not weeks. Give Logic a spec, get a production API—typed, tested, versioned, and ready to deploy."
Especially the part about TypeScript. My experience is that LLMs such as Claude Code work really well with vanilla JavaScript. Once you switch to TypeScript, you're tapping into a different language training set which is much smaller than the JS training set and which adheres to different conventions and principles.
The part about good test coverage makes sense, though I don't know if 100% coverage is the specific goal to aim for. You can have 100% coverage in terms of lines of code but still not test the relevant parameters that cause issues.
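A tiny hypothetical TypeScript example of that gap:

// A single test gives this function 100% line coverage...
function perItemCost(total: number, itemCount: number): number {
  return total / itemCount;
}

console.assert(perItemCost(10, 2) === 5); // passes, and every line is covered

// ...but the parameter that actually causes issues was never exercised:
// perItemCost(10, 0) silently returns Infinity instead of failing.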
My definition of good code is more about architecture; modularity, separation of concerns, minimal interfaces, choosing good abstractions and layering them appropriately, clearly separating trust boundaries with appropriate validation... Once the LLM sees certain things, it lets you tap into a "world class software engineer" training set.
A lot of the points mentioned in the article differentiate a junior developer from a mid-level developer... If you want the LLM to output 10x-software-engineer quality, the patterns are different and more nuanced... It goes beyond just having good test coverage.
Other than that, sure, good advice. If at all possible you should have watch -n 2 run_tests or tests running on a file watcher on a screen while coding.
In my experience LLMs like to add assertions and tests for impossible states, which is quite irritating, so I'd rather not do the agentic vibe thing anyway.
Seems actively harmful, and the AI hype died out faster than I thought it would.
> Agents will happily be the Roomba that rolls over dog poop and drags it all over your house
There it is, folks!
Replies vary from silence to "I checked all the code" or "AI code is better than human code" or even "AI was not used at all", even when it is obvious it was 100% AI.