> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.
The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).
Have the coding agents do the work of digging around and hunting down those frustratingly difficult bugs - don't have them write code on your behalf.
If you really want to understand the limitations of the current frontier models (and also really learn how to use them), ask the AI first.
By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said, you also need to understand how they fail and build an intuition for why they fail.
Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people who have elaborate context setups they crafted for less capable models; those setups are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
Related: lately I've been getting tons of Anthropic Instagram ads; they must account for nearly a quarter of the sponsored content I've seen over the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude, or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.
Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.
> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete
I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.
A "continuously running" mode that would ping me would be interesting to try.
All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. Also, they make for great UIs: you can create smart programs with just a CLI + MCP servers + MD files. Truly amazing tech.
There's a risk there that the AI could find the solution by looking through your history, instead of discovering it directly in the checked-out code. AI has done that in the past:
You can use Git hooks to do that. If you have tests and one fails, spawn an instance of Claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try to work out why'
    $ claude -p 'hello, just tell me a joke about databases'
    A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"
    $ 
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push:
> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.
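For concreteness, here's a minimal sketch of what such a pre-push hook might look like, assuming the claude CLI is on your PATH and reusing the hypothetical tests/test4.sh from above; treat the paths and prompt as placeholders rather than a tested setup.

    #!/bin/sh
    # .git/hooks/pre-push (sketch): run the test script and, on failure, hand the
    # failing test to Claude in headless mode, then block the push.
    if ! tests/test4.sh; then
        claude -p 'tests/test4.sh failed, look in src/ and try to work out why'
        exit 1  # non-zero exit: nothing gets pushed, per the hook contract quoted above
    fi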
I like this idea. I think I shall get Claude to work out the mechanism itself :)
It is even a suggestion on this Claude cheat sheet:
https://www.howtouselinux.com/post/the-complete-claude-code-...
You can build this pretty easily: https://github.com/jasonjmcghee/claude-debugs-for-you
I feel like the article is giving out very bad advice which is going to end up shooting someone in the foot.
As another example, I think prompts like "write unit tests for this code" are usually a similar sort of style transfer, based on how it writes the tests. It has a good sense of how to make sure all the functionality gets exercised, though I find it is less likely to produce "creative" ways that bugs may surface. But hey, it's a good start.
This isn't a criticism; it's intended as a further exploration and understanding of when these tools can be better than you might intuitively think.
Before I used Claude, I would have been surprised.
I think it works because Claude takes some standard coding issues and systematizes them. The list is long, but Claude doesn't run out of patience like a human being does. Or at least it has some credulity left after trying a few initial failed hypotheses. This being a cryptography problem helps a little bit, in that there are very specific keywords that might hint at a solution, but from my skim of the article, it seems like it was mostly a good old coding error, taking the high bits twice.
The standard issues are just a vague laundry list:
- Are you using the data you think you're using? (Bingo for this one)
- Could it be an overflow?
- Are the types right?
- Are you calling the function you think you're calling? Check internal, then external dependencies
- Is there some parameter you didn't consider?
And a bunch of others. When I ask Claude to debug something, the answer is always something that makes sense as a checklist item, but I'm often impressed by how diligently it follows the path set by the results of its investigation. It's a great donkey; it really takes the drudgery out of my work, even if it sometimes takes just as long.
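As a rough illustration (my own sketch, not something from the parent comment), that laundry list can be baked into a reusable headless prompt in the same claude -p style shown earlier; tests/failing_test.sh and src/ are made-up placeholders.

    # Sketch: feed the debugging checklist to Claude in headless mode.
    # tests/failing_test.sh and src/ are hypothetical paths for illustration.
    claude -p "tests/failing_test.sh is failing. Work through this checklist against src/:
    1. Are we operating on the data we think we are?
    2. Could anything be overflowing or getting truncated?
    3. Do the types match at every call boundary?
    4. Are we calling the function we think we are (internal first, then external dependencies)?
    5. Is there a parameter we have not considered?
    Report the most likely root cause and the exact lines involved. Do not edit any files."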
Quite different if you are not a cryptographer or a domain expert.
Last week I asked it to look at why a certain device enumeration caused a sigsegv, and it quickly solved the issue by completely removing the enumeration. No functionality, no bugs!