> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.
The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).
Have the coding agents do the work of digging around and hunting down those frustratingly difficult bugs - don't have them write code on your behalf.
If you really want to understand the limitations of the current frontier models (and also really learn how to use them), ask the AI first.
By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said, you also need to understand how they fail and build an intuition for why they fail.
Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people who have elaborate context setups they crafted for less capable models; those setups are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
Related: lately I've been getting tons of Anthropic Instagram ads; they must account for nearly a quarter of the sponsored content I've seen over the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude, or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.
Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.
> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete
I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.
A "continuously running" mode that would ping me would be interesting to try.
All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. Also, they make for great UIs: you can create smart programs with just a CLI + MCP servers + MD files. Truly amazing tech.
There's a risk there that the AI could find the solution by looking through your history, instead of discovering it directly in the checked-out code. AI has done that in the past:
You can use Git hooks to do that. If you have tests and one fails, spawn an instance of Claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try to work out why'
    $ claude -p 'hello, just tell me a joke about databases'
    A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"
    $ 
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push:
> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.
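For concreteness, here's a minimal sketch of what such a pre-push hook might look like, assuming the claude CLI is on your PATH and reusing the hypothetical tests/test4.sh from above; treat the paths and prompt as placeholders rather than a tested setup.

    #!/bin/sh
    # .git/hooks/pre-push (sketch): run the test script and, on failure, hand the
    # failing test to Claude in headless mode, then block the push.
    if ! tests/test4.sh; then
        claude -p 'tests/test4.sh failed, look in src/ and try to work out why'
        exit 1  # non-zero exit: nothing gets pushed, per the hook contract quoted above
    fi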
I like this idea. I think I shall get Claude to work out the mechanism itself :)
It is even a suggestion on this Claude cheat sheet:
https://www.howtouselinux.com/post/the-complete-claude-code-...
You can build this pretty easily: https://github.com/jasonjmcghee/claude-debugs-for-you
I feel like the article is giving out very bad advice which is going to end up shooting someone in the foot.
As another example, I think prompts like "write unit tests for this code" are usually a similar sort of style transfer, based on how it writes the tests. It has a good sense of how to make sure all the functionality gets exercised, though I find it is less likely to produce "creative" ways that bugs may surface. But hey, it's a good start.
This isn't a criticism; it's intended as a further exploration and understanding of when these tools can be better than you might intuitively think.
Before I used Claude, I would have been surprised.
I think it works because Claude takes some standard coding issues and systematizes them. The list is long, but Claude doesn't run out of patience like a human being does. Or at least it has some credulity left after trying a few initial failed hypotheses. This being a cryptography problem helps a little bit, in that there are very specific keywords that might hint at a solution, but from my skim of the article, it seems like it was mostly a good old coding error, taking the high bits twice.
The standard issues are just a vague laundry list:
- Are you using the data you think you're using? (Bingo for this one)
- Could it be an overflow?
- Are the types right?
- Are you calling the function you think you're calling? Check internal, then external dependencies
- Is there some parameter you didn't consider?
And a bunch of others. When I ask Claude to debug something, the answer is always something that makes sense as a checklist item, but I'm often impressed by how diligently it follows the path set by the results of its investigation. It's a great donkey; it really takes the drudgery out of my work, even if it sometimes takes just as long.
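As a rough illustration (my own sketch, not something from the parent comment), that laundry list can be baked into a reusable headless prompt in the same claude -p style shown earlier; tests/failing_test.sh and src/ are made-up placeholders.

    # Sketch: feed the debugging checklist to Claude in headless mode.
    # tests/failing_test.sh and src/ are hypothetical paths for illustration.
    claude -p "tests/failing_test.sh is failing. Work through this checklist against src/:
    1. Are we operating on the data we think we are?
    2. Could anything be overflowing or getting truncated?
    3. Do the types match at every call boundary?
    4. Are we calling the function we think we are (internal first, then external dependencies)?
    5. Is there a parameter we have not considered?
    Report the most likely root cause and the exact lines involved. Do not edit any files."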
Quite different if you are not a cryptographer or a domain expert.
Last week I asked it to look at why a certain device enumeration caused a sigsegv, and it quickly solved the issue by completely removing the enumeration. No functionality, no bugs!