by elzbardico
7 subcomments
- Funny thing is the structured output in the last example.
```
{
"reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47",
"finding": "Possible nil‑pointer dereference",
"confidence": 0.81
}
```
You know the confidence value is completely bogus, don't you?
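For reference, output like the JSON above is usually enforced with a schema along these lines (a minimal Python sketch assuming Pydantic v2; the class and field names just mirror the example and are not the tool's actual code). Validation constrains the shape of the output, but nothing makes the confidence number calibrated:
```
from pydantic import BaseModel, Field

class ReviewFinding(BaseModel):
    # Field names mirror the JSON example above; the model is asked to
    # fill them in, and the schema only checks shape and range.
    reasoning: str
    finding: str
    confidence: float = Field(ge=0.0, le=1.0)

raw = (
    '{"reasoning": "`cfg` can be nil on line 42; dereferenced without check on line 47", '
    '"finding": "Possible nil-pointer dereference", "confidence": 0.81}'
)

finding = ReviewFinding.model_validate_json(raw)
# Validation guarantees 0 <= confidence <= 1, but the value itself is just
# another sampled token: the model generated "0.81" the same way it
# generated the rest of the sentence.
print(finding.confidence)
```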
- The problem is that, regardless of how you try to use "micro-agents" as a marketing term, LLMs are instructed to return a result.
They will always try to come up with something.
The example provided was a poor one. The comment from the LLM was solid. Why would you comment out a step in the pipeline instead of just deleting it? I would make the same comment in a PR.
- I think they skipped over a non-obvious motivating example too fast. On first glance, commenting out your CI test suite would be very bad to sneak into a random PR, and that review note might be justified.
I could imagine the situation might actually be more nuanced (e.g. adding new tests and some of them are commented out), but there isn't enough context to really determine that, and even in that case, it can be worth asking about commented out code in case the author left it that way by accident.
Aren't there plenty of more obvious nitpicks to highlight? A great nitpick example would be one where the model will also ask to reverse the resolution. E.g.
```
final var items = List.copyOf(...);
  <-- Consider using an explicit type for the variable.
final List items = List.copyOf(...);
  <-- Consider using var to avoid redundant type name.
```
This is clearly aggravating since it will always make review comments.
- I agree with the sentiment of this post. In my personal experience, the usefulness of an LLM correlates positively with your ability to constrain the problem it should solve.
Prompts like 'Update this regex to match this new pattern' generally give better results than 'Fix this routing error in my server'.
Although this pattern seems true empirically, I've never seen any hard data to confirm it. This post is interesting, but it feels like a missed opportunity to back the idea with some numbers.
- What I saw using 5-6 tools like this:
- PR descriptions are never useful; they barely summarize the file changes
- 90% of comments are wrong or irrelevant, whether because they're missing context, missing tribal knowledge, missing code quality rules, or because they misinterpret the code change
- 5-10% of the time it actually spots something
Not entirely sure it's worth the noise
- > 2.3 Specialized Micro-Agents Over Generalized Rules
Initially, our instinct was to continuously add more rules into a single large prompt to handle edge cases
This has been my experience as well. However, it seems like the platforms (Cursor/Lovable/v0 et al.) are doing things differently.
For example, this is Lovable’s leaked system prompt, 1550 lines: https://github.com/x1xhlol/system-prompts-and-models-of-ai-t...
Is there a trick to making gigantic system prompts work well?
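For context, "specialized micro-agents" in the quoted section roughly means routing the same diff through several small, single-concern prompts instead of one giant rule list. A minimal Python sketch of that pattern (the `call_llm` helper, the agent names, and the prompt texts are assumptions for illustration, not the article's actual implementation):
```
# Sketch: one small prompt per concern instead of a single monolithic rule list.
# call_llm is a stand-in for whatever model client the review tool actually uses.

def call_llm(system_prompt: str, diff: str) -> str:
    raise NotImplementedError("replace with your model client")

MICRO_AGENTS = {
    "nil-safety": "You only look for possible nil/None dereferences. Ignore everything else.",
    "commented-out-code": "You only flag code that was commented out instead of deleted.",
    "secrets": "You only flag hard-coded credentials or tokens.",
}

def review(diff: str) -> dict[str, str]:
    # Each agent sees the same diff but has one narrow job, which keeps the
    # instructions short and the failure modes easier to debug than a
    # thousand-line prompt trying to cover every edge case.
    return {name: call_llm(prompt, diff) for name, prompt in MICRO_AGENTS.items()}
```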
by jangletown
0 subcomments
- "51% fewer false positives", how were you measuring? is this an internal or benchmarking dataset?
- "After extensive trial-and-error..."
IMO, this is the difference between building deterministic software and non-deterministic software (like an AI agent). It often boils down to randomly making tweaks and evaluating the outcome of those tweaks.
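That tweak-and-evaluate loop only really works against a fixed labeled set; something in the spirit of this Python sketch (the `run_agent` call, the labeled diffs, and the finding identifiers are assumptions, not the article's actual benchmark):
```
# Sketch of the kind of harness that makes a claim like "51% fewer false
# positives" measurable: a fixed set of diffs labeled with the findings a
# human reviewer actually agreed with.

def run_agent(diff: str) -> set[str]:
    raise NotImplementedError("replace with the review agent under test")

LABELED_DIFFS = [
    # (diff text, set of finding identifiers a reviewer accepted)
    ("...diff 1...", {"nil-deref"}),
    ("...diff 2...", set()),
]

def false_positive_rate(cases) -> float:
    flagged, wrong = 0, 0
    for diff, expected in cases:
        for finding in run_agent(diff):
            flagged += 1
            if finding not in expected:
                wrong += 1
    return wrong / flagged if flagged else 0.0

# Re-run after every prompt tweak; without a fixed set like this,
# "fewer false positives" is just an impression.
```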
by kurtis_reed
0 subcomments
- There was a blog post from another AI code review tool: "How to Make LLMs Shut Up"
https://news.ycombinator.com/item?id=42451968
by jstummbillig
2 subcomments
- The multi-agent thing with different roles is so obviously not a great concept that I am very hesitant to build towards it, even though it seems to win out right now. We want an AI that internally does what it needs to do to solve a problem, given a good enough problem description, tools and context. I really do not want to have to worry about breaking up tasks into chunks that are smaller than what I could handle myself, and I really hope that in the near future this will go away.
- I’ve been testing this for the last few months, and it is now much quieter than before, and even more useful.
- When I read "51% fewer false positives" followed immediately by "Median comments per pull request cut by half", it makes me wonder how many true positives they find. That's maybe unfair, as my reference is automated tooling in the security world, where the true-positive/false-positive ratio is so bad that a 50% reduction in false positives is a drop in the bucket.
by iandanforth
1 subcomment
- I learned from a recent post (https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...) that finding security issues can take 100+ calls to an LLM to get good signal. So I wonder about agent implementers who are trying to get good signal out of single calls, even if they are specialized ones.
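One common way to turn many noisy calls into usable signal is to sample the same check repeatedly and keep only findings that recur. A minimal Python sketch of that idea (the `call_llm` helper, the run count, and the agreement threshold are assumptions, not what the linked post actually did):
```
from collections import Counter

def call_llm(prompt: str) -> set[str]:
    # Stand-in for one non-deterministic review/security pass that
    # returns a set of finding identifiers.
    raise NotImplementedError("replace with your model client")

def stable_findings(prompt: str, runs: int = 20, min_agreement: int = 15) -> set[str]:
    # Run the same check many times and keep only findings that show up
    # in most runs; one-off hallucinations rarely repeat consistently.
    counts = Counter()
    for _ in range(runs):
        counts.update(call_llm(prompt))
    return {finding for finding, n in counts.items() if n >= min_agreement}
```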
- We tried something simple that surprisingly exposed a lot: just ran the same input twice through the agent at temp 0 and diffed the reasoning trace token by token. Didn't expect much honestly, but even small shifts showed up. One run said 'this may introduce risk', the other said 'this could cause issues'.. exact same code. Made us realise the prompt wasn't grounding the rationale path tightly enough. It wasn't hallucinating, just the "why" kept wobbling.
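Roughly what that double-run comparison looks like in practice, using the standard-library `difflib` (the `run_agent` call is a stand-in for one temperature-0 agent pass; the technique is just: same input twice, diff the traces):
```
import difflib

def run_agent(diff_text: str) -> str:
    # Stand-in for one agent pass at temperature 0 that returns
    # the full reasoning trace as text.
    raise NotImplementedError("replace with your agent call")

def trace_drift(diff_text: str) -> list[str]:
    # Run the identical input twice and diff the traces word by word;
    # any drift at temp 0 points at under-constrained reasoning rather
    # than a change in the input.
    a = run_agent(diff_text).split()
    b = run_agent(diff_text).split()
    return [line for line in difflib.ndiff(a, b) if line.startswith(("-", "+"))]
```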
- > Explicit reasoning improves clarity. Require your AI to clearly explain its rationale first—this boosts accuracy and simplifies debugging.
I wonder what models they are using because reasoning models do this by default, even if they don't give you that output.
This post reads more like a marketing blog post than real-world advice.
- Very vague post, light on details, and as usual it feels more like a marketing pitch for the website.
by curiousgal
2 subcomments
- > Encouraged structured thinking by forcing the AI to justify its findings first, significantly reducing arbitrary conclusions.
Ah yes, because we know very well that the current generation of AI models reasons and draws conclusions based on logic and understanding... This is the true facepalm.
by OnionBlender
0 subcomments
- What's funny about the bullet points in section 3 is that it only compares to the previous noisy agent, rather than having no agent. 51% fewer false positives, median comments per pull request cut by half, spending less time managing irrelevant comments? Turn it off and you could get a 100% reduction in false positives and spend zero time on irrelevant AI-generated comments.
- Ah, the joy of non-determinism. Have fun tweaking till you die. Also, I wish you a lot of fun giving your customers buttons to enable/disable options.
by bumbledraven
0 subcomments
- What model were they using?
- Lessons.