As a result, I'm mostly using this selectively so far, and I wouldn't want it turned on by default for every PR.
Also the system prompts for some of them are kinda funny in a hopelessly naive aspirational way. We should all aspire to live and breathe the code review system prompt on a daily basis.
In our case, agentastic.dev, we just baked code review right into our IDE. It packages the diff with a prompt and sends it out to the agents of your choice (Claude, Codex) in parallel. The reason our users like it so much is that they don't need to pay extra for code review anymore. Hard to beat a free add-on, and the cherry on top is you don't have to read freaking poems.
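Not our actual implementation, but the fan-out is roughly this shape (the ReviewAgent type and wiring here are placeholders):

  // Sketch only: package the diff once, send it to several review agents in
  // parallel, and surface whatever comes back. How each agent is invoked
  // (Claude, Codex, CLI or API) lives behind the placeholder review() call.
  type ReviewAgent = {
    name: string;
    review: (prompt: string) => Promise<string>;
  };

  async function reviewDiff(diff: string, agents: ReviewAgent[]): Promise<void> {
    const prompt = `Review this diff. Flag real bugs only, skip style nits.\n\n${diff}`;
    // Fire all agents at once; one failure shouldn't hide the other reviews.
    const results = await Promise.allSettled(
      agents.map(async (a) => ({ name: a.name, comments: await a.review(prompt) }))
    );
    for (const r of results) {
      if (r.status === "fulfilled") {
        console.log(`--- ${r.value.name} ---\n${r.value.comments}`);
      } else {
        console.error(`agent failed: ${r.reason}`);
      }
    }
  }

The nice part of this shape is that adding another agent is just another entry in the array.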
However, I disagree that independence is a competitive advantage. If it’s true that having a “firewall” between the coding agent and review agent leads to better code, I don’t see why a company like Cursor can’t create full independence between their coding and review products but still bundle them together for distribution.
Furthermore, there might well be benefits to not being fully independent. Imagine if an external auditor was brought in to review every decision made inside your company. There would likely be many things they simply don’t understand. Many decisions in code might seem irrational to an external standalone entity but make sense in the broader context of the organization’s goals. In this sense, I’m concerned that fully independent code review might miss the forest for the trees relative to a bundled product.
Again, I’m rooting for you guys. But I think this is food for thought.
* Suggested silencing an exception instead of crashing, for "style" (the potential exception was handled earlier in the code, but it didn't manage to catch that context). When I commented that silencing the exception could lead to uncaught bugs, it replied "You're absolutely right, remove the try-catch" which I of course never added.
* Flagged our use of Python 3.14 as a logic error because "python 3.14 does not exist yet".
* "Review the async/await patterns Heavy use of async in model validation might indicate these should be application services instead." Whatever this vague sentence means. Not sure if it's suggesting we change the design pattern used across our entire code base.
Also the "confidence" score added to each PR being 4/5 or something due to these irrelevant comments was a really annoying feature IMO. In general AI tools giving a rating when they're wrong feels like a big productivity loss as then the human reviewer will see that number and think something is wrong with the PR.
--
Before this we were running Coderabbit which worked really well and caught a lot of bugs / implementation gotchas. It also had "learnings" which it referenced frequently so it seems like it actually did not repeat commenting on intentional things in our code base. With Coderabbit I found myself wanting to read the low confidence comments as well since they were often useful (so too quiet instead of too noisy). Unfortunately our entire Coderabbit integration just stopped working one day and since then we've been in a long back and forth with their support.
--
I'm not sure what the secret sauce is but it feels like Greptile was GPT 3.5-tier and Coderabbit was Sonnet 4.5-tier.
Two that stood out to me are Sentry and Vercel. Both have released code review tools recently and both feel misplaced. I can definitely see why they thought they could expand with that type of product offering but I just don't see a benefit over their competition. We have GH copilot natively available on all our PRs, it does a great job, integrates very well with the PR comment system, and is cheap (free with our current usage patterns). GH and other source control services are well placed to have first-class code review functionality baked into their PR tooling.
It's not really clear to me what Sentry/Vercel are offering beyond what Copilot does, and in my brief testing of them I didn't see a noticeable difference in quality or DX. Feels like they're fighting an uphill battle from day one with the product choice and are ultimately limited on DX by how deeply GH and other source control services allow them to integrate.
What I would love to see from Vercel, which they feel very well placed to offer, is AI powered QA. They already control the preview environments being deployed to for each PR, they have a feedback system in place with their Vercel toolbar comments, so they "just" need to tie those together with an agentic QA system. A much loftier goal of course but a differentiator and something I'm sure a lot of teams would pay top dollar for if it works well.
Ok good, now I know not to bother reading through any of their marketing literature, because while the product at first interested me, now I know it's exactly not what I want for my team.
The actual "bubble" we have right now is a situation where people can produce and publish code they don't understand, and where engineers working on a system no longer are forced to reckon with and learn the intricacies of their system, and even senior engineers don't gain literacy into the very thing they're working on, and so are somewhat powerless to assess quality and deal with crisis when it hits.
The agentic coding tools and review tools I want my team (and myself) to have access to are ones that force an explicit knowledge interview and acquisition process during authoring and involve the engineer more intricately in the whole flow.
What we got instead with claude code & friends is something way too eager to take over the whole job. And while it can produce some good results, it doesn't produce understandable systems.
To be clear, it's been a long time since writing code was the hard part of the job in many, many domains. The hard part is systems & architecture, and while these tools can help with that, there's nothing more potentially terrifying than a team full of people who have agentically produced a codebase whose nuances they cannot holistically understand.
So, yeah, I want review tools for that scenario. Since these people have marketed themselves off the table... what is out there?
> Independence
Any "agent" running against code review instead of code generation is "independent"?
> Autonomy
Most other code review tools can also be automated and integrated.
> Loops
You can also ping other code review tools for more reviews...
I feel like this article actually works against you by presenting the problems and then inadequately solving them.
So, absolutely necessary and essential?
In order to get the machine out of trouble when the unavoidable strange situation happens that didn't appear during training, and requires some judgement based on ethics or logical reasoning. For that case, you need a human in charge.
> Today's agents are better than the median human code reviewer
Which is it? You cannot have it both ways.
- it will have a higher chance of convincing the author that the issue was important by throwing more darts - something a human wouldn't do, because it takes real mental effort to go through an authentic review,
- it will sometimes find a real, big issue, which reinforces the bias that it's useful
- there will always be a tendency towards more feedback (not higher quality), because if it's too silent, is it even doing anything?
So I believe it will just add more rounds of back-and-forth prompting between more people, but I'm not sure it's a net positive.
Plus PRs are a good reality check on whether your code makes sense, when another person reviews it. A final safeguard before a maintainability miss, or a disaster waiting to be deployed.
Here's a summary of the top-level ideas behind it. Hope it's helpful!
Core Philosophy
- "Advisor, not gatekeeper" - Every issue includes a "Could be wrong if..." caveat because context matters and AI can't see everything. Developers make the final call.
(Just this idea makes it less annoying and stops devs going down rabbit holes, because it's pretty good at thinking about why it might be wrong)
- Prompt it to be critical but not pedantic - Focus on REAL problems that matter (bugs, security, performance), not style nitpicks that linters handle.
- Get the team to run it on the command line just before each commit. Small, focused reviews, not one review after batching 10 commits. Small diffs get better feedback.
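To make that concrete, a single finding comes back shaped roughly like this (simplified; the field names are illustrative, not the exact schema):

  // Rough shape of one review finding; the severity buckets match the
  // Critical / Major / Minor sections described further down.
  type Severity = "critical" | "major" | "minor";

  interface ReviewIssue {
    severity: Severity;
    file: string;
    summary: string;        // what looks wrong and why it matters
    couldBeWrongIf: string; // forces the model to say how context could invalidate it
  }

  const example: ReviewIssue = {
    severity: "major",
    file: "src/billing/invoice.ts",
    summary: "Amount is rounded before tax is applied, which can drop cents.",
    couldBeWrongIf: "Upstream already normalizes amounts to integer cents.",
  };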
Smart Context Gathering
- Full file contents, not just diffs - The tool reads complete changed files plus 1-level-deep imports to understand how changed code interacts with the codebase.
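Roughly what that gathering step looks like (simplified sketch with naive regex-based import resolution, assuming a TypeScript repo; a real version could use the compiler API instead):

  import { execSync } from "node:child_process";
  import { existsSync, readFileSync } from "node:fs";
  import * as path from "node:path";

  // Collect the full contents of changed files plus the files they import,
  // one level deep only.
  function gatherContext(): Map<string, string> {
    const changed = execSync("git diff --name-only HEAD", { encoding: "utf8" })
      .split("\n")
      .filter((f) => f.endsWith(".ts") && existsSync(f));
    const context = new Map<string, string>();
    for (const file of changed) {
      const src = readFileSync(file, "utf8");
      context.set(file, src);
      // Relative imports only; don't recurse into their imports.
      for (const m of src.matchAll(/from\s+["'](\.[^"']+)["']/g)) {
        const dep = path.resolve(path.dirname(file), m[1]) + ".ts";
        if (existsSync(dep) && !context.has(dep)) {
          context.set(dep, readFileSync(dep, "utf8"));
        }
      }
    }
    return context;
  }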
Prompt Engineering
- Diff-first, context-second - The diff is marked as "REVIEW THIS" while context files are explicitly marked "DO NOT REVIEW - FOR UNDERSTANDING ONLY" to prevent false positives on unchanged code. BUT that extra context makes a huge difference in correctness. (The prompt layout is sketched right after this list.)
- Structured output format - Emoji-prefixed bullets (Critical, Major, Minor), max 3 issues per section, no fluff or praise.
- Explicit "Do NOT" list - Prevents common AI review mistakes: don't flag formatting (Prettier handles it), don't flag TypeScript errors (IDE shows them), don't repeat issues across files, don't guess line numbers.
Final note
- Also plugged it into a GitHub Action as a last pass, but again non-blocking.
So in the end I think there will still be some disappointment, as one would expect it should be fully automated and only about reading the code, like this article suggests. In reality, I think it is harder than writing code.
A code review requires reasoning and understanding, things that to my knowledge a generative model cannot do.
Surely the most an AI code review ever could be is something that looks like a code review.
When developers create a PR, they already feel they are "done", and they have likely already shifted their focus to another task. False positives are horrible at this point, especially when they keep changing with each push of commits.
2. What on earth is this defense of their product? I could see so many arguments for why their code reviewer is the best, and this contains none of them.
More broadly, though, if you've gotten to the point where you're relying on AI code review to catch bugs, you've lost the plot.
The point of a PR is to share knowledge and to catch structural gaps. Bug-finding is a bonus. Catching bugs, automated self-review, structuring your code to be sensible: that's _your_ job. Write the code to be as sensible as possible, either by yourself or with an AI. Get the review because you work on a team, not in a vacuum.
I feel like these are often not well defined? "It's not a bug, it's a feature", "premature optimization is the root of all evil", etc.
In different contexts, "performant enough" means different things. Similarly, many times I've seen different teams within a company have differing opinions on "correctness"
I would be interested to hear of some specific use-cases for LLMs in code review.
With static analysis, tests, and formatters I thought code review was mostly interpersonal at this point. Mentorship, ensuring a chain of liability in approvals, negotiating comfort levels among peers with the shared responsibility of maintaining the code, that kind of thing.
What do the vendors provide?
I looked at a couple which were pretty snazzy at first glance, but now that I know more about how copilot agents work and such, I'm pretty sure in a few hours, I could have the foundation for my team to build on that would take care of a lot of our PR review needs....
I get the idea. I'll still throw out that having a single X go through the full workflow could still be useful in that there's an audit log, undo features (reverting a PR), notifications what have you. It's not equivalent to "human writes ticket, code deployed live" for that reason
> Based on our benchmarks, we are uniquely good at catching bugs. However, if all company blogs are to be trusted, this is something we have in common with every other AI code review product. One just has to try a few, and pick the one that feels the best.
https://www.augmentcode.com/blog/we-benchmarked-7-ai-code-re...
has anyone tried it?
Or it won’t understand some invariant that you know but is not explicit anywhere
You do get a handful of false positives, especially if what it reports is technically correct but we're just handling the issue in a sort of weird/undocumented way. But it's only one comment that's easy to dismiss, and it's fairly rare. It's not like huge amounts of AI vomit all over PRs. It's a lot more focused.
I would think this idea of creating a third-party to verify things likely centers more around liability/safety cover for a steroidal increase in velocity (i.e. --dangerously-skip-permissions) rather than anything particularly pragmatic or technical (but still poised to capture a ton of value)
> As the proprietors of an, er, AI code review tool suddenly beset by an avalanche of competition, we're asking ourselves: what makes us different?
> Human engineers should be focused only on two things - coming up with brilliant ideas for what should exist, and expressing their vision and taste to agents that do the cruft of turning it all into clean, performant code.
> If there is ambiguity at any point, the agents Slack the human to clarify.
Was this LLM advertisement generated by an LLM? Feels so at least.
still need HITL, but the human is shifted right and can do other things rather than grinding through fiddly details.
We spend a ton of time looking at the code and blocking merges, and the end result is still full of bugs. AI code review only provides a minor improvement. The only reason we do code review at all is humans don't trust that the code works. Know another way to tell if code works? Running it. If our code is so utterly inconceivable that we can't make tests that can accurately assess if the code works, then either our code design is too complicated, or our tests suck.
OTOH, if the reason you're doing code review is to ensure the code "is beautiful" or "is maintainable", again, this is a human concern; the AI doesn't care. In fact, it's becoming apparent that it's easier to replace entire sections of code with new AI generated code than to edit it.
I have to be fair and say that yes, occasionally, some bug slips past the humans and is caught by the robot. But these bugs are usually also caught by automated unit/integration tests or by linters. All in all, you have to balance the occasional bug with all the time lost "reviewing the code review" to make sure the robot didn't just hallucinate something.
Code review presupposes a different perspective, which no platform can offer at the moment because they are just as sophisticated as the model they wrap. Claude generated the code, and Claude was asked if the code was good enough, and now you want to be in the middle to ask Claude again but with more emphasis, I guess? If I want more emphasis I can ask Claude myself. Or Qwen. I can't even begin to understand this rationale.
Not my experience
> A human rubber-stamping code being validated by a super intelligent machine
What? I dunno how they define intelligence, but LLMs are absolutely not super intelligent.
> If agents are approving code, it would be quite absurd and perhaps non-compliant to have the agent that wrote the code also approve the code.
It's all the same frontier models under the hood. Who are you kidding.
since they're likely telling you things you know if you test and write your own code.
oh - writing your own code is a thing of the past - AI writes, AI then finds bugs
Can drop the extra words