Anthropic just shipped one of its most interesting features yet: Code Review for Claude Code. Announced on March 9 by Boris Cherny, Claude Code’s product lead, the feature dispatches a team of AI agents to review every pull request for bugs, security issues, and logic errors. It is currently in research preview for Team and Enterprise customers.
On the surface this sounds like just another AI-assisted development tool. But when you dig into the architecture and the reasoning behind it, something more fundamental emerges about where software engineering is heading.
The bottleneck has shifted
Cherny’s framing is revealing. He says Anthropic built it for themselves first because code output per engineer is up 200% this year, and reviews had become the bottleneck. This is the quiet consequence of AI-assisted coding that nobody talks about enough. Tools like Claude Code, Copilot, and Cursor have made writing code dramatically faster. But the review process hasn’t scaled to match.
Every team that has adopted AI coding tools aggressively has felt this. Pull requests pile up. Reviewers skim instead of reading deeply. Subtle bugs slip through because a human who has already reviewed five large PRs that morning just doesn’t have the cognitive bandwidth for the sixth. The code is being written faster than it can be responsibly validated.
Anthropic’s answer is to throw more AI at the problem, and the way they’ve done it is genuinely clever.
Multi-agent architecture: depth over speed
Code Review doesn’t work like a simple linter or a single-pass AI check. When a PR is opened, the system dispatches multiple agents that work in parallel. Each agent examines the changes in the context of the full codebase, looking for different categories of issues: logic errors, security vulnerabilities, broken edge cases, and regressions. The agents then verify each other’s findings to filter out false positives, and rank the remaining bugs by severity.
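The fan-out, cross-check, and rank pattern described above can be sketched in a few lines. This is purely illustrative: the agent functions below are stand-ins for LLM calls, and none of the names reflect Anthropic's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Finding:
    category: str
    line: int
    severity: int  # 1 = low, 3 = critical
    note: str

def logic_agent(diff: str) -> list[Finding]:
    # Stand-in: a real agent would prompt a model with the diff plus full-codebase context.
    return [Finding("logic", 12, 2, "loop bound off by one")]

def security_agent(diff: str) -> list[Finding]:
    return [Finding("security", 12, 3, "token compared with ==")]

def verify(finding: Finding, diff: str) -> bool:
    # Stand-in for a second-pass agent re-checking a finding to filter false positives.
    return finding.severity >= 2

def review(diff: str) -> list[Finding]:
    # Dispatch specialized agents in parallel, verify their findings, rank by severity.
    agents = [logic_agent, security_agent]
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda agent: agent(diff), agents)
    findings = [f for batch in batches for f in batch]
    confirmed = [f for f in findings if verify(f, diff)]
    return sorted(confirmed, key=lambda f: f.severity, reverse=True)
```

The key structural choices are the ones Anthropic describes: agents run concurrently over the same diff, a verification pass prunes the combined findings, and only the ranked survivors reach the PR comment.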
The result lands on the PR as a single overview comment plus inline annotations for specific issues. Average review time is about 20 minutes.
The numbers from internal use at Anthropic are striking. Before Code Review, only 16% of PRs received substantive review comments. After, that jumped to 54%. On large PRs over 1,000 lines, 84% receive findings, averaging 7.5 issues. On small PRs under 50 lines, 31% get findings. And less than 1% of findings are marked as incorrect by engineers.
That last number is the one that matters most. A code review tool that cries wolf is worse than no tool at all.
The philosophical objection
The obvious criticism is the one that’s been circulating since the announcement: if AI is writing the code, why does AI need to review it? Why not just generate correct code in the first place?
Cherny has a compelling response. He frames it through the concept of test-time compute: the more tokens you throw at a problem, the better the result. But crucially, using separate context windows creates different blind spots. The agent that wrote the code has one perspective. A fresh agent reviewing that same code, with a different context window, sees things the first one missed.
This mirrors how human code review actually works. The author of a piece of code is the worst person to review it, not because they’re incompetent, but because they’ve been staring at the same assumptions for hours. A fresh pair of eyes catches what familiarity hides. Separate AI contexts achieve the same effect.
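The "fresh eyes" mechanism comes down to what each model call can see. A minimal sketch, using a stub in place of a real LLM API call: the author agent accumulates a conversation history, while the reviewer deliberately receives only the finished code, not the history that produced it.

```python
# call_model is a stand-in for an LLM API call; it reports how much
# conversational context it was given. Illustrative only.
def call_model(messages: list[dict]) -> str:
    return f"responding with {len(messages)} message(s) of context"

def author_writes(task: str) -> tuple[str, list[dict]]:
    # The author's context grows with every exchange, assumptions included.
    history = [{"role": "user", "content": task}]
    code = call_model(history)
    history.append({"role": "assistant", "content": code})
    return code, history

def reviewer_checks(code: str) -> str:
    # Fresh context window: only the code under review, none of the
    # author's conversation history or accumulated assumptions.
    fresh = [{"role": "user", "content": f"Review this code:\n{code}"}]
    return call_model(fresh)
```

The design choice worth noting is what `reviewer_checks` does not receive: passing the author's full history would reproduce exactly the blind spots the separate context is meant to eliminate.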
Real bugs, not style nits
What stands out in the early reports is the type of issues being caught. This isn’t about formatting or naming conventions. One example from Anthropic’s own usage: a single-line production change that appeared routine was flagged as critical because it would have broken authentication for an entire service. TrueNAS, an early-access customer, reported that Code Review surfaced a pre-existing type mismatch that was silently corrupting an encryption key cache during synchronization, a bug that human reviewers had missed entirely.
These are the kinds of bugs that cause production incidents at 2 AM on a Saturday. The ones that look perfectly fine in a diff view but have devastating downstream effects.
The cost question
At $15 to $25 per review, this is not cheap. For a team pushing 20 PRs a day, you’re looking at $300 to $500 daily, or roughly $6,000 to $10,000 per month. That’s a meaningful line item.
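The monthly figure follows from the per-review pricing, assuming roughly 20 working days per month:

```python
# Back-of-envelope check of the quoted cost range.
low, high = 15, 25                    # dollars per review (research-preview pricing)
prs_per_day = 20
working_days = 20                     # assumption: ~20 working days/month

daily = (low * prs_per_day, high * prs_per_day)            # (300, 500)
monthly = (daily[0] * working_days, daily[1] * working_days)  # (6000, 10000)
```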
The intuitive ROI argument is that if one caught bug per month saves a day of incident response, the tool pays for itself. That’s plausible, but it remains speculative. All the performance stats — the jump from 16% to 54% substantive reviews, the sub-1% false positive rate — are self-reported by Anthropic based on their own internal usage. There is no independent verification yet, and Anthropic’s codebase and review culture may not be representative of yours.
For high-volume teams already drowning in PR backlogs, the value proposition could be strong. But for smaller teams with lower PR volumes, it’s worth asking whether better human review practices or mature static analysis tools would deliver more per dollar spent.
It’s also worth remembering that this is still a research preview. Pricing, review quality, and capabilities are all subject to change. Committing serious budget to a tool whose long-term trajectory is still unclear warrants caution. The promise is real, but so is the uncertainty.
What it can’t do
It’s worth being honest about the limitations. Automated review, even multi-agent review, remains weak at detecting architectural flaws. It won’t tell you that your microservice boundaries are wrong, or that you’re building the wrong feature entirely. It doesn’t understand your business domain the way a senior engineer who’s been on the team for three years does.
And there’s a deeper concern that deserves more attention: the mentorship angle. Code review has always been one of the primary ways senior engineers teach junior engineers. It’s where you learn not just what works, but why certain approaches are preferred. If review becomes automated, that learning channel narrows. Teams need to think deliberately about how to preserve knowledge transfer as these tools become standard.
Where this is heading
Claude Code Review feels like an early signal of a broader architectural pattern: multi-agent systems where specialized agents collaborate on different aspects of a task. We’re moving past the era of “one AI does one thing” toward coordinated agent teams.
For engineering teams, the practical takeaway is this: if you’re using AI to write code but still reviewing it the old way, you have a growing mismatch between your production speed and your quality assurance speed. That gap is where bugs live.
Anthropic built this tool because they felt that pain themselves. The 200% increase in code output made it impossible to maintain review quality with human reviewers alone. Most teams adopting AI coding tools aggressively will reach that same inflection point.
Code Review won’t replace human reviewers. But it might let them focus on the things they’re actually best at: architectural judgment, domain reasoning, and mentoring the next generation of engineers, while the AI handles the painstaking line-by-line analysis that humans were never great at anyway.