A security engineer's patience for AI-assisted code review is finite. In August 2025 we ran a two-week benchmark of five widely advertised GenAI review tools: GitHub Copilot Code Review, Amazon CodeGuru with Q Developer, CodeRabbit, Qodo Merge (formerly PR-Agent), and a frontier-model baseline using Claude Sonnet 4.5. Each tool ran against a seeded corpus of 240 defects spanning CWE-89, CWE-79, CWE-502, CWE-611, CWE-918, and a long tail of business-logic flaws drawn from real post-mortems, written in three languages: TypeScript, Python, and Go. We measured precision, recall, hallucination rate, and time-to-first-comment. The result: aggregate precision has improved meaningfully since our 2024 run, but hallucinations still average 18% across the field, and no tool cleared 70% recall on injection-class bugs. Here is what we found, and how it should change how you buy.
Which tool had the highest recall on our seeded defects?
CodeRabbit led overall recall at 64%, followed by the Claude Sonnet 4.5 baseline at 61%, Copilot Code Review at 54%, Qodo Merge at 49%, and CodeGuru at 41%. Recall was uneven by category. Every tool did well on obvious SQL and command injection via string concatenation (above 80% recall) and poorly on authorization flaws, which require understanding request context (below 30%). Deserialization flaws (CWE-502) sat in the middle, with Copilot surprisingly strong on Python pickle patterns (72% recall) after its July 2025 release.
Where are the tools still hallucinating findings?
Across 240 PRs, the tools produced 1,470 total comments, of which 268 were false positives (an 18% hallucination rate) that appeared plausible but were wrong. The failure modes cluster into three patterns: (1) asserting missing authentication on routes already guarded by middleware, (2) flagging hard-coded secrets that were actually test fixtures in a tests/ directory, and (3) inventing CVE numbers. CodeGuru produced the lowest hallucination rate at 11%; Qodo Merge was highest at 27%. The hallucinated-CVE pattern is especially dangerous because it can slip into release notes if reviewers do not verify.
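Failure mode (2) is cheap to guard against in your own triage layer. Here is a minimal sketch of a path-based suppression filter; the finding schema and helper names are illustrative, not any tool's real API:

```python
from pathlib import PurePosixPath

# Directories whose "hard-coded secrets" are almost always fixtures, not leaks.
TEST_DIR_NAMES = {"tests", "test", "__tests__", "testdata", "fixtures"}

def is_test_fixture(path: str) -> bool:
    """True if any path component is a conventional test directory."""
    return any(part in TEST_DIR_NAMES for part in PurePosixPath(path).parts)

def triage(findings: list[dict]) -> list[dict]:
    """Drop secret findings located under test directories; keep everything else."""
    return [
        f for f in findings
        if not (f["rule"] == "hardcoded-secret" and is_test_fixture(f["path"]))
    ]

findings = [
    {"rule": "hardcoded-secret", "path": "tests/fixtures/creds.py"},
    {"rule": "hardcoded-secret", "path": "src/app/config.py"},
]
print(triage(findings))  # only the src/ finding survives
```

A filter like this trades a small recall risk (a real secret committed under tests/) for a large precision gain, so route suppressed findings to a low-priority queue rather than discarding them outright.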
How do these tools perform on business-logic vulnerabilities?
Poorly, with one interesting exception. Business-logic defects, such as a missing tenant check in a list endpoint, averaged 21% recall across the field. The exception was Claude Sonnet 4.5 when we prepended the PR with a CLAUDE.md-style project brief describing the authorization model. Recall on business-logic issues rose from 24% to 51% with that context. This replicates a finding from Google's 2025 SECL benchmark: context of roughly 2,000 tokens of project intent improves recall on logic bugs by roughly 2x, while adding little to injection-class detection.
# Example business-logic bug (missed by every tool except the context-primed Claude baseline):
- items = db.list_items(owner_id=user.id)
+ items = db.list_items(owner_id=request.args.get("owner_id"))  # trusts caller-supplied owner_id
  return jsonify(items)
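The remediation is to keep queries scoped to the authenticated identity and reject any mismatched caller-supplied owner. A minimal sketch, extracted into a pure function for clarity (the function name and `fetch_items` callback are hypothetical, not from the benchmark corpus):

```python
def list_items_for(current_user_id: int, requested_owner_id, fetch_items):
    """Return items scoped to the authenticated user.

    A caller may still pass owner_id in the query string, but it must match
    the authenticated identity; anything else is a cross-tenant probe.
    """
    if requested_owner_id is not None and int(requested_owner_id) != current_user_id:
        raise PermissionError("cross-tenant access denied")
    # The query is always keyed on the authenticated id, never raw input.
    return fetch_items(owner_id=current_user_id)

items_db = {7: ["alpha", "beta"], 8: ["gamma"]}
fetch = lambda owner_id: items_db[owner_id]
print(list_items_for(7, "7", fetch))  # ['alpha', 'beta']
# list_items_for(7, "8", fetch) raises PermissionError
```

Keeping the tenant check in a pure function also makes it trivially unit-testable, which is exactly the kind of invariant the benchmark's business-logic category probes.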
How do they compare on time-to-first-comment and cost?
Copilot Code Review was fastest at a median 14 seconds on 500-line PRs and cheapest at roughly $0.03 per comment. CodeRabbit was slowest at a median 71 seconds but produced the most comments per PR (8.4 on average). The Claude baseline was mid-pack on speed at 28 seconds and the most expensive at $0.19 per PR, though its output benefited most from custom instructions; Qodo Merge offered the most configurable prompt templates. For high-velocity teams, Copilot and CodeGuru hit the sweet spot; for security-critical repos, CodeRabbit's higher recall offsets its latency.
Should GenAI review replace traditional SAST?
No. SAST and GenAI review are complementary. Semgrep, CodeQL, and commercial SAST still dominate on compliance-mandated checks, provenance, and repeatability. GenAI tools excel at reading diff context, proposing concrete remediations, and flagging patterns SAST rules do not cover. The right 2025 pattern: SAST as a blocking gate, GenAI as an advisory reviewer with human sign-off, and no tool permitted to merge to main without explicit approval. Teams that replaced SAST with GenAI in 2024 and 2025 uniformly reintroduced it within months.
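The blocking-gate/advisory split takes only a few lines of CI glue. A sketch under stated assumptions: SAST findings and AI comments arrive as already-parsed dicts, and the severity threshold is illustrative policy, not a standard:

```python
BLOCKING_SEVERITIES = {"critical", "high"}  # illustrative gate policy

def gate(sast_findings, ai_comments):
    """Return a CI exit code: SAST findings block, GenAI comments only advise."""
    blockers = [f for f in sast_findings if f["severity"] in BLOCKING_SEVERITIES]
    for c in ai_comments:
        # Advisory only: surfaced to the reviewer, never fails the build.
        print(f"[ai-advisory] {c['path']}: {c['message']}")
    for f in blockers:
        print(f"[sast-block] {f['path']}: {f['rule']} ({f['severity']})")
    return 1 if blockers else 0

# A high-severity SAST hit fails the build; AI comments alone never do.
print(gate([{"severity": "high", "path": "db.py", "rule": "sqli"}], []))   # 1
print(gate([], [{"path": "api.py", "message": "possible missing auth"}]))  # 0
```

The asymmetry is the point: deterministic findings carry exit-code authority, probabilistic findings carry only reviewer attention.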
How Safeguard Helps
Safeguard embeds AI-generated remediation inside the workflow reviewers already use, without replacing deterministic scanning. Every finding, whether surfaced by a static analyzer, an SBOM scanner, or a GenAI reviewer, lands in the same queue with provenance, severity, and a proposed fix. Guardrails detect and suppress hallucinated CVE references before they reach developers, and policy gates require a human approval on any AI-suggested change that touches authentication, authorization, or crypto-handling code. The net effect is the speed of a GenAI reviewer with the audit trail of a traditional SAST pipeline.
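A first-pass version of a CVE guardrail is straightforward to sketch: validate each identifier's syntax and year, then check it against a locally mirrored set of known IDs. This is a simplified illustration, not Safeguard's implementation; the mirror lookup stands in for a real NVD query:

```python
import re

CVE_PATTERN = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b")

def suspicious_cves(comment: str, known_ids: set[str]) -> list[str]:
    """Return CVE references that are malformed by year or unknown to the mirror."""
    flagged = []
    for match in CVE_PATTERN.finditer(comment):
        cve = match.group(0)
        year = int(match.group(1))
        # CVE IDs began in 1999, and a future year is impossible at review
        # time (2025 here); either case, or an ID absent from the local
        # mirror, is treated as a likely hallucination.
        if year < 1999 or year > 2025 or cve not in known_ids:
            flagged.append(cve)
    return flagged

known = {"CVE-2021-44228"}  # stand-in for a local NVD mirror
print(suspicious_cves("Related to CVE-2021-44228 and CVE-2024-99999.", known))
# → ['CVE-2024-99999']
```

Flagged references can then be held for human verification instead of flowing into release notes, which closes the hallucinated-CVE gap described above.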