Layered PR review — because AI PRs ship 1.7× more issues

Scorecard question: How do you automate review on your PRs? Max‑score answer (3 pts): Layered — multiple AI reviewers on every PR (e.g. CodeRabbit / Greptile + Claude Code Action / Codex / Sentry Seer) + custom /review + multi‑agent review (e.g. /ultrareview).

The numbers are no longer hand-wavy. CodeRabbit’s State of AI vs Human Code Generation report (December 2025) analyzed 470 open-source GitHub pull requests and found that AI-co-authored PRs ship 1.7× more issues than human-only PRs: 10.83 issues per PR vs 6.45. Break it down by severity and the gap persists where it hurts most: 1.4× more critical issues and 1.7× more major issues. Drill into categories and you see why a single-tool review misses so much: AI-generated code shows 75% more logic/correctness issues, 3× more readability issues, up to 2.74× more security vulnerabilities, 2.66× more formatting problems, and ~2× more error-handling gaps.

This is not a “models will catch up next quarter” problem. It is the structural cost of generating code faster than any one reviewer, human or AI, can audit. The Register summarized the same data in December 2025 as “AI-authored code needs more attention, contains worse bugs,” and the industry numbers backed it: incidents per pull request jumped 23.5% in 2026 and change failure rates climbed 30%, even though 75% of developers said they were manually reviewing AI code.

Manual review alone is not enough. Single-tool AI review is not enough either: every reviewer has blind spots, and AI bug categories are wide enough that relying on one tool’s strengths (CodeRabbit on style and breadth) leaves the other tools’ strengths (Greptile on cross-file context, Seer on production-awareness) on the table. The max-score answer to this scorecard question (Q17), layered review, is the only structural fix that matches the structural problem. You stack three or four AI reviewers with different specialties on every PR, add a custom /review slash command for your repo’s specific anti-patterns, and run a multi-agent /ultrareview for high-risk diffs. Cost? An hour of setup and a few dollars of API spend per month. Upside? You catch the critical bug that ships at 11pm on Friday before it becomes a Saturday-morning rollback.
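As a quick sanity check, the headline 1.7× multiplier follows directly from the two per-PR averages quoted from the CodeRabbit report. The constant and function names below are illustrative only; the two figures are the only inputs taken from the text.

```python
# Per-PR issue averages as quoted from the CodeRabbit report in the text.
AI_ISSUES_PER_PR = 10.83
HUMAN_ISSUES_PER_PR = 6.45


def issue_ratio(ai: float, human: float) -> float:
    """Return how many times more issues AI co-authored PRs carry."""
    return ai / human


ratio = issue_ratio(AI_ISSUES_PER_PR, HUMAN_ISSUES_PER_PR)
extra = AI_ISSUES_PER_PR - HUMAN_ISSUES_PER_PR

# Prints: ratio: 1.68x, extra issues per PR: 4.38
print(f"ratio: {ratio:.2f}x, extra issues per PR: {extra:.2f}")
```

The report's 1.7× is this ratio rounded to one decimal place; in absolute terms it is a little over four additional issues on every AI-co-authored PR.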

A max-score Q17 setup is boringly mechanical day to day. You open a PR. Within about three minutes, three to five reviewers post in parallel: CodeRabbit (style, refactoring, breadth), Greptile (cross-file context, repo-aware logic), the Claude Code GitHub Action (triggered by @claude or a label), Codex Cloud review (or a Codex CLI hook on PR open), and Sentry Seer (production-aware: does this diff plausibly cause issues you’ve already seen in production?). You also have a custom /review slash command in .claude/commands/review.md that knows your codebase’s specific anti-patterns (your auth quirks, your data-residency rules, your “never call this API without idempotency keys” conventions), and you trigger it on every PR you author yourself. For high-risk diffs (auth, payments, migrations, public APIs), you also run /ultrareview: a multi-agent setup where a planner agent splits the diff by domain, dispatches sub-agents in parallel to review security, performance, tests, and breaking-change risk separately, and then reconciles the findings. You read the combined output, deduplicate the obvious overlap, and address the union of actionable findings. CI then runs a final auto-gate (lint, type-check, tests, secret scan) before merge.
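The routing decision in that workflow (everyday PRs get the always-on stack, high-risk diffs also get /ultrareview) can be sketched as a small path classifier. This is a minimal sketch, not part of any of the tools above: the glob patterns, `is_high_risk`, and `review_plan` are hypothetical and would need to match your own repo layout.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for the four high-risk areas named in the text
# (auth, payments, migrations, public APIs). Adjust to your repo.
HIGH_RISK_PATTERNS = [
    "*auth*",
    "*payment*",
    "*migrations/*",
    "api/public/*",
]


def is_high_risk(changed_paths: list[str]) -> bool:
    """Return True if any changed file matches a high-risk pattern."""
    return any(
        fnmatchcase(path.lower(), pattern)
        for path in changed_paths
        for pattern in HIGH_RISK_PATTERNS
    )


def review_plan(changed_paths: list[str]) -> list[str]:
    """List the review layers a diff should pass through, in order."""
    plan = ["always-on AI reviewers", "custom /review"]
    if is_high_risk(changed_paths):
        plan.append("/ultrareview")
    plan.append("CI auto-gate")
    return plan


print(review_plan(["docs/readme.md"]))
print(review_plan(["src/auth/session.py"]))
```

The patterns are deliberately broad, so a file such as `docs/authors.md` would also match `*auth*`. Erring toward an extra /ultrareview run is the cheaper mistake for this kind of gate, but tighter patterns are reasonable once you know your layout.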

The lower tiers are easy to spot:

  • 0 pts (“manual review only”): no AI reviewer on PRs; every comment is a human chasing the issues AI introduced. You ship more bugs per PR than your colleague using even one tool, and you don’t know it.
  • 1 pt (“one AI reviewer”): you have CodeRabbit (or Bugbot, or Greptile) on PRs. Measurably better than nothing, but you still miss the categories that tool isn’t great at. CodeRabbit’s breadth misses the cross-file logic Greptile catches; Greptile’s precision misses the style nits CodeRabbit surfaces; neither sees production.
  • 2 pts (“two AI reviewers + custom /review”): you’ve doubled up (CodeRabbit + Claude Code Action, or Greptile + Codex) and you have a repo-specific /review slash command. Substantially fewer surprises in prod, but still no production-aware reviewer and no multi-agent escalation for high-risk work.
  • 3 pts (max): the full layered stack, with three-plus parallel AI reviewers with different specialties, a custom /review slash command, and a multi-agent /ultrareview for high-risk diffs. Across a quarter, your bug-per-PR rate sits at the lower bound of human-authored PRs, not at the AI-authored 1.7× ceiling.
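The four tiers can be read as a simple decision rule. The sketch below encodes them as described above; the function name and parameters are hypothetical, and it assumes that a setup falling between two tiers (for example, two reviewers but no custom /review) rounds down, which the scorecard text does not state explicitly.

```python
def review_score(ai_reviewers: int, has_custom_review: bool, has_ultrareview: bool) -> int:
    """Map a PR review setup to the 0-3 point tiers.

    Assumption: in-between setups round down to the lower tier.
    """
    if ai_reviewers >= 3 and has_custom_review and has_ultrareview:
        return 3  # full layered stack
    if ai_reviewers >= 2 and has_custom_review:
        return 2  # two reviewers plus a repo-specific /review
    if ai_reviewers >= 1:
        return 1  # at least one AI reviewer
    return 0  # manual review only


print(review_score(ai_reviewers=1, has_custom_review=False, has_ultrareview=False))  # 1
print(review_score(ai_reviewers=4, has_custom_review=True, has_ultrareview=True))    # 3
```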

Current landscape (web‑search‑verified)

The AI code‑review market consolidated through 2025 around four or five serious tools with different specialties — and the data from 2026 head‑to‑heads (three‑week, 146‑PR, 679‑finding benchmark on DEV.to; Greptile’s own published benchmark; Surmado’s 5 Best CodeRabbit Alternatives; Rick Hightower’s Claude Code Ultrareview vs CodeRabbit vs Greptile writeup) is unambiguous: no single reviewer wins on every axis, and the strengths are non‑overlapping enough that stacking them is rational, not redundant. Pick reviewers by the gaps in the others, not by which one “wins.”

CodeRabbit is the breadth leader. In the 3‑week DEV.to benchmark, CodeRabbit produced 281 findings out of 679 (~41% of total) with 68.3% one‑click diff coverage — meaning more than two‑thirds of its suggestions came with a directly applicable patch you could accept with a button click. That’s the highest “applyability” of any tool in the field. Strengths: style, refactoring, formatting, surface‑level security, and the broadest catalog of “small annoying things you would have caught on a careful re‑read.” Pricing in 2026 is hourly‑rate‑limited and structurally friendly to small teams; the “active contributor” framing means you pay roughly per human, not per review. Weaknesses: shallower cross‑file context than Greptile, more noise on large PRs (you’ll dismiss 30–40% of suggestions as nitpicks), and its bug‑shaped finding rate is below Greptile’s. Use CodeRabbit as the always‑on first line — it’ll catch the obvious stuff in 60 seconds and let your other reviewers focus on what’s actually hard.

Greptile is the precision and cross‑file specialist. The DEV.to benchmark recorded zero false positives across 120 findings, ~92% bug‑shaped, and Greptile’s own benchmark claims 82% catch rate vs 58% for Bugbot and 44% for CodeRabbit on a curated bug set. Strengths: deep repo indexing, cross‑file logic, “this function changed in PR #1842, the caller in this PR didn’t get updated” kinds of catches that CodeRabbit cannot see. Greptile is purpose‑built for monoliths and large codebases where the answer to “is this safe?” depends on what’s three files away. Weaknesses: pricing is per‑review‑overage and harshest at scale (some teams are paying hundreds per month at high PR volume), and it generates fewer total findings — so on its own it can feel “quiet” compared to CodeRabbit. The right framing: Greptile is high‑signal, CodeRabbit is high‑coverage. Stack them.

The Claude Code GitHub Action (anthropics/claude-code-action) lets you trigger Claude Code reviews from inside GitHub itself — either by mentioning @claude in a PR comment, by adding a claude-review label, or via a workflow on: pull_request event. The killer property is that it shares your local Claude Code setup: the same CLAUDE.md, the same custom subagents, the same skills, the same /review slash command you use locally. Strengths: deep agentic reasoning (it can run your tests, read across files, reason about architecture, and even open follow‑up PRs with fixes), shares context with your local workflow, and you can wire it to a code‑reviewer subagent for repo‑specific conventions. Weaknesses: slower than CodeRabbit/Greptile (60–180s for a non‑trivial review), costs scale with usage, and it’s more sensitive to prompt quality than the more product‑ized tools. Use it as your agentic deep‑dive — the one reviewer that can not just flag a problem but propose and apply the fix.

OpenAI’s Codex Cloud (the ChatGPT‑hosted Codex surface, GA throughout 2026) ships a /review flow that runs against the open PR with GPT‑5.5 as the default model. Strengths: GPT‑5.5 leads on Terminal‑Bench 2.0 (77.3%) and is competitive on SWE‑bench Verified; the Cloud surface is fast (sub‑60s on most diffs) and the review tone is more “senior engineer” than “linter.” Like Claude Code Action, Codex review benefits hugely from a populated AGENTS.md in the repo root — the more conventions and gotchas you’ve written down, the sharper the review. Weaknesses: less deeply integrated with the GitHub PR UI than CodeRabbit; you’ll typically read the review in the Codex Cloud tab and bring conclusions back to the PR comments manually. Use Codex review when you want a second opinion from a different model family to break the Anthropic monoculture — having both Claude and GPT‑5.5 review the same diff routinely catches issues neither would alone.

Sentry Seer is the production‑aware reviewer — the one tool in the field that knows about your runtime errors, your slow transactions, your failed releases, because it’s sitting on top of your Sentry data. The DEV.to benchmark recorded Seer flagging 40 high‑severity and 6 critical‑severity bugs with a perfect 6/6 critical tier (zero false positives) — meaning when Seer flags something critical, it’s almost certainly worth your time. Strengths: ties PR diffs to actual production failure patterns (“this kind of nil check was the root cause of issue SENTRY‑1234 you fixed last sprint”), production‑plausible bug surfacing that no static‑only reviewer can replicate, and “active contributor” pricing that’s friendly for small teams. Weaknesses: only as good as your Sentry coverage — if you don’t have Sentry instrumented for the surface you’re shipping to, Seer has nothing to ground on. Use Seer as the last‑mile production gate: it’s the reviewer that catches “you fixed this exact bug six months ago, here’s the issue link.”
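The advice to pick reviewers by the gaps in the others can be sketched as a greedy coverage selection: keep adding whichever tool covers an axis nobody else covers yet, and stop when a tool would only duplicate existing coverage. The axis labels below paraphrase the specialties described in this section, and treating Bugbot as a second breadth tool is an assumption made for illustration; `pick_by_gap` is a hypothetical helper.

```python
# Illustrative axis labels paraphrasing the specialties described above.
REVIEWER_AXES = {
    "CodeRabbit": {"breadth"},
    "Bugbot": {"breadth"},  # assumption: overlaps CodeRabbit's axis
    "Greptile": {"cross-file precision"},
    "Claude Code Action": {"agentic"},
    "Codex Cloud": {"second model family"},
    "Sentry Seer": {"production"},
}


def pick_by_gap(candidates: dict[str, set[str]], limit: int) -> list[str]:
    """Greedily pick reviewers that each add at least one uncovered axis."""
    covered: set[str] = set()
    chosen: list[str] = []
    while len(chosen) < limit:
        # Choose the remaining candidate that adds the most uncovered axes.
        best = max(
            (name for name in candidates if name not in chosen),
            key=lambda name: len(candidates[name] - covered),
            default=None,
        )
        if best is None or not candidates[best] - covered:
            break  # every remaining tool only duplicates existing coverage
        chosen.append(best)
        covered |= candidates[best]
    return chosen


print(pick_by_gap(REVIEWER_AXES, limit=6))
```

With these inputs the selection stops at five tools and leaves out the second breadth tool, even though the limit allowed six.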

Custom /review and /ultrareview multi‑agent review

The two product-less layers, the ones you build yourself, are where max-score Q17 separates from off-the-shelf two-tool setups.

A custom /review lives at .claude/commands/review.md and encodes your repo’s specific anti-patterns: “always check we set the Authorization header before calling /api/admin/*,” “never throw inside an effect() block,” “every migration must have a rollback step.” It’s a short slash command, 30 to 80 lines of markdown, that you trigger on every PR you author. It is free, fast, and ruthlessly specific to your codebase in a way that no general-purpose reviewer can match.

/ultrareview is the multi-agent escalation: a slash command (or skill) that spawns 3–5 subagents, typically a security reviewer, a performance reviewer, a test-coverage reviewer, a breaking-change reviewer, and a UX reviewer, each with its own prompt and focus. It runs them in parallel via the Task tool, and a synthesizer agent then reconciles their findings into a deduplicated report. Rick Hightower’s writeup Claude Code Ultrareview vs CodeRabbit vs Greptile (April 2026) showed /ultrareview catching architectural issues neither CodeRabbit nor Greptile flagged, at the cost of 2–4 minutes per review. Reserve it for high-risk diffs.
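The fan-out and reconcile shape of /ultrareview can be sketched in plain Python. This is only an illustration of the pattern, not how Claude Code's Task tool is invoked: the stub reviewer functions stand in for real subagents, a thread pool stands in for parallel dispatch, and `Finding`, `synthesize`, and `ultrareview` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, NamedTuple


class Finding(NamedTuple):
    path: str
    line: int
    message: str
    reviewer: str


# Stub reviewers standing in for real subagents; each returns canned findings.
def security_reviewer(diff: str) -> list[Finding]:
    return [Finding("api/admin.py", 12, "Missing Authorization header check", "security")]


def breaking_change_reviewer(diff: str) -> list[Finding]:
    return [
        Finding("api/admin.py", 12, "missing authorization header check", "breaking-change"),
        Finding("api/admin.py", 40, "Response field renamed", "breaking-change"),
    ]


def synthesize(results: list[list[Finding]]) -> list[Finding]:
    """Merge findings, keeping the first report of each (path, line, message)."""
    seen: set[tuple[str, int, str]] = set()
    merged: list[Finding] = []
    for findings in results:
        for finding in findings:
            key = (finding.path, finding.line, finding.message.strip().lower())
            if key not in seen:
                seen.add(key)
                merged.append(finding)
    return merged


def ultrareview(diff: str, reviewers: list[Callable[[str], list[Finding]]]) -> list[Finding]:
    """Run every reviewer on the same diff in parallel, then reconcile."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        # pool.map preserves reviewer order, so the merge is deterministic.
        results = list(pool.map(lambda review: review(diff), reviewers))
    return synthesize(results)


report = ultrareview("<diff text>", [security_reviewer, breaking_change_reviewer])
for finding in report:
    print(f"{finding.path}:{finding.line} [{finding.reviewer}] {finding.message}")
```

Here the two reviewers report the same authorization problem with different capitalisation, and the synthesizer keeps one copy. A real synthesizer agent would match findings semantically rather than by exact text, which is the main thing this sketch simplifies away.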

  1. Inventory your current PR review automation. Open your last 10 merged PRs and count: how many AI reviewers commented? If the answer is 0 or 1, your current Q17 score is 0–1 pts and this guide is the fix. If 2, you’re at 2 pts and the gap to max is one production‑aware tool plus /ultrareview. Pin the inventory somewhere so you can re‑audit after.
  2. Install CodeRabbit as the always‑on first line. Go to coderabbit.ai, install the GitHub App, point it at your repo, and add a .coderabbit.yaml at the repo root tuning the noise level (set reviews.profile: "chill" for monorepos, "assertive" for tight teams). Within an hour, it’ll comment on every PR. Don’t over‑configure on day one — accept the defaults, see what you get, and trim noise after a week of data.
  3. Add Greptile as the high‑signal cross‑file reviewer. Sign up at greptile.com, install the GitHub App, and let it index your repo (large monorepos take 10–30 minutes; smaller repos finish in 2). Greptile and CodeRabbit do not duplicate each other — Greptile’s findings overlap CodeRabbit’s at roughly 10–15% on typical diffs, so the marginal cost of running both is low.
  4. Wire the Claude Code GitHub Action with your local CLAUDE.md and a code‑reviewer subagent. Add .github/workflows/claude.yml running anthropics/claude-code-action@v2 triggered on issue_comment containing @claude and on PRs labeled claude-review. Set ANTHROPIC_API_KEY as a repo secret. Critically, commit your CLAUDE.md and your .claude/agents/code-reviewer.md subagent definition — the Action shares context with your local Claude Code setup, so a populated CLAUDE.md and a custom code‑reviewer subagent are what make the Action sharp instead of generic.
  5. Enable Codex Cloud review for a second model family. In Codex Cloud (chatgpt.com/codex), connect your GitHub org and enable PR review. The friction here is lower than people think — it’s mostly OAuth and a per‑repo toggle. Commit an AGENTS.md at the repo root with the same conventions you put in CLAUDE.md (or symlink them). Now both Anthropic and OpenAI families review the same PRs in parallel, and you’ll routinely catch issues where one model’s blind spot was the other’s strength.
  6. Add Sentry Seer for production‑aware review. Pre‑requisite: you have Sentry actively instrumented in production. In Sentry, enable Seer for your project and install the Seer GitHub App. Seer will start commenting on PRs within a day, tying diffs to historical production issues. Spend the first week tuning what Seer comments on (it can be chatty on greenfield repos with low Sentry signal; trim with the severity_threshold setting).
  7. Write a custom /review slash command for your repo’s anti‑patterns. Create .claude/commands/review.md with a short prompt: “Review the current PR diff. Specifically check for: [your repo’s 5–10 actual rules]. For each violation, quote the line and explain the rule. End with a verdict: APPROVE / REQUEST CHANGES / NEEDS DISCUSSION.” Commit it. Now /review in Claude Code (or via the GitHub Action) runs your checklist, not a generic one. Update the prompt monthly as new anti‑patterns emerge from postmortems.
  8. Build /ultrareview for high‑risk diffs. Create .claude/commands/ultrareview.md (or a skill at .claude/skills/ultrareview/SKILL.md) that spawns parallel subagents — security, performance, tests, breaking‑change, UX — via the Task tool, each with a focused prompt. End with a synthesizer step that reconciles findings into a single deduplicated report. Trigger it manually on auth, payments, migration, and public‑API diffs. Expect 2–4 minutes per run; that’s fine — high‑risk diffs deserve the time.
  9. Set the CI auto‑gate as the last layer. Lint, type‑check, tests, secret scan all run on every PR via GitHub Actions and block merge on failure. This is not an AI reviewer — it’s the deterministic backstop under the AI layer. Without it, the AI reviewers have to also catch typos and unformatted imports, which wastes their budget on things a linter handles for free.
  10. Audit weekly for two weeks. At the end of each week, look at every merged PR. Tally: how many AI reviewers commented? Were findings actionable? What did each tool catch that the others missed? You’re looking for evidence that the layers are non‑overlapping — if CodeRabbit and Greptile catch the same 90% of issues, drop one. If your tools are catching different categories, you’re at max score and the only adjustment is tuning noise.
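The inventory in step 1 and the weekly audit in step 10 both reduce to simple tallies once you have exported the review comments. The sketch below assumes you already have that data in hand: the `(pr_number, reviewer_name)` pairs and the string finding identifiers are hypothetical formats, and no API of any reviewer is called. The 0.9 default mirrors the "same 90% of issues" rule from step 10.

```python
from itertools import combinations


def reviewers_per_pr(comments: list[tuple[int, str]]) -> dict[int, int]:
    """Count distinct AI reviewers that commented on each PR.

    `comments` holds (pr_number, reviewer_name) pairs, one per comment.
    """
    seen: dict[int, set[str]] = {}
    for pr_number, reviewer in comments:
        seen.setdefault(pr_number, set()).add(reviewer)
    return {pr: len(names) for pr, names in seen.items()}


def overlap(a: set[str], b: set[str]) -> float:
    """Share of the smaller tool's findings that the other tool also reported."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def redundant_pairs(
    findings: dict[str, set[str]], threshold: float = 0.9
) -> list[tuple[str, str]]:
    """Return tool pairs whose findings overlap at or above the threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(findings), 2)
        if overlap(findings[x], findings[y]) >= threshold
    ]


week = [(101, "CodeRabbit"), (101, "Greptile"), (101, "CodeRabbit"), (102, "CodeRabbit")]
print(reviewers_per_pr(week))  # {101: 2, 102: 1}

found = {"ToolA": {"f1", "f2"}, "ToolB": {"f1", "f2", "f3"}, "ToolC": {"f9"}}
print(redundant_pairs(found))  # [('ToolA', 'ToolB')]
```

Measuring overlap against the smaller set is a deliberate choice: a quiet, high-precision tool whose findings are all duplicated by a noisier one is the case you want to detect.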
  • Reviewer overload — the “AI noise floor” problem. With four or five reviewers commenting, a typical PR can accumulate 30–60 AI comments. Symptom: developers stop reading them, hit “resolve all,” and merge regardless. Fix: tune each tool’s noise floor in its config (reviews.profile: "chill" for CodeRabbit, severity_threshold for Seer, prompt the Claude/Codex reviewers to “only flag P0–P1 issues, ignore style”), and add a synthesizer step (/ultrareview’s reconciliation pattern works here too) that deduplicates across tools. The point is signal, not volume.
  • Conflicting suggestions between reviewers. CodeRabbit says “extract this into a helper,” Greptile says “inline it for clarity.” Symptom: developers freeze or pick the suggestion from whichever tool they trust more, leading to inconsistent code. Fix: write down a tie-breaker rule in CLAUDE.md that follows each tool’s specialty (“when CodeRabbit and Greptile disagree on style, prefer CodeRabbit; when they disagree on logic, prefer Greptile”), or have your custom /review slash command explicitly arbitrate. Don’t pretend tools never conflict.
  • No human in the loop. Symptom: PRs auto‑merge after all AI reviewers approve, and a bug lands in prod that any human reviewer would have caught in 30 seconds because it’s a product/UX issue, not a code issue. Fix: AI reviewers gate merge readiness, humans gate intent. Every non‑trivial PR still needs one human approval — the AI layer makes that approval cheaper (the human doesn’t have to chase style nits and typos), not unnecessary.
  • Stacking tools that overlap instead of complement. Running CodeRabbit + Cursor Bugbot gives you two breadth tools: you’ll get more findings, but mostly the same findings. Symptom: marginal cost of the second tool is real, marginal signal is near zero. Fix: pick reviewers by gap, not by brand. Breadth (CodeRabbit) + Precision (Greptile) + Agentic (Claude Action) + Different model (Codex) + Production (Seer) is five non-overlapping axes.
  • Treating /ultrareview as the daily driver. Symptom: every PR runs the 2–4‑minute multi‑agent pipeline, the PR queue clogs, developers start skipping PRs to “save time.” Fix: /ultrareview is for high‑risk diffs only — auth, payments, migrations, public APIs. Everyday PRs get the always‑on stack (CodeRabbit + Greptile + Claude/Codex + Seer). Save the heavy machinery for diffs that deserve it.
  • Forgetting to commit CLAUDE.md / AGENTS.md / .coderabbit.yaml. Symptom: your Claude Code Action and Codex Cloud reviews come back generic, missing your repo’s conventions. Fix: every reviewer that can read repo context (Claude Action, Codex Cloud, your custom /review) needs the context committed to the repo, not just to your local machine. Symlink CLAUDE.md and AGENTS.md so both ecosystems read the same source of truth.
  • Skipping the layer when CI is green. Symptom: CI passes, developer hits merge without reading any of the AI review comments. Fix: branch protection requires at least one AI reviewer’s “approve” state (CodeRabbit’s “LGTM” status check is the easiest one to require) in addition to a human approval. Make ignoring the AI layer impossible by policy, not by hope.
  • Every merged PR over the last two weeks has comments from at least three AI reviewers with different specialties (breadth + precision + agentic / production‑aware).
  • You can point to a .coderabbit.yaml, a Greptile install, a .github/workflows/claude.yml, a Codex Cloud connection, and a Seer install — all live and active.
  • .claude/commands/review.md exists and encodes your repo’s specific anti‑patterns; you trigger it on PRs you author.
  • .claude/commands/ultrareview.md (or skill) exists and you have triggered it on at least one high‑risk diff in the last month.
  • Branch protection requires both a human approval and at least one AI reviewer status check before merge.
  • Noise floors are tuned: AI comments per PR are typically 5–15 (signal), not 30–60 (noise).
  • Bug-per-PR rate over the last quarter is at or below the human-authored baseline, and measurably below the 1.7× AI-authored ceiling.
  • Postmortem retros routinely cite “the AI reviewer caught this” or “we need to teach our /review about this pattern” — proof the layer is doing work, not theater.
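The noise-floor fix for reviewer overload and the 5–15 comments-per-PR target in the checklist can be sketched as a filter over the combined comment stream. This is a minimal sketch under stated assumptions: it presumes you have already normalised each tool's output into a common shape, the `Comment` fields and the 0–3 priority scale are hypothetical, and deduplicating by file and line is a crude stand-in for the synthesizer step.

```python
from typing import NamedTuple


class Comment(NamedTuple):
    tool: str
    path: str
    line: int
    priority: int  # hypothetical normalised scale: 0 = P0 (most severe)
    text: str


def apply_noise_floor(comments: list[Comment], max_priority: int = 1) -> list[Comment]:
    """Keep P0-P1 comments and drop repeats of the same location across tools."""
    kept: list[Comment] = []
    seen: set[tuple[str, int]] = set()
    # Most severe first, so the strongest comment per location survives.
    for comment in sorted(comments, key=lambda c: c.priority):
        location = (comment.path, comment.line)
        if comment.priority <= max_priority and location not in seen:
            seen.add(location)
            kept.append(comment)
    return kept


def over_noise_budget(comments: list[Comment], high: int = 15) -> bool:
    """Flag a PR whose surviving comment count exceeds the 15-comment target."""
    return len(comments) > high


raw = [
    Comment("CodeRabbit", "a.py", 10, 3, "rename variable"),
    Comment("Greptile", "a.py", 20, 1, "caller not updated"),
    Comment("Seer", "a.py", 20, 0, "matches a past production crash"),
]
kept = apply_noise_floor(raw)
print([c.tool for c in kept], over_noise_budget(kept))  # ['Seer'] False
```

Two distinct problems on the same line would collapse into one comment under this rule, so treat the location key as a starting point rather than a finished deduplication strategy.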