Skip to content

Structuring a non-trivial change — Explore → Plan → code → PR → review-fix

Q25 · Operations & hygiene How do you structure a non-trivial change?

Max-score answer: “Full loop: parallel Explore subagents → Plan mode → code → auto-PR → review-fix loop.”

Why it matters: Skipping research and planning is the single biggest failure mode in 2026. The agent’s confidence is not correctness.

Why this matters in 2026 (confidence ≠ correctness, top failure mode)

Section titled “Why this matters in 2026 (confidence ≠ correctness, top failure mode)”

The biggest leak in a 2026 agent-first workflow is not the model, the IDE, or the MCP stack — it is that a confident agent will happily write 400 lines of plausible code in the wrong direction, in the wrong file, against the wrong abstraction, and finish with the same calm “Done!” it would produce for a correct change. Confidence is not correctness. The largest single contributor to wasted token spend and wall-clock time on non-trivial tasks is “wrong-direction rework” — sessions where the model committed edits in a direction the user would have rejected in the first sixty seconds if they had seen the plan. The model is producing a posterior over text, not over the codebase; absent forced research and planning it will reach for the most fluent-sounding answer rather than the most correct one. The fix is structural, not motivational. You do not solve confidence-not-correctness by reminding yourself to be careful. You solve it by making “research first, plan second, code third, ship fourth, fix fifth” the only path the agent is allowed to take — wiring each phase to a different mechanic so no step can be skipped without you noticing. That is the five-phase loop, and Q25 is the operational hygiene check that asks whether you have actually built it.

What “max score” actually looks like (the 5-phase loop, real example)

Section titled “What “max score” actually looks like (the 5-phase loop, real example)”

A max-score Q25 setup runs every non-trivial change through five mechanically distinct phases. Phase 1: parallel Explore subagents map the relevant files, call sites, tests, and APIs in isolated read-only contexts and return tight summaries to the main session. Phase 2: Plan mode turns the research into an explicit, file-by-file plan you read and iterate on before any side effect lands. Phase 3: code executes the approved plan with the plan pinned as the anchor. Phase 4: auto-PR fires the moment work finishes — a Stop hook commits, pushes, opens the PR. Phase 5: review-fix loop wakes the agent back up when CodeRabbit, Copilot Code Review, a human reviewer, or a failing GitHub Actions check posts new comments, addresses the actionable ones, and re-pushes with a bounded backoff until the comments stop.

Concrete example. The task: “swap our Stripe Checkout integration for Polar.” A direct-edit agent would open checkout.ts, swap the SDK, push, and break three webhook handlers, two test fixtures, and a Cloudflare Workers binding it never read. The max-score loop instead spawns three Explore subagents in parallel — mapping stripe call sites, reading webhook handlers and tests, fetching Polar SDK docs via Context7 — returning three short briefs in under two minutes. The main session enters Plan mode and writes a 14-step plan. You push back on step 7 (“the webhook secret should be a Cloudflare secret, not an env var”), iterate twice, approve. The agent codes with the plan pinned. The Stop hook commits, pushes, opens the PR. CodeRabbit posts three comments — two LGTMs, one actionable nit about a missing await. The agent fixes it, pushes, waits four minutes, gets no new comments, queues gh pr merge --auto. Total wall-clock: ~35 minutes for a change that would have been ~3 hours of edit-then-debug. That is the gap Q25 measures.

The lower tiers map cleanly onto missing phases. 0 pts: prompt and merge by hand. 1 pt: Phase 2 on big tasks but no parallel research. 2 pts: Phases 2 and 4 wired up, but Phase 1 pollutes main context and Phase 5 is “I’ll check later.” 3 pts (max): all five phases wired together, running by default.

The pieces of the loop are all standard 2026 mechanics now — they just rarely get composed end-to-end. Each phase has its own tooling, its own quirks, and its own failure modes, and the leverage comes from wiring them so the output of one is the input of the next without manual hand-off.

Phase 1: Parallel Explore subagents (research in isolated context)

Section titled “Phase 1: Parallel Explore subagents (research in isolated context)”

The Explore subagent is one of three built-in Claude Code subagent types in 2026 (alongside Plan and General-purpose), and the most underused primitive in the stack. Read-only by design, runs on Haiku for speed, and operates in its own context window so research does not pollute the main session’s tokens. When a task touches more than three files, spawn two to four Explore subagents in one turn, each with a tightly scoped research question (“map every file that imports stripe,” “read the webhook tests and summarise the contract,” “fetch Polar Checkout SDK docs via Context7”), and merge the summaries back. A 15-minute sequential research task takes 3–4 in parallel. The non-obvious win is context economy: the main session keeps a clean window with only synthesised briefs, not raw rg output of fifty files. That clean window is what makes Phase 2 work. The trap: subagents only parallelise cleanly when paths do not depend on each other. Cursor’s Agent tabs (3.0+, April 2026) offer a similar primitive; Codex CLI has none in May 2026 — closest workaround is multiple codex shells with scoped prompts.

Phase 2: Plan mode (turn research into approach)

Section titled “Phase 2: Plan mode (turn research into approach)”

Plan mode is the gate that turns research briefs into a reviewable, file-by-file plan before any side effect lands. In Claude Code, Shift+Tab twice cycles in (or /plan since January 2026); Cursor uses the same shortcut in the Agents window; Codex CLI has no native Plan mode — standard workaround is a two-step prompt (“produce a detailed plan only” → iterate → “now execute”) paired with --ask-for-approval on-request. The crucial discipline is iteration, not acceptance — two to five rounds before approval, pushing back on assumptions, demanding specifics, rejecting steps already done, asking for the rollback story. Each iteration costs ~20 seconds and saves minutes of wrong-direction edits. In Claude Code, Ctrl+G opens the plan file under ~/.claude/plans/ in your $EDITOR for surgical edits. Paste the approved plan into the Phase 3 prompt explicitly — do not rely on the agent to “remember” it across the mode transition.

Phase 3: Code (main session, with the plan as anchor)

Section titled “Phase 3: Code (main session, with the plan as anchor)”

Once research is done and the plan approved, execution should be the cheap, mechanical phase — the agent works through numbered plan steps with the plan pinned as load-bearing context. The discipline: do not edit the plan mid-code. If it turns out to be wrong, stop, exit edit mode, return to Phase 2, iterate, then re-enter. Do not let the agent silently improvise a different design — that is exactly the wrong-direction rework the loop prevents. Pair with Auto-Accept (Claude Code) or YOLO mode (Cursor) so throughput is not bottlenecked on hand-approving each diff. Boris Cherny’s public workflow — “plan in Plan mode, approve, drop into Auto-Accept, walk away” — is the right shape. Fielding clarifying questions here means the plan was underspecified; that is a Phase 2 bug, go back rather than answer inline.

Phase 4 is the auto-PR Stop hook (covered in depth in the Q16 guide). The instant the agent’s main turn ends with uncommitted work or unpushed commits, a Stop hook script — typically ~/.claude/hooks/auto-pr-watch.sh — resolves the default branch, branches off it if necessary, stages, commits with a conventional message, pushes, and opens the PR via gh pr create. The agent never sits on its hands waiting for “shall I push?” — the decision lives in code, not in your working memory. The hook is deterministic shell wired into ~/.claude/settings.json. Without Phase 4 the loop collapses back into manual bookkeeping and Q25 caps at 2 points no matter how good your Phases 1–3 are.

Phase 5: Review-fix loop (auto-respond to CodeRabbit/CI/human)

Section titled “Phase 5: Review-fix loop (auto-respond to CodeRabbit/CI/human)”

Phase 5 is the part most people skip and the reason their Q25 caps short of max. After the PR is open, the Stop hook keeps watching: top-level comments via gh pr view --comments, inline review comments via gh api repos/.../pulls/N/comments, review summaries via gh api repos/.../pulls/N/reviews, CI status via gh pr checks. When CodeRabbit (which spawns its own coding agent via Autofix on Pro plans, April 2026), Copilot Code Review (~71% actionable-feedback rate per GitHub’s 2026 telemetry), a human reviewer, or a failing GitHub Actions run posts new comments, the hook compares against a stored signature (action, branch/PR#, head_sha, max_comment_id) and wakes the agent with a fix prompt. The agent reads every actionable comment, implements the fix, commits, pushes, and the hook waits 4 minutes → 12 minutes for late comments — 4 minutes stays inside the prompt-cache TTL, 12 minutes covers slower CI and human follow-up. When the loop settles, gh pr merge --squash --delete-branch --auto queues the merge for whenever required checks pass. Never --admin-override. The signature file is what makes the loop terminate naturally.

Step-by-step: making the full loop your default

Section titled “Step-by-step: making the full loop your default”
  1. Audit your last ten non-trivial changes. For each, mark which of the five phases actually ran. If fewer than seven out of ten ran all five, the rest of this guide is the fix.

  2. Wire the Phase 1 trigger. Write a slash command (.claude/commands/explore-then-plan.md) that opens with “Spawn three Explore subagents in parallel: map all relevant files, read the existing tests, fetch external docs via Context7. Wait for all three, then enter Plan mode.” Now Phase 1 is one keystroke. For Cursor, a saved Composer prompt; for Codex CLI, a shell alias.

  3. Make Plan mode the default boot state. Hit Shift+Tab twice (or /plan) the moment you launch Claude Code, before typing the task. Treat it like seatbelts: not a choice, a reflex. The full Q6 guide has the habit-building protocol.

  4. Pin the approved plan into the Phase 3 prompt. When you exit Plan mode, paste the plan back into the next message as the explicit anchor: “Execute this plan. Do not improvise. If a step turns out to be wrong, stop and ask.” That line is the difference between Phase 3 staying on rails and the agent silently re-planning mid-code.

  5. Install the auto-PR Stop hook. Wire ~/.claude/hooks/auto-pr-watch.sh into ~/.claude/settings.json with a Stop matcher. The script should resolve the default branch dynamically, decide between ship / fix / noop from the current repo signature, and either drive gh directly or re-wake the agent. The Q16 guide has the script skeleton.

  6. Tune the Phase 5 backoff to your review tempo. 4 minutes → 12 minutes is the field-tested default. With CodeRabbit Pro Autofix, add a third 25-minute wait. The signature file ~/.claude/auto-pr-state/<hash>.json is what makes the loop self-terminate.

  7. Document the opt-out phrase. Pick a sentence like “don’t make a PR for this” — the agent skips Phase 4 entirely. Safety valve for spikes and prototypes. Add it to your project CLAUDE.md.

  8. Audit weekly for two weeks. Target: >70% of non-trivial changes ran all five phases. If below, the Phase 1 slash command is not sticking — make it more visible. After two weeks of >70% the loop is self-sustaining.

  • Skipping Phase 1 because “I know the codebase.” The most common failure mode in 2026 and the one the scorecard specifically targets. Even on a codebase you wrote, the model does not — and parallel Explore subagents are cheap (Haiku, isolated context, 3–4 minutes wall-clock). The cost shows up as wrong-direction rework in Phase 3.
  • Plan mode as a rubber stamp. Expert iteration is two to five rounds, pushing back on assumptions and demanding specifics. Approvals under 30 seconds on non-trivial tasks mean you are rubber-stamping.
  • Ignoring the plan mid-code. The agent will sometimes notice a “better” approach and silently take it. Pin the plan in the prompt, instruct the agent to stop and ask if a step is wrong, audit the diff against the plan before merge.
  • No review-fix loop. “I have a hook that opens PRs but check reviews manually” is mid-tier. Late comments are the norm — CodeRabbit follows up 30–90 seconds after your push, Copilot can take 2–5 minutes, humans drift in over hours. Without bounded backoff you pay the review-roundtrip tax on every PR.
  • Treating every comment as actionable. Acknowledgements, off-topic chatter, and answered questions should not trigger a Phase 5 fix cycle. Filter explicitly.
  • --admin-override on auto-merge. Never. Failing checks and CHANGES_REQUESTED reviews are real blockers — the loop surfaces them, never bypasses.
  • No signature file. Without the per-repo signature, the Stop hook spams the PR or wedges the agent into an infinite Phase 5 loop. The signature is what makes Phase 5 terminate.
  • Confusing “verified” with “tested.” Bot reviews do not replace E2E verification (Q18). Pair the loop with agent-driven browser tests.
  • On non-trivial tasks, you spawn two to four parallel Explore subagents before any planning, and their summaries return to a clean main context.
  • Plan mode is the default boot state — you decide whether to skip it, not whether to use it.
  • You iterate in Plan mode two to five times before approval; sub-30-second approvals are rare.
  • The approved plan is pinned into the Phase 3 prompt as an explicit anchor.
  • When work finishes, a PR appears on GitHub without you running gh pr create manually.
  • When CodeRabbit / Copilot / a human / a failing CI check posts comments, the agent wakes up and addresses them within ~5 minutes.
  • Auto-merge is queued with --auto, never forced with --admin.
  • An opt-out phrase (“don’t make a PR for this”) actually skips Phase 4.
  • Across a full week, more than 70% of non-trivial changes ran all five phases end-to-end.