Recovery Patterns

Your payment provider has a 90-second outage at 2am. Without a circuit breaker, every checkout request piles up waiting on a dead dependency, your connection pool drains, and the whole app falls over — not just payments. The fix is not “write a resilience framework.” It’s wiring a battle-tested library (opossum, cockatiel, Resilience4j) around the one flaky call, with the right thresholds and a fallback — and that is exactly the kind of mechanical-but-fiddly work an AI tool does well, if you point it at the real dependency instead of a toy example.

This recipe shows the prompts that generate runnable circuit breakers, retries, and fallbacks against named libraries, plus how to prove the breaker actually opens with a chaos test. The hand-rolled class CircuitBreaker is the thing the AI deletes, not the thing you ship.

What You’ll Walk Away With

A prompt that wraps a named flaky dependency in an opossum (Node) or Resilience4j (JVM) circuit breaker with state-change logging and a fallback
A prompt that adds jittered exponential-backoff retry to a specific fetch call, retrying only on 5xx and ECONNRESET — never on a 400
A prompt that writes a chaos test injecting a 500 from the payment service and asserting the breaker opens and the fallback fires
A before/after on wiring the Sentry MCP so the agent reads the production stack trace that justifies the breaker, instead of guessing
The failure modes that make AI-generated resilience code look correct in review and amplify an outage in production

The Workflow

The mistake that makes AI generate 200 lines of bespoke Semaphore and Bulkhead classes is asking it to “implement a circuit breaker.” Ask it instead to integrate a specific library around a specific call. Resilience is a solved problem — opossum and cockatiel on Node, Resilience4j on the JVM, tenacity on Python — and the agent writes a thin, correct adapter far more reliably than it invents the state machine.

Step 1: Wrap the flaky call in a real circuit breaker

The circuit breaker is the highest-leverage pattern: it stops a slow or dead dependency from exhausting your own resources. The workflow differs per tool — Cursor edits the route in place with the file open, Claude Code scaffolds and runs the unit test in one headless pass, and Codex opens the change on a branch — so here is the three-tool split.

Open the route file (src/routes/checkout.ts), select the raw paymentService.charge(...) call, and use agent mode (Cmd/Ctrl+I). Cursor edits the call in place and shows a diff you accept hunk-by-hunk — the right mode when you want to watch the integration land in a file you know.

Claude Code is the right tool when you want the breaker and a passing test in one non-interactive pass — for example inside a CI fixup job. Run it headless and let it edit, run the test, and iterate until green:

claude -p "Wrap paymentService.charge in src/routes/checkout.ts with an opossum \
circuit breaker (timeout 3000ms, errorThresholdPercentage 50, resetTimeout 30000), \
with a fallback that enqueues the order and returns { status: 'queued' }. Then add a \
vitest test in tests/checkout.breaker.test.ts that forces 5 consecutive rejections \
and asserts breaker.opened === true and the fallback ran. Run 'npm test' and fix \
failures until it passes." \
  --allowedTools Read Edit Bash

The -p flag runs once and exits (ideal for scripts); --allowedTools scopes what it can touch so a CI run can’t wander.

Codex shines when you want the change to arrive as a reviewable branch rather than edits in your working tree. From the CLI, let it operate in an isolated workspace and open a PR:

codex --sandbox workspace-write -c approval_policy=on-request \
  "Add an opossum circuit breaker around paymentService.charge in \
   src/routes/checkout.ts: timeout 3000ms, errorThresholdPercentage 50, \
   resetTimeout 30000, fallback enqueues the order and returns { status: 'queued' }. \
   Add a vitest spec proving the breaker opens after repeated failures. \
   Open a PR titled 'resilience: circuit-break payment charge'."

In Codex Cloud or the IDE extension the same prompt runs against a worktree, so the breaker and its test land on a branch you review — useful when the flaky call is in a service you do not want to edit live.

What the agent should produce is small. With opossum it is roughly this — note there is no state machine, just configuration and a fallback:

import CircuitBreaker from 'opossum';

const options = { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 };
const breaker = new CircuitBreaker((order: Order) => paymentService.charge(order), options);

breaker.fallback((order: Order) => {
  queue.enqueue('payments', order);
  return { status: 'queued' as const };
});
breaker.on('open', () => log.warn('payment breaker OPEN — shedding load to queue'));
breaker.on('halfOpen', () => log.warn('payment breaker HALF_OPEN — probing'));

export const chargeWithBreaker = (order: Order) => breaker.fire(order);

If you are on the JVM, the equivalent prompt should name Resilience4j and Spring Boot’s @CircuitBreaker annotation with a fallbackMethod, so the agent annotates your existing @Service method instead of hand-rolling a slidingWindow. On Python, name tenacity for retries and pybreaker for the breaker.

Step 2: Add a retry that doesn’t make the outage worse

Naive retries amplify load during a partial outage — three retries times every client equals a self-inflicted DDoS. The two rules that matter: jittered backoff (so clients don’t retry in lockstep) and a strict retryable-error predicate (never retry a 400 or a non-idempotent POST you can’t dedupe). Reach for cockatiel or p-retry on Node rather than a bespoke loop. This step is identical across all three tools — it is a single localized edit — so pick whichever you have open.

The “wrap the breaker inside the retry” detail matters and the agent will skip it unless told: retries should count toward the breaker’s failure rate, so a sustained outage opens the breaker instead of being masked by infinite local retries.

Step 3: Layer a fallback chain for graceful degradation

For reads, a fallback chain keeps the feature alive when the primary source is down: primary DB then read replica then cache (flagged stale) then a safe default. The opinionated point: each tier must be meaningfully degraded and labeled, or you ship silent staleness. Keep this as a small ordered list of async thunks — not a FallbackChain<T> framework.

The escape-hatch in that prompt — rethrow NotFound instead of falling through — is the line that separates resilience from a bug factory. A fallback that swallows a legitimate 404 and serves a cached profile is worse than the outage.

Step 4: Use the Sentry MCP so the agent reasons over the real failure

Up to here the agent is guessing which dependency is flaky. Wire the Sentry MCP and it reads the actual production issue — stack trace, frequency, the failing span — before proposing a breaker. That is the difference between “add resilience somewhere” and “circuit-break this call because it threw ETIMEDOUT 1,182 times in the last hour.”

MCP setup is nearly identical across the three tools — all three speak MCP, and Sentry is a hosted HTTP server, so there is no package to install:

Add to .cursor/mcp.json (or the global ~/.cursor/mcp.json), then approve it in Settings → MCP:

{
  "mcpServers": {
    "sentry": { "url": "https://mcp.sentry.dev/mcp" }
  }
}

claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

Then run /mcp in the REPL to complete the OAuth login.

Add to ~/.codex/config.toml:

[mcp_servers.sentry]
url = "https://mcp.sentry.dev/mcp"

With it connected, the before/after is stark. Before: “Add a circuit breaker to the checkout flow” produces a breaker on whatever call the agent guesses is risky. After:

If a persistent connection feels heavy for a one-off triage, a single-purpose Agent Skill can be lighter than a full MCP server. Browse skills.sh and install with the universal CLI — npx skills add <owner/repo> (from vercel-labs/skills) — which works across Cursor, Claude Code, and Codex. Use a skill when you want a focused, repeatable augmentation; use the MCP when you want the agent to hold a live, queryable connection to your incident data.

Step 5: Prove it with a chaos test

A circuit breaker you never tripped in a test is a circuit breaker you are debugging during an incident. Inject the failure and assert the recovery — this is the step generic guides skip.

Stand up a stub for the dependency you wrapped (e.g. nock for HTTP, or a mock of paymentService) that you can flip to failing on demand.
Drive enough failing calls to cross errorThresholdPercentage, then assert the breaker is open and the fallback ran — not just that an error was thrown.
Advance time past resetTimeout, send one success, and assert the breaker closes (the half-open probe). Use fake timers so the test runs in milliseconds.

For full system-level chaos beyond a unit test — killing a pod, adding latency at the network layer — generate a k6 load script or a Toxiproxy config with the same tool, but keep the breaker’s own behavior covered by the fast unit test above so it runs on every PR.

When This Breaks

AI-generated resilience code is dangerous precisely because it looks correct. These are the failure modes to check before you ship:

Thresholds too tight → false trips. The agent loves to default errorThresholdPercentage to a low number or volumeThreshold to a handful of requests, so a couple of slow calls open the breaker and you shed load you didn’t need to. Set the volume threshold high enough that a normal traffic blip can’t trip it, and base resetTimeout on your dependency’s real recovery time, not a round 60 seconds.

Retries amplifying a partial outage. If the retry policy wraps the outside of the breaker (or there’s no breaker), every client hammers a struggling dependency and turns a brownout into a full outage. Always nest the breaker inside the retry, and confirm retries count toward the failure rate.

Fallback masking a real failure. A fallback that catches everything — including 400s, auth errors, and missing-record 404s — converts loud bugs into silent wrong answers. Scope the fallback to infrastructure errors only and let domain errors propagate. This is the single most common defect in generated fallback chains.

Non-idempotent retries double-charging. Retrying a POST /charge without an idempotency key can bill a customer twice. If the call isn’t idempotent, the agent must add a dedupe key (or you must restrict retries to GET/idempotent operations) — ask for it explicitly.

The breaker hides the alert. Once a breaker opens and the fallback works, the user-facing error rate looks healthy and nobody gets paged — while a dependency is fully down. Emit a metric on breaker.on('open') and alert on that, so a tripped breaker is itself a signal.

What’s Next

Debugging Patterns — find the root cause that justifies a breaker in the first place
Logging Patterns — emit the structured, correlated logs that make a breaker’s state changes queryable
Monitoring Patterns — turn a tripped breaker into an alert that actually pages someone