Skip to content

Recovery Patterns

Your payment provider has a 90-second outage at 2am. Without a circuit breaker, every checkout request piles up waiting on a dead dependency, your connection pool drains, and the whole app falls over — not just payments. The fix is not “write a resilience framework.” It’s wiring a battle-tested library (opossum, cockatiel, Resilience4j) around the one flaky call, with the right thresholds and a fallback — and that is exactly the kind of mechanical-but-fiddly work an AI tool does well, if you point it at the real dependency instead of a toy example.

This recipe shows the prompts that generate runnable circuit breakers, retries, and fallbacks against named libraries, plus how to prove the breaker actually opens with a chaos test. The hand-rolled class CircuitBreaker is the thing the AI deletes, not the thing you ship.

  • A prompt that wraps a named flaky dependency in an opossum (Node) or Resilience4j (JVM) circuit breaker with state-change logging and a fallback
  • A prompt that adds jittered exponential-backoff retry to a specific fetch call, retrying only on 5xx and ECONNRESET — never on a 400
  • A prompt that writes a chaos test injecting a 500 from the payment service and asserting the breaker opens and the fallback fires
  • A before/after on wiring the Sentry MCP so the agent reads the production stack trace that justifies the breaker, instead of guessing
  • The failure modes that make AI-generated resilience code look correct in review and amplify an outage in production

The mistake that makes AI generate 200 lines of bespoke Semaphore and Bulkhead classes is asking it to “implement a circuit breaker.” Ask it instead to integrate a specific library around a specific call. Resilience is a solved problem — opossum and cockatiel on Node, Resilience4j on the JVM, tenacity on Python — and the agent writes a thin, correct adapter far more reliably than it invents the state machine.

Step 1: Wrap the flaky call in a real circuit breaker

Section titled “Step 1: Wrap the flaky call in a real circuit breaker”

The circuit breaker is the highest-leverage pattern: it stops a slow or dead dependency from exhausting your own resources. The workflow differs per tool — Cursor edits the route in place with the file open, Claude Code scaffolds and runs the unit test in one headless pass, and Codex opens the change on a branch — so here is the three-tool split.

Open the route file (src/routes/checkout.ts), select the raw paymentService.charge(...) call, and use agent mode (Cmd/Ctrl+I). Cursor edits the call in place and shows a diff you accept hunk-by-hunk — the right mode when you want to watch the integration land in a file you know.

What the agent should produce is small. With opossum it is roughly this — note there is no state machine, just configuration and a fallback:

import CircuitBreaker from 'opossum';
const options = { timeout: 3000, errorThresholdPercentage: 50, resetTimeout: 30000 };
const breaker = new CircuitBreaker((order: Order) => paymentService.charge(order), options);
breaker.fallback((order: Order) => {
queue.enqueue('payments', order);
return { status: 'queued' as const };
});
breaker.on('open', () => log.warn('payment breaker OPEN — shedding load to queue'));
breaker.on('halfOpen', () => log.warn('payment breaker HALF_OPEN — probing'));
export const chargeWithBreaker = (order: Order) => breaker.fire(order);

If you are on the JVM, the equivalent prompt should name Resilience4j and Spring Boot’s @CircuitBreaker annotation with a fallbackMethod, so the agent annotates your existing @Service method instead of hand-rolling a slidingWindow. On Python, name tenacity for retries and pybreaker for the breaker.

Step 2: Add a retry that doesn’t make the outage worse

Section titled “Step 2: Add a retry that doesn’t make the outage worse”

Naive retries amplify load during a partial outage — three retries times every client equals a self-inflicted DDoS. The two rules that matter: jittered backoff (so clients don’t retry in lockstep) and a strict retryable-error predicate (never retry a 400 or a non-idempotent POST you can’t dedupe). Reach for cockatiel or p-retry on Node rather than a bespoke loop. This step is identical across all three tools — it is a single localized edit — so pick whichever you have open.

The “wrap the breaker inside the retry” detail matters and the agent will skip it unless told: retries should count toward the breaker’s failure rate, so a sustained outage opens the breaker instead of being masked by infinite local retries.

Step 3: Layer a fallback chain for graceful degradation

Section titled “Step 3: Layer a fallback chain for graceful degradation”

For reads, a fallback chain keeps the feature alive when the primary source is down: primary DB then read replica then cache (flagged stale) then a safe default. The opinionated point: each tier must be meaningfully degraded and labeled, or you ship silent staleness. Keep this as a small ordered list of async thunks — not a FallbackChain<T> framework.

The escape-hatch in that prompt — rethrow NotFound instead of falling through — is the line that separates resilience from a bug factory. A fallback that swallows a legitimate 404 and serves a cached profile is worse than the outage.

Step 4: Use the Sentry MCP so the agent reasons over the real failure

Section titled “Step 4: Use the Sentry MCP so the agent reasons over the real failure”

Up to here the agent is guessing which dependency is flaky. Wire the Sentry MCP and it reads the actual production issue — stack trace, frequency, the failing span — before proposing a breaker. That is the difference between “add resilience somewhere” and “circuit-break this call because it threw ETIMEDOUT 1,182 times in the last hour.”

MCP setup is nearly identical across the three tools — all three speak MCP, and Sentry is a hosted HTTP server, so there is no package to install:

Add to .cursor/mcp.json (or the global ~/.cursor/mcp.json), then approve it in Settings → MCP:

{
"mcpServers": {
"sentry": { "url": "https://mcp.sentry.dev/mcp" }
}
}

With it connected, the before/after is stark. Before: “Add a circuit breaker to the checkout flow” produces a breaker on whatever call the agent guesses is risky. After:

If a persistent connection feels heavy for a one-off triage, a single-purpose Agent Skill can be lighter than a full MCP server. Browse skills.sh and install with the universal CLI — npx skills add <owner/repo> (from vercel-labs/skills) — which works across Cursor, Claude Code, and Codex. Use a skill when you want a focused, repeatable augmentation; use the MCP when you want the agent to hold a live, queryable connection to your incident data.

A circuit breaker you never tripped in a test is a circuit breaker you are debugging during an incident. Inject the failure and assert the recovery — this is the step generic guides skip.

  1. Stand up a stub for the dependency you wrapped (e.g. nock for HTTP, or a mock of paymentService) that you can flip to failing on demand.

  2. Drive enough failing calls to cross errorThresholdPercentage, then assert the breaker is open and the fallback ran — not just that an error was thrown.

  3. Advance time past resetTimeout, send one success, and assert the breaker closes (the half-open probe). Use fake timers so the test runs in milliseconds.

For full system-level chaos beyond a unit test — killing a pod, adding latency at the network layer — generate a k6 load script or a Toxiproxy config with the same tool, but keep the breaker’s own behavior covered by the fast unit test above so it runs on every PR.

AI-generated resilience code is dangerous precisely because it looks correct. These are the failure modes to check before you ship:

  • Debugging Patterns — find the root cause that justifies a breaker in the first place
  • Logging Patterns — emit the structured, correlated logs that make a breaker’s state changes queryable
  • Monitoring Patterns — turn a tripped breaker into an alert that actually pages someone