
AI-Powered Disaster Recovery

Your primary region just went dark mid-deploy. The on-call channel is on fire, the status page is still green because nobody updated it, and leadership wants an ETA you do not have. Your runbook is a Confluence page last edited fourteen months ago that references a Postgres host that was decommissioned in the spring.

Disaster recovery is the one area where “we will write the docs later” quietly becomes “we lost six hours and a chunk of customer data.” The good news: the slow, tedious parts of DR — drafting precise PITR runbooks, writing chaos experiments that actually assert your RPO, and turning a messy incident timeline into a blameless post-mortem — are exactly what AI coding tools are good at, as long as you anchor them to real tools and keep a human at the approval gate.

  • A copy-paste prompt that turns your stack into a Postgres point-in-time-recovery runbook with explicit RTO/RPO targets and WAL-chain verification
  • A Chaos Mesh Schedule that kills your primary DB pod every week, paired with a drill runbook that asserts failover within your RTO and data loss within a 5-minute RPO, generated rather than hand-typed from memory
  • A three-tool workflow (Cursor / Claude Code / Codex) for keeping DR artifacts in version control and re-validating them in CI
  • A prompt that drafts a blameless post-mortem from your real logs and timeline
  • A short list of failure modes that silently break DR plans, and how to catch them before the disaster does

Every DR artifact downstream depends on two numbers per service tier: RTO (how long until service is restored) and RPO (how much data you can afford to lose). Write them down explicitly before you generate anything — a runbook that targets “fast recovery” is useless; one that targets “RTO 15 min, RPO 5 min for payments” is testable.

A realistic starting matrix looks like this:

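| Tier     | Example services   | RTO      | RPO      | Backup approach                               |
|----------|--------------------|----------|----------|-----------------------------------------------|
| Critical | auth, payments     | 15 min   | 0–5 min  | Synchronous/streaming replica + WAL archiving |
| High     | core API           | 1 hour   | 5 min    | WAL archiving + frequent base backups         |
| Medium   | secondary features | 4 hours  | 1 hour   | Hourly snapshots                              |
| Low      | internal tooling   | 24 hours | 24 hours | Daily backups                                 |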
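Keeping the matrix in code makes the targets something a drill or CI job can read instead of a number buried in a wiki. The following is a minimal sketch, not a prescribed format: the `Objective` class, the `SERVICE_TIER` mapping, and the service names are illustrative assumptions, and the critical tier uses the upper bound of its 0–5 min RPO range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta  # maximum time until service is restored
    rpo: timedelta  # maximum tolerated window of lost data


# Targets copied from the matrix above (critical RPO = upper bound of 0-5 min).
TIERS = {
    "critical": Objective(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "high": Objective(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    "medium": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "low": Objective(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}

# Hypothetical service-to-tier mapping; replace with your own inventory.
SERVICE_TIER = {
    "auth": "critical",
    "payments": "critical",
    "core-api": "high",
}


def objective_for(service: str) -> Objective:
    """Look up the RTO/RPO targets for a service; raises KeyError if unmapped."""
    return TIERS[SERVICE_TIER[service]]


if __name__ == "__main__":
    print(objective_for("payments"))
```

An unmapped service raising `KeyError` is deliberate in this sketch: a service with no tier has no testable objective.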

The workflow: generate a real PITR runbook


The single most valuable DR artifact for most teams is a Postgres point-in-time-recovery runbook that an on-call engineer can follow at 3 AM without thinking. The pattern that actually works in production is WAL archiving plus an external base-backup tool such as pgBackRest or WAL-G, not a server-side plpgsql function. A PITR restore sets restore_command and recovery_target_time and relies on an unbroken WAL chain between the base backup and the target time.
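To make those two settings concrete, here is a small sketch that renders the recovery settings a PITR restore depends on. It assumes pgBackRest as the archive tool and a stanza named `main` (both placeholders). In practice pgBackRest usually writes these settings itself during a time-targeted restore, so treat this as a way to see what the runbook should expect to find, not as a replacement for the tool.

```python
def recovery_settings(stanza: str, target_time: str) -> str:
    """Render PostgreSQL recovery settings for a time-targeted restore.

    stanza      -- pgBackRest stanza name (assumption: you use pgBackRest)
    target_time -- timestamp string PostgreSQL can parse, e.g. '2024-01-06 02:55:00+00'
    """
    lines = [
        # Fetch each archived WAL segment that recovery asks for.
        f"restore_command = 'pgbackrest --stanza={stanza} archive-get %f \"%p\"'",
        # Stop replaying WAL once this point in time is reached.
        f"recovery_target_time = '{target_time}'",
        # Leave recovery and accept writes once the target is reached.
        "recovery_target_action = 'promote'",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(recovery_settings("main", "2024-01-06 02:55:00+00"), end="")
```

The stanza name and timestamp here are hypothetical; the runbook should take both from your real configuration and incident timeline.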

Have your AI tool write the runbook against your actual config rather than a generic template:

Read what it produces critically. The two things AI most often gets wrong here are inventing pgBackRest flags and glossing over the WAL-chain gap check — the exact failure that strands you mid-recovery. If a command looks unfamiliar, check it against pgbackrest --help before it goes in the runbook.
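The WAL-chain gap check can be sketched independently of any backup tool. The example below assumes you can list bare 24-character WAL segment names from your archive, that all segments are on a single timeline, and that the cluster uses the default 16 MB `wal_segment_size`; none of these are guaranteed for your setup, and archive tools often add suffixes that you would need to strip first.

```python
import re

# With the default 16 MB segment size, each "log" part holds 0x100 segments.
SEGMENTS_PER_LOG = 0x100
_WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name: str) -> int:
    """Convert a WAL file name (timeline + log + segment, 8 hex digits each)
    into a single increasing integer, ignoring the timeline."""
    if not _WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return log * SEGMENTS_PER_LOG + seg


def find_gaps(names):
    """Return (first_missing, last_missing) ranges between archived segments."""
    numbers = sorted({segment_number(n) for n in names})
    gaps = []
    for prev, cur in zip(numbers, numbers[1:]):
        if cur != prev + 1:
            gaps.append((prev + 1, cur - 1))
    return gaps


if __name__ == "__main__":
    archived = [
        "000000010000000000000001",
        "000000010000000000000002",
        "000000010000000000000004",  # segment 3 is missing
    ]
    print(find_gaps(archived))
```

An empty result only says the listed segments are contiguous; the runbook still needs to confirm that the range covers everything from the base backup to the recovery target.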

The runbook itself is identical regardless of tool — it is a Markdown file in your repo. What differs is how you drive the generation and keep it honest over time.

Open the runbook in the editor and use agent mode so Cursor can read your actual pgbackrest.conf, docker-compose.yml, and migration files for ground truth instead of guessing host names and stanza names. Iterate inline: when a step looks wrong, select it and ask Cursor to fix just that step. Use a checkpoint before letting it touch multiple files so you can roll back a bad rewrite in one click.

This is the fastest loop when DR config lives in the same repo you are editing and you want to see diffs as they happen.

A database is rarely the whole story. For Kubernetes-hosted workloads, Velero backs up cluster resources and persistent volumes. It is the standard tool for this job, so prompt your AI assistant to configure Velero rather than let it invent an internal backup service.

For object storage and managed databases, prefer the provider’s native cross-region replication and point-in-time features (for example, RDS automated backups with PITR, or S3 Cross-Region Replication) over rolling your own. Ask your AI tool to generate the Terraform for those, then review the IaC the same way you reviewed the runbook.

Testing it: chaos drills that assert your RPO


An untested DR plan is a hypothesis. The cheapest way to test failover continuously is a scheduled Chaos Mesh experiment that kills your primary DB pod and checks that the system recovers inside your stated objectives. The Schedule CRD (apiVersion: chaos-mesh.org/v1alpha1) runs a PodChaos pod-kill on a cron.
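As a rough illustration of what such a Schedule contains, the sketch below builds the manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The resource name, namespaces, labels, and the Saturday 03:00 cron expression are all assumptions to replace with your own values, and every field should be checked against the Chaos Mesh CRD reference for the version you run.

```python
import json


def pod_kill_schedule(name, namespace, target_namespace, labels, cron):
    """Build a Chaos Mesh Schedule that runs a PodChaos pod-kill on a cron."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "schedule": cron,
            # Do not start a new run while a previous one is still active.
            "concurrencyPolicy": "Forbid",
            "historyLimit": 5,
            "type": "PodChaos",
            "podChaos": {
                "action": "pod-kill",
                # Kill exactly one of the pods matched by the selector.
                "mode": "one",
                "selector": {
                    "namespaces": [target_namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


# Hypothetical names and labels; adjust to how your primary pod is labelled.
manifest = pod_kill_schedule(
    name="weekly-primary-db-kill",
    namespace="chaos-mesh",
    target_namespace="staging-db",
    labels={"app": "postgres", "role": "primary"},
    cron="0 3 * * 6",  # 03:00 every Saturday
)

if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
```

Note that this manifest only injects the failure. It does not measure recovery or assert anything about RTO or RPO; that judgment has to come from the drill's own checks.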

The generated manifest is the easy half. The runbook that defines pass/fail against your RTO/RPO is the half that makes the drill meaningful — without it, you are just killing pods and hoping. Treat a missed RTO in a Saturday drill as a P2 ticket, not a curiosity.

  1. Run the drill in a staging cluster that mirrors production topology, never against live customer traffic.

  2. Capture the timeline automatically: kill time, promotion time, first successful write to the new primary.

  3. Compare measured RTO/RPO to your targets and file a ticket for any miss.

  4. Feed the drill’s logs straight into the post-mortem prompt below so the gaps turn into action items instead of being forgotten by Monday.
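The comparison in step 3 can be sketched as a small function over the captured timeline. This assumes you also record the commit time of the last write that reached the standby before the kill (`last_replicated_commit`, a hypothetical field the list above does not mention), since the RPO cannot be measured from kill and promotion times alone.

```python
from datetime import datetime, timedelta


def evaluate_drill(kill_time, first_write_time, last_replicated_commit,
                   rto_target, rpo_target):
    """Compare measured RTO/RPO from one drill against the stated targets.

    RTO is measured from the kill to the first successful write on the new
    primary. RPO is measured as the window of writes before the kill that
    never reached the standby.
    """
    measured_rto = first_write_time - kill_time
    measured_rpo = max(kill_time - last_replicated_commit, timedelta(0))
    return {
        "rto": measured_rto,
        "rpo": measured_rpo,
        "rto_ok": measured_rto <= rto_target,
        "rpo_ok": measured_rpo <= rpo_target,
    }


if __name__ == "__main__":
    kill = datetime(2024, 1, 6, 3, 0, 0)  # illustrative timestamps only
    result = evaluate_drill(
        kill_time=kill,
        first_write_time=kill + timedelta(minutes=4),
        last_replicated_commit=kill - timedelta(seconds=20),
        rto_target=timedelta(minutes=15),
        rpo_target=timedelta(minutes=5),
    )
    print(result)
```

Any `False` in the result is the signal to file the ticket described in step 3.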

When the real thing happens: the human-gated playbook


When you are actually mid-incident, AI is best used to generate and validate the next action — never to fire irreversible commands directly from free text. Keep a person at the approval gate for anything that promotes a replica, reroutes DNS, or disables writes.
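One way to enforce that gate in tooling is to refuse irreversible actions unless a named approver is attached. The following is a minimal sketch under that assumption; the action names, the `ApprovalRequired` exception, and the `execute` wrapper are all hypothetical and not part of any existing tool.

```python
# Actions a human must approve before they run (names are illustrative).
IRREVERSIBLE = {"promote_replica", "reroute_dns", "disable_writes"}


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without an approver."""


def execute(action, run, approved_by=None):
    """Run `run()` for `action`, but only with a named approver if the action
    is irreversible. Returns whatever `run()` returns."""
    if action in IRREVERSIBLE and not approved_by:
        raise ApprovalRequired(f"{action} needs a named human approver")
    return run()


if __name__ == "__main__":
    # Read-only analysis runs freely.
    print(execute("list_backups", lambda: "3 backups found"))
    # Promotion is blocked until a person signs off.
    try:
        execute("promote_replica", lambda: "promoted")
    except ApprovalRequired as exc:
        print(f"blocked: {exc}")
    print(execute("promote_replica", lambda: "promoted", approved_by="on-call lead"))
```

The point of the design is that AI-generated suggestions can flow into `run` freely, while the decision to pull the trigger stays with a person whose name is recorded.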

For a ransomware scenario the same discipline applies: use the tool to find the last clean backup before encryption and to draft network-isolation rules, then have a human execute. A useful, specific Claude Code invocation:

```shell
claude -p "Given the pgBackRest backup catalog in this repo's logs/ \
directory and the file-modification timeline in incident/timeline.csv, \
identify the most recent backup whose stop time precedes the first \
encryption event, and explain how you ruled out later backups." \
--allowedTools "Read,Grep"
```

Note the claude -p headless form and the read-only tool allowlist: it analyzes and recommends, it does not restore.

Even a well-generated DR plan fails in predictable ways. Watch for these:

  • WAL chain gaps. Bucket lifecycle rules expire segments, or a silently failing archive_command never archives them, leaving a hole between your base backup and target time. The restore aborts partway. Fix: the runbook must verify continuity, not assume it.
  • Replication lag exceeds RPO at the worst moment. Under the write spike that often precedes an outage, your standby falls minutes behind. Promoting it loses more data than your RPO allows. Fix: alert on lag against the RPO number, and have the failover step check lag before promoting.
  • Failover that strands writes. You promote the standby but the old primary is still accepting writes (split-brain), or in-flight writes never replicated. Fix: the playbook’s first step is always to disable writes on the failed primary, before promotion.
  • The drill passes, the real thing fails. Staging has 3 nodes; production has 30 and a different topology. Fix: run drills against a cluster that mirrors production scale, and rotate the failure injected.
  • AI invents flags or fields. A generated pgBackRest flag or Chaos Mesh field that does not exist makes the artifact non-runnable. Fix: always validate generated commands against --help or the CRD reference before committing — treat AI output as a draft, never as gospel.
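Two of the fixes above, fencing the old primary first and checking lag against the RPO before promoting, can be combined into a single pre-promotion check. This is a sketch under assumptions: `promotion_decision` is a hypothetical helper, and the lag value is something you would measure yourself, for example from `replay_lag` in `pg_stat_replication` on the primary or from `now() - pg_last_xact_replay_timestamp()` on the standby.

```python
from datetime import timedelta


def promotion_decision(replay_lag, rpo, primary_fenced):
    """Decide whether promoting the standby is safe.

    Returns (ok, reason). Promotion is refused if the old primary can still
    accept writes, or if the standby is further behind than the RPO allows.
    """
    if not primary_fenced:
        return (False, "disable writes on the failed primary before promoting")
    if replay_lag > rpo:
        return (False, f"standby lag {replay_lag} exceeds RPO {rpo}")
    return (True, "lag within RPO and old primary fenced")


if __name__ == "__main__":
    rpo = timedelta(minutes=5)
    print(promotion_decision(timedelta(seconds=30), rpo, primary_fenced=True))
    print(promotion_decision(timedelta(minutes=8), rpo, primary_fenced=True))
    print(promotion_decision(timedelta(seconds=30), rpo, primary_fenced=False))
```

A refusal here is information for the human at the approval gate, not an automatic veto: during a real outage the team may still choose to promote and accept the loss, but they should do so knowingly.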

The incident is not over until the learning is captured. AI is genuinely good at turning a messy timeline and a wall of logs into a structured, blameless post-mortem — provided you feed it the real artifacts.

Run this with Claude Fable 5 (/model fable) when the causal chain is tangled across services — the reasoning quality is worth the cost on the one document everyone will read; fall back to Opus 4.8 if budget is a concern. Keep the output in the repo next to the runbook it should improve, so the next drill tests against the lessons from the last incident.