AI-Powered Disaster Recovery
Your primary region just went dark mid-deploy. The on-call channel is on fire, the status page is still green because nobody updated it, and leadership wants an ETA you do not have. Your runbook is a Confluence page last edited fourteen months ago that references a Postgres host that was decommissioned in the spring.
Disaster recovery is the one area where “we will write the docs later” quietly becomes “we lost six hours and a chunk of customer data.” The good news: the slow, tedious parts of DR — drafting precise PITR runbooks, writing chaos experiments that actually assert your RPO, and turning a messy incident timeline into a blameless post-mortem — are exactly what AI coding tools are good at, as long as you anchor them to real tools and keep a human at the approval gate.
What you’ll walk away with
- A copy-paste prompt that turns your stack into a Postgres point-in-time-recovery runbook with explicit RTO/RPO targets and WAL-chain verification
- A Chaos Mesh `Schedule` that kills your primary DB pod every week and asserts failover under a 5-minute RPO, generated rather than hand-typed from memory
- A three-tool workflow (Cursor / Claude Code / Codex) for keeping DR artifacts in version control and re-validating them in CI
- A prompt that drafts a blameless post-mortem from your real logs and timeline
- A short list of failure modes that silently break DR plans, and how to catch them before the disaster does
Defining recovery objectives first
Every DR artifact downstream depends on two numbers per service tier: RTO (how long until service is restored) and RPO (how much data you can afford to lose). Write them down explicitly before you generate anything: a runbook that targets “fast recovery” is useless; one that targets “RTO 15 min, RPO 5 min for payments” is testable.
A realistic starting matrix looks like this:
| Tier | Example services | RTO | RPO | Backup approach |
|---|---|---|---|---|
| Critical | auth, payments | 15 min | 0–5 min | Synchronous/streaming replica + WAL archiving |
| High | core API | 1 hour | 5 min | WAL archiving + frequent base backups |
| Medium | secondary features | 4 hours | 1 hour | Hourly snapshots |
| Low | internal tooling | 24 hours | 24 hours | Daily backups |
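One way to keep the matrix testable is to store it as data next to the runbook. The following is a minimal sketch, not a prescribed format: the `Objective` class, the `TIERS` mapping, and `meets_objectives` are hypothetical names, and the Critical tier uses the upper bound of its 0–5 min RPO range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    """Recovery targets for one service tier (hypothetical structure)."""
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of lost data


# Values copied from the matrix above; Critical uses the upper bound of 0-5 min.
TIERS = {
    "critical": Objective(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "high": Objective(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    "medium": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "low": Objective(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def meets_objectives(tier: str, measured_rto: timedelta, measured_rpo: timedelta) -> bool:
    """Return True only if both measured values are within the tier's targets."""
    target = TIERS[tier]
    return measured_rto <= target.rto and measured_rpo <= target.rpo
```

Keeping the targets in one importable place means a drill script and a CI check can read the same numbers instead of each hard-coding their own.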
The workflow: generate a real PITR runbook
The single most valuable DR artifact for most teams is a Postgres point-in-time-recovery runbook that an on-call engineer can follow at 3 AM without thinking. The pattern that actually works in production is WAL archiving plus an external base-backup tool (pgBackRest or WAL-G), not a server-side plpgsql function. PITR restores set restore_command and recovery_target_time and rely on an unbroken WAL chain between the base backup and the target time.
Have your AI tool write the runbook against your actual config rather than a generic template:
Read what it produces critically. The two things AI most often gets wrong here are inventing pgBackRest flags and glossing over the WAL-chain gap check — the exact failure that strands you mid-recovery. If a command looks unfamiliar, check it against pgbackrest --help before it goes in the runbook.
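The WAL-chain gap check can also be done deterministically rather than trusted to generated prose. The sketch below assumes you have already extracted bare 24-hex-digit segment names from your archive listing (pgBackRest and WAL-G add suffixes that you would strip first), that all segments sit on a single timeline, and that the cluster uses the default 16 MiB segment size. `find_wal_gaps` and `segment_number` are hypothetical helper names.

```python
import re

# Assumes the default 16 MiB wal_segment_size, which gives 256 segments per "log" number.
SEGMENTS_PER_LOG = 256
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name: str) -> int:
    """Convert a 24-hex-digit WAL file name into a linear segment counter."""
    log = int(name[8:16], 16)   # middle 8 digits: log number
    seg = int(name[16:24], 16)  # last 8 digits: segment within that log
    return log * SEGMENTS_PER_LOG + seg


def find_wal_gaps(names: list[str]) -> list[tuple[str, str]]:
    """Return (before, after) pairs of archived segments with a hole between them.

    Only plain segment files are considered; history, backup-label, and
    partial files are ignored because they do not match WAL_NAME.
    """
    segments = sorted(n for n in names if WAL_NAME.match(n))
    timelines = {n[:8] for n in segments}
    if len(timelines) > 1:
        raise ValueError("multiple timelines present; check each one separately")
    gaps = []
    for before, after in zip(segments, segments[1:]):
        if segment_number(after) - segment_number(before) > 1:
            gaps.append((before, after))
    return gaps
```

An empty result only says the listed segments are contiguous; the runbook still has to confirm that the range covers everything from the base backup to the recovery target.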
Where each tool fits
The runbook itself is identical regardless of tool: it is a Markdown file in your repo. What differs is how you drive the generation and keep it honest over time.
In Cursor, open the runbook in the editor and use agent mode so it can read your actual pgbackrest.conf, docker-compose.yml, and migration files for ground truth instead of guessing host names and stanza names. Iterate inline: when a step looks wrong, select it and ask Cursor to fix just that step. Use a checkpoint before letting it touch multiple files so you can roll back a bad rewrite in one click.
This is the fastest loop when DR config lives in the same repo you are editing and you want to see diffs as they happen.
With Claude Code, use headless mode to regenerate and validate the runbook as part of a scheduled DR drill or a pre-release check, so it never silently rots:
    claude -p "Read infra/pgbackrest.conf and ops/runbooks/pitr.md. \
    Verify every pgBackRest flag in the runbook exists in this version, \
    and that the stanza name matches the config. List any drift as a \
    checklist of fixes. Exit non-zero if the runbook references a host or \
    stanza not present in the config." \
      --allowedTools "Read,Grep"

Wire that into a weekly CI job. A red build means your runbook drifted from reality, which is exactly when you want to find out, not during the outage. The --allowedTools "Read,Grep" flag keeps the check read-only so it can run unattended.
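Part of that drift check is mechanical and can run without a model at all. As a rough sketch, assuming the runbook spells stanzas as `--stanza=NAME` and that pgbackrest.conf uses its usual INI-style `[stanza]` and `[stanza:command]` section headers, the script below reports stanzas the runbook mentions but the config does not define. The function names are hypothetical.

```python
import configparser
import re

# Matches "--stanza=name" or "--stanza name" in runbook commands.
STANZA_FLAG = re.compile(r"--stanza[= ]([A-Za-z0-9_-]+)")


def configured_stanzas(conf_text: str) -> set[str]:
    """Collect stanza names from pgbackrest.conf section headers.

    [global] and [global:command] are not stanzas; a section such as
    [main:backup] belongs to stanza "main".
    """
    parser = configparser.ConfigParser(
        interpolation=None, strict=False, allow_no_value=True
    )
    parser.read_string(conf_text)
    names = {section.split(":", 1)[0] for section in parser.sections()}
    names.discard("global")
    return names


def undefined_stanzas(runbook_text: str, conf_text: str) -> set[str]:
    """Return stanza names the runbook uses that the config does not define."""
    return set(STANZA_FLAG.findall(runbook_text)) - configured_stanzas(conf_text)
```

In a CI job you could read both files, call `undefined_stanzas`, and exit non-zero when the result is non-empty, leaving the fuzzier flag-by-flag review to the headless prompt.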
Hand the whole task to Codex Cloud: give it the repo and a task like “regenerate ops/runbooks/pitr.md from the current pgBackRest config and open a PR.” Running on GPT-5.5, it works in an isolated environment against a checked-out copy, so it can grep the real config, regenerate the runbook, and push a branch for review without touching your machine.
For local work, codex in a worktree keeps the DR changes isolated from your feature branches, and its GitHub integration can open the PR for a human to approve.
Backing up the rest of the stack
A database is rarely the whole story. For Kubernetes-hosted workloads, Velero backs up cluster resources and persistent volumes and is the standard tool to prompt your AI assistant to configure, rather than an invented internal backup service.
For object storage and managed databases, prefer the provider’s native cross-region replication and point-in-time features (for example, RDS automated backups with PITR, or S3 Cross-Region Replication) over rolling your own. Ask your AI tool to generate the Terraform for those, then review the IaC the same way you reviewed the runbook.
Testing it: chaos drills that assert your RPO
An untested DR plan is a hypothesis. The cheapest way to test failover continuously is a scheduled Chaos Mesh experiment that kills your primary DB pod and checks that the system recovers inside your stated objectives. The Schedule CRD (apiVersion: chaos-mesh.org/v1alpha1) runs a PodChaos pod-kill on a cron.
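For orientation, here is a minimal sketch of what such a Schedule could look like, built as a Python dict and printed as JSON (which `kubectl apply -f` accepts alongside YAML). The resource name, namespace, label, and cron expression are illustrative assumptions, and the field names should be checked against the Chaos Mesh CRD reference for your installed version before use.

```python
import json


def primary_kill_schedule(namespace: str, labels: dict[str, str], cron: str) -> dict:
    """Build a Chaos Mesh Schedule that runs a PodChaos pod-kill on a cron."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "Schedule",
        "metadata": {"name": "weekly-primary-db-kill", "namespace": namespace},
        "spec": {
            "schedule": cron,
            "type": "PodChaos",
            "historyLimit": 4,
            "concurrencyPolicy": "Forbid",  # do not start a run while one is active
            "podChaos": {
                "action": "pod-kill",
                "mode": "one",  # kill a single matching pod
                "selector": {
                    "namespaces": [namespace],
                    "labelSelectors": labels,
                },
            },
        },
    }


manifest = primary_kill_schedule(
    namespace="staging-db",      # hypothetical namespace
    labels={"role": "primary"},  # hypothetical label on the primary DB pod
    cron="0 3 * * 6",            # Saturdays at 03:00
)
print(json.dumps(manifest, indent=2))
```

The manifest only injects the failure. It does not measure recovery, so the pass/fail criteria still have to live in the drill runbook.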
The generated manifest is the easy half. The runbook that defines pass/fail against your RTO/RPO is the half that makes the drill meaningful — without it, you are just killing pods and hoping. Treat a missed RTO in a Saturday drill as a P2 ticket, not a curiosity.
1. Run the drill in a staging cluster that mirrors production topology, never against live customer traffic.
2. Capture the timeline automatically: kill time, promotion time, first successful write to the new primary.
3. Compare measured RTO/RPO to your targets and file a ticket for any miss.
4. Feed the drill’s logs straight into the post-mortem prompt below so the gaps turn into action items instead of being forgotten by Monday.
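The comparison step can be reduced to a few lines once the timestamps are captured. This is a sketch under stated assumptions: RTO is measured from kill time to the first successful write on the new primary, and RPO is measured as the gap between the last commit known to have reached the standby and the kill time. `measure_drill`, `misses`, and the `last_replicated_commit` input are hypothetical names.

```python
from datetime import datetime, timedelta


def measure_drill(kill: datetime, first_write: datetime,
                  last_replicated_commit: datetime) -> dict:
    """Derive measured RTO and RPO from three drill timestamps."""
    return {
        "rto": first_write - kill,
        # Clamp at zero in case the standby was fully caught up at kill time.
        "rpo": max(kill - last_replicated_commit, timedelta(0)),
    }


def misses(measured: dict, rto_target: timedelta, rpo_target: timedelta) -> list[str]:
    """List which objectives were missed, ready to paste into a ticket."""
    found = []
    if measured["rto"] > rto_target:
        found.append(f"RTO missed: {measured['rto']} > {rto_target}")
    if measured["rpo"] > rpo_target:
        found.append(f"RPO missed: {measured['rpo']} > {rpo_target}")
    return found
```

An empty list from `misses` means the drill met both targets; anything else is the text of the ticket.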
When the real thing happens: the human-gated playbook
When you are actually mid-incident, AI is best used to generate and validate the next action, never to fire irreversible commands directly from free text. Keep a person at the approval gate for anything that promotes a replica, reroutes DNS, or disables writes.
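If you wrap incident actions in tooling, the approval gate can be enforced in code rather than by convention. The sketch below is illustrative only: the action names, the `IRREVERSIBLE` set, and the `gate` function are hypothetical and would map onto whatever your own automation calls these steps.

```python
from typing import Optional

# Actions named in the playbook as needing a human at the approval gate.
IRREVERSIBLE = {"promote_replica", "reroute_dns", "disable_writes"}


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without a named approver."""


def gate(action: str, approved_by: Optional[str] = None) -> str:
    """Return an audit line if the action may proceed, otherwise raise."""
    if action in IRREVERSIBLE and not approved_by:
        raise ApprovalRequired(f"{action} needs explicit human approval")
    who = approved_by or "no approval needed"
    return f"{action}: {who}"
```

Calling `gate("promote_replica")` without an approver raises, while `gate("promote_replica", approved_by="oncall-lead")` returns a line you can append to the incident log.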
For a ransomware scenario the same discipline applies: use the tool to find the last clean backup before encryption and to draft network-isolation rules, then have a human execute. A useful, specific Claude Code invocation:
    claude -p "Given the pgBackRest backup catalog in this repo's logs/ \
    directory and the file-modification timeline in incident/timeline.csv, \
    identify the most recent backup whose stop time precedes the first \
    encryption event, and explain how you ruled out later backups." \
      --allowedTools "Read,Grep"

Note the claude -p headless form and the read-only tool allowlist: it analyzes and recommends, it does not restore.
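The selection rule in that prompt is simple enough to cross-check by hand or in a few lines of code. The sketch below assumes you have already parsed the catalog into records with a `stop` timestamp; the labels, timestamps, and the `last_clean_backup` helper are made up for illustration.

```python
from datetime import datetime
from typing import Optional


def last_clean_backup(backups: list[dict], first_encryption: datetime) -> Optional[dict]:
    """Return the newest backup that finished strictly before the first encryption event."""
    clean = [b for b in backups if b["stop"] < first_encryption]
    # max() with default=None avoids an exception when every backup is suspect.
    return max(clean, key=lambda b: b["stop"], default=None)


# Hypothetical catalog entries and incident time, for illustration only.
catalog = [
    {"label": "full-sunday", "stop": datetime(2024, 3, 3, 2, 10)},
    {"label": "diff-monday", "stop": datetime(2024, 3, 4, 2, 5)},
    {"label": "diff-tuesday", "stop": datetime(2024, 3, 5, 2, 7)},
]
first_event = datetime(2024, 3, 4, 17, 30)
print(last_clean_backup(catalog, first_event)["label"])  # diff-monday
```

A stop time before the first encryption event is necessary but not sufficient: the attacker may have been present earlier, so a human still decides which backup to trust.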
When this breaks
Even a well-generated DR plan fails in predictable ways. Watch for these:
- WAL chain gaps. Bucket lifecycle rules expire segments, or a silently failing `archive_command` never archives them, leaving a hole between your base backup and target time. The restore aborts partway. Fix: the runbook must verify continuity, not assume it.
- Replication lag exceeds RPO at the worst moment. Under the write spike that often precedes an outage, your standby falls minutes behind. Promoting it loses more data than your RPO allows. Fix: alert on lag against the RPO number, and have the failover step check lag before promoting.
- Failover that strands writes. You promote the standby but the old primary is still accepting writes (split-brain), or in-flight writes never replicated. Fix: the playbook’s first step is always to disable writes on the failed primary, before promotion.
- The drill passes, the real thing fails. Staging has 3 nodes; production has 30 and a different topology. Fix: run drills against a cluster that mirrors production scale, and rotate the failure injected.
- AI invents flags or fields. A generated `pgBackRest` flag or Chaos Mesh field that does not exist makes the artifact non-runnable. Fix: always validate generated commands against `--help` or the CRD reference before committing, and treat AI output as a draft, never as gospel.
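Two of these fixes, checking lag against the RPO and fencing the old primary first, lend themselves to a pre-promotion check. The following is a minimal sketch with hypothetical inputs: how you obtain the lag figure and the writable flag depends on your monitoring and failover tooling.

```python
def promotion_blockers(lag_seconds: float, rpo_seconds: float,
                       old_primary_writable: bool) -> list[str]:
    """Return reasons a standby should not be promoted yet; empty means go ahead."""
    blockers = []
    if old_primary_writable:
        # Promoting now risks split-brain: two nodes accepting writes.
        blockers.append("old primary still accepts writes; fence it first")
    if lag_seconds > rpo_seconds:
        # Promoting now would lose more data than the stated RPO allows.
        blockers.append(
            f"replication lag {lag_seconds:.0f}s exceeds RPO {rpo_seconds:.0f}s"
        )
    return blockers
```

The list is meant to inform the person at the approval gate, not to replace them: if the primary is unrecoverable, promoting despite the lag may still be the right call.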
Closing the loop: the post-mortem
The incident is not over until the learning is captured. AI is genuinely good at turning a messy timeline and a wall of logs into a structured, blameless post-mortem, provided you feed it the real artifacts.
Run this with Claude Fable 5 (/model fable) when the causal chain is tangled across services — the reasoning quality is worth the cost on the one document everyone will read; fall back to Opus 4.8 if budget is a concern. Keep the output in the repo next to the runbook it should improve, so the next drill tests against the lessons from the last incident.
What’s next
- Security Compliance: security-focused recovery planning and the ransomware angle
- Performance Testing: load-testing DR systems before you rely on them
- Monitoring & Observability: the replication-lag and backup-health alerts that catch DR gaps early
- Infrastructure as Code: keeping the failover infrastructure reproducible and reviewed