AI-Powered Disaster Recovery

Your primary region just went dark mid-deploy. The on-call channel is on fire, the status page is still green because nobody updated it, and leadership wants an ETA you do not have. Your runbook is a Confluence page last edited fourteen months ago that references a Postgres host that was decommissioned in the spring.

Disaster recovery is the one area where “we will write the docs later” quietly becomes “we lost six hours and a chunk of customer data.” The good news: the slow, tedious parts of DR — drafting precise PITR runbooks, writing chaos experiments that actually assert your RPO, and turning a messy incident timeline into a blameless post-mortem — are exactly what AI coding tools are good at, as long as you anchor them to real tools and keep a human at the approval gate.

What you’ll walk away with

A copy-paste prompt that turns your stack into a Postgres point-in-time-recovery runbook with explicit RTO/RPO targets and WAL-chain verification
A Chaos Mesh Schedule that kills your primary DB pod every week, plus a pass/fail check of the failover against a 15-minute RTO and a 5-minute RPO, generated rather than hand-typed from memory
A three-tool workflow (Cursor / Claude Code / Codex) for keeping DR artifacts in version control and re-validating them in CI
A prompt that drafts a blameless post-mortem from your real logs and timeline
A short list of failure modes that silently break DR plans, and how to catch them before the disaster does

Defining recovery objectives first

Every DR artifact downstream depends on two numbers per service tier: RTO (how long until service is restored) and RPO (how much data you can afford to lose). Write them down explicitly before you generate anything — a runbook that targets “fast recovery” is useless; one that targets “RTO 15 min, RPO 5 min for payments” is testable.

A realistic starting matrix looks like this:

Tier	Example services	RTO	RPO	Backup approach
Critical	auth, payments	15 min	0–5 min	Synchronous/streaming replica + WAL archiving
High	core API	1 hour	5 min	WAL archiving + frequent base backups
Medium	secondary features	4 hours	1 hour	Hourly snapshots
Low	internal tooling	24 hours	24 hours	Daily backups
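One way to keep these targets testable is to store them as data that drills and CI can compare against. The sketch below is a minimal illustration, not a prescribed format: the `Objective` class, the `OBJECTIVES` mapping, and the `misses` helper are hypothetical names, and the Critical tier uses the upper bound (5 min) of its RPO range.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objective:
    rto: timedelta  # maximum tolerated time to restore service
    rpo: timedelta  # maximum tolerated window of lost data


# Targets copied from the matrix above; names and structure are illustrative.
OBJECTIVES = {
    "critical": Objective(rto=timedelta(minutes=15), rpo=timedelta(minutes=5)),
    "high": Objective(rto=timedelta(hours=1), rpo=timedelta(minutes=5)),
    "medium": Objective(rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    "low": Objective(rto=timedelta(hours=24), rpo=timedelta(hours=24)),
}


def misses(tier: str, measured_rto: timedelta, measured_rpo: timedelta) -> list[str]:
    """Return a description of each objective that a drill or incident missed."""
    target = OBJECTIVES[tier]
    missed = []
    if measured_rto > target.rto:
        missed.append(f"RTO {measured_rto} exceeds target {target.rto}")
    if measured_rpo > target.rpo:
        missed.append(f"RPO {measured_rpo} exceeds target {target.rpo}")
    return missed
```

An empty list means the measurement met both targets for that tier; anything else is a ready-made ticket description.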

The workflow: generate a real PITR runbook

The single most valuable DR artifact for most teams is a Postgres point-in-time-recovery runbook that an on-call engineer can follow at 3 AM without thinking. The pattern that actually works in production is WAL archiving plus an external base-backup tool — pgBackRest or WAL-G — not a server-side plpgsql function. PITR restores set restore_command and recovery_target_time and rely on an unbroken WAL chain between the base backup and the target time.
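To make those settings concrete, here is a minimal sketch that renders them as postgresql.auto.conf lines. The function name `recovery_settings` is hypothetical, and the stanza name is whatever your config defines. In practice pgBackRest's restore command can write these settings for you, so treat this as a reference for what a reviewed runbook should end up producing rather than a replacement for the tool.

```python
from datetime import datetime, timezone


def recovery_settings(stanza: str, target: datetime) -> str:
    """Render the three PITR settings as postgresql.auto.conf lines."""
    if target.tzinfo is None:
        # A naive timestamp is ambiguous during an incident; refuse it.
        raise ValueError("target time must be timezone-aware")
    stamp = target.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S+00")
    return "\n".join([
        # archive-get fetches each WAL segment Postgres asks for during replay.
        f"restore_command = 'pgbackrest --stanza={stanza} archive-get %f \"%p\"'",
        f"recovery_target_time = '{stamp}'",
        "recovery_target_action = 'promote'",
    ])
```

Normalizing the target to UTC with an explicit offset avoids the classic mistake of restoring to the right wall-clock time in the wrong time zone.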

Have your AI tool write the runbook against your actual config rather than a generic template:

Copy-paste prompt — generate a Postgres PITR runbook:

Write a point-in-time-recovery runbook for our Postgres 16 primary,
backed up with pgBackRest to S3 (stanza name: prod-main).

Targets: RTO 15 minutes, RPO 5 minutes.

The runbook must:
1. List exact pgBackRest commands to find the latest full backup whose
   stop time is at or before a given target timestamp.
2. Verify the WAL chain is intact from that base backup to the target
   time, and explicitly fail closed if there is a gap.
3. Show the recovery.signal + postgresql.auto.conf settings
   (restore_command, recovery_target_time, recovery_target_action=promote).
4. Include a post-restore verification step: row counts on the three
   highest-value tables and a check that the latest archived WAL segment
   is replayed.
5. Be a numbered checklist a sleep-deprived on-call can follow, with the
   one irreversible step (promote) called out and gated behind explicit
   confirmation.

Output as Markdown. Do not invent flags — use only real pgBackRest and
Postgres options.

Read what it produces critically. The two things AI most often gets wrong here are inventing pgBackRest flags and glossing over the WAL-chain gap check, which is the exact failure that strands you mid-recovery. If a command looks unfamiliar, check it against pgBackRest's built-in help (for example, pgbackrest help restore) before it goes in the runbook.
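It helps to know what a real gap check looks like so you can judge the one the tool writes. The sketch below looks for missing segments in a list of archived WAL file names. It assumes the default 16 MB wal_segment_size (256 segments per log file), a single timeline, and names already stripped of any compression or checksum suffix your archive tool adds. The function names are hypothetical.

```python
import re

SEGMENTS_PER_LOG = 0x100  # 256 segments per log file at the default 16 MB segment size
WAL_NAME = re.compile(r"^[0-9A-F]{24}$")


def segment_number(name: str) -> tuple[int, int]:
    """Split a WAL file name into (timeline, absolute segment number)."""
    if not WAL_NAME.match(name):
        raise ValueError(f"not a WAL segment name: {name!r}")
    timeline = int(name[0:8], 16)
    log = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log * SEGMENTS_PER_LOG + seg


def find_gaps(names: list[str]) -> list[tuple[str, str]]:
    """Return (before, after) pairs between which one or more segments are missing.

    Assumes a single timeline; a timeline switch needs the .history file
    and is out of scope for this sketch.
    """
    ordered = sorted(set(names))  # fixed-width uppercase hex sorts numerically
    gaps = []
    for before, after in zip(ordered, ordered[1:]):
        tl_a, n_a = segment_number(before)
        tl_b, n_b = segment_number(after)
        if tl_a != tl_b:
            raise ValueError("multiple timelines present; check .history files")
        if n_b != n_a + 1:
            gaps.append((before, after))
    return gaps
```

An empty result only shows the listed segments are contiguous. The runbook still has to confirm that the first segment covers the base backup's start and that the last one reaches the target time.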

Where each tool fits

The runbook itself is identical regardless of tool — it is a Markdown file in your repo. What differs is how you drive the generation and keep it honest over time.

In Cursor, open the runbook in the editor and use agent mode so it can read your actual pgbackrest.conf, docker-compose.yml, and migration files for ground truth instead of guessing host names and stanza names. Iterate inline: when a step looks wrong, select it and ask Cursor to fix just that step. Use a checkpoint before letting it touch multiple files so you can roll back a bad rewrite in one click.

This is the fastest loop when DR config lives in the same repo you are editing and you want to see diffs as they happen.

With Claude Code, use headless mode to regenerate and validate the runbook as part of a scheduled DR drill or a pre-release check, so it never silently rots:

claude -p "Read infra/pgbackrest.conf and ops/runbooks/pitr.md. \
Verify every pgBackRest flag in the runbook exists in this version, \
and that the stanza name matches the config. List any drift as a \
checklist of fixes. Exit non-zero if the runbook references a host or \
stanza not present in the config." \
  --allowedTools "Read,Grep"

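Wire that into a weekly CI job. A red build means your runbook drifted from reality, which is exactly when you want to find out, not during the outage. Because a read-only headless run reports drift in its text output, and asking it to “exit non-zero” in the prompt does not by itself set the process exit status, have the CI step inspect the output and fail the job when drift is listed. The --allowedTools "Read,Grep" flag keeps the check read-only so it can run unattended.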
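A small deterministic check can run alongside the AI review and catch the most common drift, a stanza name that no longer exists. The sketch below assumes pgbackrest.conf uses its usual INI layout, where stanza sections are any section other than [global] or a [global:command] variant, and that the runbook references stanzas with a --stanza flag. The function names are hypothetical.

```python
import configparser
import re

STANZA_FLAG = re.compile(r"--stanza[= ]([A-Za-z0-9_-]+)")


def configured_stanzas(conf_text: str) -> set[str]:
    """Collect stanza names from the section headers of a pgbackrest.conf."""
    # strict=False tolerates repeated keys; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    stanzas = set()
    for section in parser.sections():
        name = section.split(":", 1)[0]  # "prod-main:restore" -> "prod-main"
        if name != "global":
            stanzas.add(name)
    return stanzas


def unknown_stanzas(runbook_text: str, conf_text: str) -> set[str]:
    """Return stanzas the runbook references that the config does not define."""
    referenced = set(STANZA_FLAG.findall(runbook_text))
    return referenced - configured_stanzas(conf_text)
```

A CI wrapper can read both files, call `unknown_stanzas`, and exit with status 1 when the result is non-empty, which gives the build a reliable red/green signal independent of the model's output.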

Hand the whole task to Codex Cloud: give it the repo and a task like “regenerate ops/runbooks/pitr.md from the current pgBackRest config and open a PR.” Running on GPT-5.6 Sol, it works in an isolated environment against a checked-out copy, so it can grep the real config, regenerate the runbook, and push a branch for review without touching your machine.

For local work, codex in a worktree keeps the DR changes isolated from your feature branches, and its GitHub integration can open the PR for a human to approve.

Backing up the rest of the stack

A database is rarely the whole story. For Kubernetes-hosted workloads, Velero backs up cluster resources and persistent volumes. Prompt your AI assistant to configure Velero rather than letting it invent an internal backup service.

Copy-paste prompt — generate a Velero backup schedule:

Generate a Velero Schedule manifest that backs up the "payments" and
"auth" namespaces every 6 hours, snapshots their PersistentVolumes,
excludes Secrets of type kubernetes.io/service-account-token, and sets
a 30-day TTL. Then give me the exact `velero backup describe` and
`velero restore create` commands to validate a restore into a scratch
namespace. Use only real Velero CRD fields and CLI flags.

For object storage and managed databases, prefer the provider’s native cross-region replication and point-in-time features (for example, RDS automated backups with PITR, or S3 Cross-Region Replication) over rolling your own. Ask your AI tool to generate the Terraform for those, then review the IaC the same way you reviewed the runbook.

Testing it: chaos drills that assert your RPO

An untested DR plan is a hypothesis. The cheapest way to test failover continuously is a scheduled Chaos Mesh experiment that kills your primary DB pod and checks that the system recovers inside your stated objectives. The Schedule CRD (apiVersion: chaos-mesh.org/v1alpha1) runs a PodChaos pod-kill on a cron.

Copy-paste prompt — generate a weekly failover drill:

Write a Chaos Mesh Schedule (apiVersion chaos-mesh.org/v1alpha1) that:
- runs every Saturday at 03:00 (cron "0 3 * * 6")
- uses type "PodChaos" with action pod-kill, mode one
- targets pods with label app=postgresql-primary
- sets concurrencyPolicy Forbid and historyLimit 5

Then write a follow-up runbook (Markdown) describing how I assert the
drill passed: how to confirm the standby was promoted, how to measure
actual recovery time against a 15-minute RTO, and how to measure
replication lag at kill time against a 5-minute RPO. Use only real
Chaos Mesh CRD fields.

The generated manifest is the easy half. The runbook that defines pass/fail against your RTO/RPO is the half that makes the drill meaningful — without it, you are just killing pods and hoping. Treat a missed RTO in a Saturday drill as a P2 ticket, not a curiosity.
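When you review the generated manifest, it helps to have the expected shape in front of you. The sketch below builds it as a Python dict and prints JSON, which kubectl accepts in place of YAML. The resource name and both namespaces are hypothetical placeholders; the remaining values come from the prompt above. Check the field names against the Chaos Mesh CRD reference for the version you run.

```python
import json

# The metadata name and the two namespace values are illustrative assumptions.
failover_drill = {
    "apiVersion": "chaos-mesh.org/v1alpha1",
    "kind": "Schedule",
    "metadata": {"name": "weekly-primary-kill", "namespace": "chaos-testing"},
    "spec": {
        "schedule": "0 3 * * 6",  # Saturdays at 03:00
        "concurrencyPolicy": "Forbid",  # never overlap two drills
        "historyLimit": 5,
        "type": "PodChaos",
        "podChaos": {
            "action": "pod-kill",
            "mode": "one",
            "selector": {
                "namespaces": ["database"],
                "labelSelectors": {"app": "postgresql-primary"},
            },
        },
    },
}

if __name__ == "__main__":
    # Pipe the output to `kubectl apply -f -` in a staging cluster.
    print(json.dumps(failover_drill, indent=2))
```

Diffing the tool's output against a known-good shape like this is a quick way to spot an invented field before it reaches the cluster.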

Run the drill in a staging cluster that mirrors production topology, never against live customer traffic.
Capture the timeline automatically: kill time, promotion time, first successful write to the new primary.
Compare measured RTO/RPO to your targets and file a ticket for any miss.
Feed the drill’s logs straight into the post-mortem prompt below so the gaps turn into action items instead of being forgotten by Monday.
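The comparison in the steps above can be sketched in a few lines. This is a minimal illustration that assumes you already capture the three timestamps and the standby's replay lag at kill time; the function names and the dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta


def measure_drill(kill: datetime, promoted: datetime, first_write: datetime,
                  lag_at_kill: timedelta) -> dict[str, timedelta]:
    """Derive measured objectives from the captured drill timeline."""
    if not kill <= promoted <= first_write:
        raise ValueError("timeline out of order; check clock sync and log parsing")
    return {
        "time_to_promote": promoted - kill,
        "rto": first_write - kill,  # service counts as restored once a write succeeds
        "rpo": lag_at_kill,  # replay lag at kill time bounds the data at risk
    }


def drill_passed(measured: dict[str, timedelta], rto_target: timedelta,
                 rpo_target: timedelta) -> bool:
    """True only when both measured values are within their targets."""
    return measured["rto"] <= rto_target and measured["rpo"] <= rpo_target
```

Measuring RTO to the first successful write, rather than to promotion, keeps the drill honest about connection pools and DNS that still point at the dead primary.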

When the real thing happens: the human-gated playbook

When you are actually mid-incident, AI is best used to generate and validate the next action — never to fire irreversible commands directly from free text. Keep a person at the approval gate for anything that promotes a replica, reroutes DNS, or disables writes.
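If you script any part of the playbook, the approval gate can be enforced in code rather than by convention. The sketch below is one possible shape, not a prescribed tool: `Step`, `run_checklist`, and the `execute` and `approve` callbacks are hypothetical names, and the callbacks are injected so the gate can be tested without touching real infrastructure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    description: str
    command: str
    reversible: bool


def run_checklist(steps: list[Step], execute: Callable[[str], None],
                  approve: Callable[[Step], bool]) -> list[str]:
    """Run steps in order; stop at the first irreversible step that is not approved."""
    done = []
    for step in steps:
        if not step.reversible and not approve(step):
            break  # fail closed: nothing after a declined gate runs
        execute(step.command)
        done.append(step.description)
    return done
```

In an interactive session, `approve` might prompt the operator to type a confirmation word before a promotion or DNS cutover; in a dry run, it can simply return False so only the reversible checks execute.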

Copy-paste prompt — triage a live regional outage:

We have lost us-east-1. Read ops/runbooks/pitr.md and infra/dns.tf.

Produce an ordered recovery checklist to fail payments, auth, and the
core API over to us-west-2, where each step states: the exact command,
whether it is reversible, and what to verify before moving on. Flag
every irreversible step (replica promotion, DNS cutover, disabling
writes in the failed region) as REQUIRES HUMAN APPROVAL. Do not run
anything — output the checklist only.

For a ransomware scenario the same discipline applies: use the tool to find the last clean backup before encryption and to draft network-isolation rules, then have a human execute. A useful, specific Claude Code invocation:

claude -p "Given the pgBackRest backup catalog in this repo's logs/ \
directory and the file-modification timeline in incident/timeline.csv, \
identify the most recent backup whose stop time precedes the first \
encryption event, and explain how you ruled out later backups." \
  --allowedTools "Read,Grep"

Note the claude -p headless form and the read-only tool allowlist: it analyzes and recommends, it does not restore.
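The selection rule itself is simple enough to verify by hand or in code. The sketch below assumes each backup is a dict with a label and a timestamp holding start and stop as epoch seconds, a shape modeled on the backup entries in pgBackRest's JSON info output; confirm the actual structure for your version before relying on it. The function name is hypothetical.

```python
from typing import Optional


def last_clean_backup(backups: list[dict], first_bad_epoch: int) -> Optional[dict]:
    """Pick the newest backup that finished strictly before the first encryption event."""
    # A backup still running when encryption began may contain damaged files,
    # so the filter uses the stop time, not the start time.
    clean = [b for b in backups if b["timestamp"]["stop"] < first_bad_epoch]
    return max(clean, key=lambda b: b["timestamp"]["stop"], default=None)
```

If the chosen backup is incremental or differential, the backups it depends on must also predate the event; since they finished earlier, they pass the same filter, but it is worth confirming they are still present in the repository.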

When this breaks

Even a well-generated DR plan fails in predictable ways. Watch for these:

WAL chain gaps. Bucket lifecycle rules expire segments, or a silently failing archive_command never uploads them, leaving a hole between your base backup and target time. The restore aborts partway. Fix: the runbook must verify continuity, not assume it.
Replication lag exceeds RPO at the worst moment. Under the write spike that often precedes an outage, your standby falls minutes behind. Promoting it loses more data than your RPO allows. Fix: alert on lag against the RPO number, and have the failover step check lag before promoting.
Failover that strands writes. You promote the standby but the old primary is still accepting writes (split-brain), or in-flight writes never replicated. Fix: the playbook’s first step is always to disable writes on the failed primary, before any promotion.
The drill passes, the real thing fails. Staging has 3 nodes; production has 30 and a different topology. Fix: run drills against a cluster that mirrors production scale, and rotate the failure injected.
AI invents flags or fields. A generated pgBackRest flag or Chaos Mesh field that does not exist makes the artifact non-runnable. Fix: always validate generated commands against --help or the CRD reference before committing — treat AI output as a draft, never as gospel.
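Two of the fixes above, checking lag before promoting and fencing the old primary first, fit naturally into a single pre-promotion guard. The sketch below is a minimal illustration with hypothetical names. It assumes you obtain replay lag on the standby yourself, for example from the difference between the current time and pg_last_xact_replay_timestamp(), which can overstate lag when the primary was idle.

```python
from datetime import timedelta


def promotion_blockers(replay_lag: timedelta, rpo: timedelta,
                       old_primary_fenced: bool) -> list[str]:
    """Return the reasons promotion should not proceed; an empty list means go."""
    reasons = []
    if not old_primary_fenced:
        reasons.append("old primary may still accept writes (split-brain risk)")
    if replay_lag > rpo:
        reasons.append(f"replay lag {replay_lag} exceeds RPO {rpo}")
    return reasons
```

A non-empty result does not have to forbid promotion outright. It gives the human at the approval gate an explicit statement of what data loss or split-brain risk they are accepting.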

Closing the loop: the post-mortem

The incident is not over until the learning is captured. AI is genuinely good at turning a messy timeline and a wall of logs into a structured, blameless post-mortem — provided you feed it the real artifacts.

Copy-paste prompt — draft a blameless post-mortem:

Draft a blameless post-mortem from the attached incident timeline and
logs (incident/timeline.csv, logs/incident-2026-06-05/).

Structure: one-paragraph executive summary, a minute-by-minute timeline,
root cause (distinguish trigger from underlying cause), customer impact
(duration, requests failed, estimated data loss), what went well, what
went wrong, and 3-5 concrete action items each with an owner role and a
priority. Keep it blameless: describe systems and decisions, never
individuals. Flag any place where the logs are insufficient to determine
what happened.

Run this with Claude Fable 5 (/model fable) when the causal chain is tangled across services — the reasoning quality is worth the cost on the one document everyone will read; fall back to Opus 5 if budget is a concern. Keep the output in the repo next to the runbook it should improve, so the next drill tests against the lessons from the last incident.

What’s next

Security Compliance — security-focused recovery planning and the ransomware angle
Performance Testing — load-testing DR systems before you rely on them
Monitoring & Observability — the replication-lag and backup-health alerts that catch DR gaps early
Infrastructure as Code — keeping the failover infrastructure reproducible and reviewed