Your payment service degrades silently when PostgreSQL gets slow under load. The connection pool fills, requests queue, timeouts you assumed existed never fire, and the first signal you get is angry customer tweets twenty minutes after the latency spike started. The happy-path tests all pass. The bug only shows up when a dependency misbehaves — which is exactly the condition you never test.
Error injection (also called fault injection) fixes this by deliberately breaking dependencies in a controlled way: slow down a database query, kill a downstream pod, drop packets between two services, or make an external API return 500s on demand. Chaos engineering is the systems-level extension of the same idea — running those faults against a live (usually staging) system to verify it holds. This guide shows how to use AI to generate the experiments, the runbooks, and the resilience code, against a concrete stack rather than hand-writing YAML from memory.
Fault injection happens at two layers, and you want AI help at both.
Application / local injection
Inject faults in-process or at a local proxy (Toxiproxy, nock, fault-injecting HTTP clients). Fast, deterministic, runs in CI. Best for proving a single service’s timeout, retry, and fallback logic.
System / chaos injection
Inject faults at the infrastructure layer (Chaos Mesh, LitmusChaos) against a running cluster: pod kills, network delay, partitions, resource stress. Best for verifying recovery, blast radius, and steady-state behavior across services.
Start at the application layer because it is cheap and deterministic, then graduate to system-level chaos once individual services are proven resilient.
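To make the application-layer option concrete, here is a minimal sketch of in-process injection: a wrapper that adds latency or throws before delegating to the real call. `withFaults`, its options, and `fetchPrice` are hypothetical names invented for this illustration; they are not part of Toxiproxy or nock.

```javascript
// Hypothetical helper: wraps any async function and injects latency and/or
// an error before delegating to it. Names and options are illustrative.
function withFaults(fn, { latencyMs = 0, failureRate = 0, random = Math.random } = {}) {
  return async (...args) => {
    if (latencyMs > 0) {
      // Simulate a slow dependency.
      await new Promise((resolve) => setTimeout(resolve, latencyMs));
    }
    // failureRate is a probability in [0, 1]. `random` is injectable so a
    // test can make the outcome deterministic.
    if (random() < failureRate) {
      const err = new Error('injected fault');
      err.code = 'EINJECTED';
      throw err;
    }
    return fn(...args);
  };
}

// Stand-in for a real dependency call (an assumption for this sketch).
async function fetchPrice(sku) {
  return { sku, cents: 1999 };
}

// Two faulty variants of the same call: one always fails, one is only slow.
const alwaysFails = withFaults(fetchPrice, { failureRate: 1 });
const slowButOk = withFaults(fetchPrice, { latencyMs: 50 });
```

Passing a fixed `random` (or a `failureRate` of 0 or 1) keeps the failure path deterministic in CI, which is the main reason to start at this layer.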
The fastest way to prove a service handles a slow dependency is to put a controllable proxy in front of it. Toxiproxy sits between your app and PostgreSQL (or Redis, or any TCP service) and lets you add latency, cut bandwidth, or sever the connection on demand.
A minimal real setup for a Node/Express service talking to PostgreSQL:
    import { Toxiproxy } from 'toxiproxy-node-client';
    import { Pool } from 'pg';
    // createOrder is the service function under test; import it from
    // wherever it lives in your codebase.

    const toxiproxy = new Toxiproxy('http://localhost:8474');

    // App connects to PG *through* the proxy on 5433, not directly on 5432.
    const pool = new Pool({ host: 'localhost', port: 5433, database: 'app' });

    test('order creation fails fast when the DB is slow, instead of hanging', async () => {
      const proxy = await toxiproxy.get('postgres');
      // Inject 4s of latency on every query through the proxy, and keep the
      // returned handle so the toxic can be removed afterwards.
      const toxic = await proxy.addToxic({ type: 'latency', attributes: { latency: 4000 } });

      try {
        const start = Date.now();
        // statement_timeout / pool timeout should trip well before 4s.
        await expect(createOrder(pool, { sku: 'ABC', qty: 1 })).rejects.toThrow(/timeout/i);
        expect(Date.now() - start).toBeLessThan(3000);
      } finally {
        await toxic.remove(); // always clean up, even when an assertion fails
      }
    });

The point is not the proxy plumbing; it is that you now have a deterministic way to assert “this code fails fast.” Let AI generate the experiment matrix and the assertions for your real service.
Open the service under test plus its data-access layer, then use Agent mode so Cursor can see your real timeout configuration and pool settings.
Claude Code can generate the suite, run it against a local Toxiproxy container, and iterate until the assertions pass — which surfaces missing timeouts immediately.
Codex’s strength here is multi-surface: kick off a cloud task that opens a PR with the resilience suite plus any timeout fixes it had to make, then review the diff before merging. The same prompt runs from the Codex CLI (codex "...") or the IDE extension if you prefer to stay local.
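The "timeout fixes" these tools tend to surface usually amount to putting an explicit deadline around a dependency call. The following is a minimal sketch of that pattern, not the fix any particular tool will produce; `withTimeout` and `demo` are hypothetical names, and the 50 ms and 200 ms values are chosen only to keep the example fast.

```javascript
// Hypothetical helper: rejects if `work` has not settled within `ms`.
function withTimeout(work, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timeout after ${ms}ms`)), ms);
  });
  // Whichever promise settles first wins. The timer is always cleared so it
  // cannot keep the process alive after the work finishes.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Stand-in for a query stuck behind injected latency (200ms here).
async function demo() {
  const slowQuery = new Promise((resolve) => setTimeout(() => resolve('row'), 200));
  try {
    return await withTimeout(slowQuery, 50, 'createOrder query');
  } catch (err) {
    return err.message; // the caller fails fast instead of waiting 200ms
  }
}
```

Note that `Promise.race` only stops the caller from waiting; it does not cancel the underlying query. For PostgreSQL, node-postgres also accepts settings such as `statement_timeout` and `connectionTimeoutMillis` in its pool configuration, which bound the work at the driver or server rather than just abandoning it.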
Once individual services fail fast, verify the cluster recovers. Chaos Mesh is a Kubernetes-native chaos platform whose experiments are plain CRDs — which means AI can generate them accurately and you can review them like any other manifest. A single NetworkChaos that delays traffic to your database for ten minutes:
    apiVersion: chaos-mesh.org/v1alpha1
    kind: NetworkChaos
    metadata:
      name: postgres-latency
      namespace: staging
    spec:
      action: delay
      mode: all
      selector:
        namespaces:
          - staging
        labelSelectors:
          app: postgresql
      delay:
        latency: '200ms'
        jitter: '50ms'
      duration: '10m'

Apply it with kubectl apply -f chaos/postgres-latency.yaml, watch your SLO dashboards, and remove it with kubectl delete -f (or let duration expire). Use AI to expand a single fault into a staged experiment with safety controls.
Reference your existing Kubernetes manifests so Cursor uses your real namespaces, labels, and service names instead of placeholders.
Claude Code can generate the manifests and the runbook, then validate them with kubectl --dry-run so you catch schema errors before applying anything to a cluster.
Use a Codex cloud task to land the chaos manifests, the runbook, and the CI wiring as one reviewable PR. Because Codex runs across App, CLI, IDE, and Cloud, you can start the task from Slack or the web and review the diff in the IDE.
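Whichever tool generates the manifests, it can help to check them mechanically before anything is applied. The sketch below expands one NetworkChaos manifest (already parsed into a plain object) into escalating stages and checks each against a simple policy: an explicit duration and staging-only scope. `stageLatency`, `checkChaosSafety`, the policy itself, and the stage latencies are assumptions made for illustration; none of this is a Chaos Mesh API.

```javascript
// Policy assumption for this sketch: experiments may only target staging.
const ALLOWED_NAMESPACES = new Set(['staging']);

// Clone a base NetworkChaos manifest once per latency value, giving each
// stage its own name. The input object is not mutated.
function stageLatency(base, latencies) {
  return latencies.map((latency, i) => ({
    ...base,
    metadata: { ...base.metadata, name: `${base.metadata.name}-stage-${i + 1}` },
    spec: { ...base.spec, delay: { ...base.spec.delay, latency } },
  }));
}

// Return a list of policy violations; an empty list means the manifest passes.
function checkChaosSafety(manifest) {
  const problems = [];
  const spec = manifest.spec ?? {};
  if (!spec.duration) {
    problems.push('spec.duration is missing; this policy requires a bounded run');
  }
  const targets = spec.selector?.namespaces ?? [];
  if (targets.length === 0) {
    problems.push('spec.selector.namespaces must list target namespaces explicitly');
  }
  for (const ns of targets) {
    if (!ALLOWED_NAMESPACES.has(ns)) problems.push(`target namespace "${ns}" is not allowed`);
  }
  if (!ALLOWED_NAMESPACES.has(manifest.metadata?.namespace)) {
    problems.push('metadata.namespace is not an allowed namespace');
  }
  return problems;
}

// The single-fault manifest from this section, as a parsed object.
const base = {
  apiVersion: 'chaos-mesh.org/v1alpha1',
  kind: 'NetworkChaos',
  metadata: { name: 'postgres-latency', namespace: 'staging' },
  spec: {
    action: 'delay',
    mode: 'all',
    selector: { namespaces: ['staging'], labelSelectors: { app: 'postgresql' } },
    delay: { latency: '200ms', jitter: '50ms' },
    duration: '10m',
  },
};

// Illustrative escalation: start small, then increase the injected delay.
const stages = stageLatency(base, ['100ms', '200ms', '500ms']);
const violations = stages.flatMap(checkChaosSafety);
```

A check like this complements `kubectl --dry-run` rather than replacing it: the dry run catches schema errors, while the policy check catches manifests that are valid but aimed at the wrong place or left unbounded.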
Injecting a fault is only half the test. The other half is asserting that your timeout, retry, and circuit-breaker logic actually engaged. A real circuit-breaker test using a fault-injecting mock for an external API:
    import nock from 'nock';
    import { getInventory } from '../../src/clients/inventory.client.js';

    test('circuit opens after 5 consecutive failures, then serves the fallback', async () => {
      // Allow exactly 6 failing responses: 5 to trip the breaker, plus 1 that
      // must stay unused if the breaker really short-circuits the 6th call.
      const scope = nock('https://inventory.internal').get(/.*/).times(6).reply(500);

      // Drive enough failures to trip the breaker (threshold = 5).
      for (let i = 0; i < 5; i++) {
        await expect(getInventory('SKU-1')).rejects.toThrow();
      }

      // 6th call should short-circuit and return the cached fallback,
      // NOT hit the network again.
      const result = await getInventory('SKU-1');
      expect(result.source).toBe('fallback-cache');
      expect(scope.isDone()).toBe(false); // 6th response unused: breaker blocked the call
    });

Run the cheap application-layer fault tests on every PR, and schedule heavier chaos experiments against staging on a cadence. A scheduled workflow that runs the resilience suite and can also fire a single named chaos experiment on demand:
    name: Continuous Fault Injection

    on:
      schedule:
        - cron: '0 */6 * * *' # every 6 hours
      workflow_dispatch:
        inputs:
          experiment:
            description: 'Chaos Mesh manifest to apply (staging only)'
            required: false
            default: 'postgres-latency'

    jobs:
      resilience-tests:
        runs-on: ubuntu-latest
        services:
          toxiproxy:
            image: ghcr.io/shopify/toxiproxy
            ports:
              - 8474:8474
              - 5433:5433
        steps:
          - uses: actions/checkout@v5
          - uses: actions/setup-node@v4
            with:
              node-version: 22
          - run: npm ci
          - name: Run fault-injection resilience suite
            run: npx jest test/resilience/

      staging-chaos:
        if: github.event_name == 'workflow_dispatch'
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v5
          - name: Pre-flight - abort if staging is unhealthy
            run: ./scripts/preflight.sh --slo-status green --active-incidents none
          - name: Apply chaos experiment (auto-expires via spec.duration)
            run: kubectl apply -n staging -f chaos/${{ github.event.inputs.experiment }}.yaml

Chaos experiments have their own failure modes. These are the ones that bite teams hardest.
Map dependencies and define steady state. List every external call (DB, cache, queue, third-party API) and the SLO each must hold under stress.
Inject at the application layer first. Use Toxiproxy and nock to prove each service’s timeouts, retries, and fallbacks fire deterministically in CI.
Graduate to system-level chaos. Once services fail fast, run Chaos Mesh experiments against staging to verify recovery and blast radius.
Codify safety. Bound every experiment with duration, scope selectors to staging, and keep abort commands in a runbook.
Automate the cadence. Run cheap fault tests on every PR; schedule heavier chaos against staging and gate it behind pre-flight health checks.
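The steady-state definition and pre-flight gate from these steps can be sketched as a small decision function. The metric names, thresholds, and the `preflight` signature are assumptions for illustration; the `./scripts/preflight.sh` script referenced in the workflow is not shown in this guide and may work differently.

```javascript
// Hypothetical steady-state definition: each entry names a metric and the
// upper bound it must stay under. The values are illustrative, not advice.
const steadyState = [
  { metric: 'p99LatencyMs', max: 500 },
  { metric: 'errorRate', max: 0.01 },
];

// Decide whether it is safe to start a chaos experiment, given a snapshot
// of current metrics and the number of open incidents.
function preflight(snapshot, { activeIncidents = 0 } = {}) {
  const reasons = [];
  if (activeIncidents > 0) {
    reasons.push(`${activeIncidents} active incident(s)`);
  }
  for (const { metric, max } of steadyState) {
    const value = snapshot[metric];
    if (typeof value !== 'number') {
      // Missing data is treated as unhealthy rather than assumed fine.
      reasons.push(`${metric} is missing from the snapshot`);
    } else if (value > max) {
      reasons.push(`${metric}=${value} exceeds ${max}`);
    }
  }
  return { go: reasons.length === 0, reasons };
}

// Example snapshots: one healthy, one already breaching the latency bound.
const healthy = preflight({ p99LatencyMs: 180, errorRate: 0.002 });
const degraded = preflight({ p99LatencyMs: 900, errorRate: 0.002 });
```

The same function can be run again during and after an experiment, so "the system held" and "the system recovered" are checked against the same definition that gated the start.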