Error Injection and Fault Injection Testing

Your payment service degrades silently when PostgreSQL gets slow under load. The connection pool fills, requests queue, timeouts you assumed existed never fire, and the first signal you get is angry customer tweets twenty minutes after the latency spike started. The happy-path tests all pass. The bug only shows up when a dependency misbehaves — which is exactly the condition you never test.

Error injection (also called fault injection) fixes this by deliberately breaking dependencies in a controlled way: slow down a database query, kill a downstream pod, drop packets between two services, or make an external API return 500s on demand. Chaos engineering is the systems-level extension of the same idea: running those faults against a live (usually staging) system to verify that it holds. This guide shows how to use AI coding assistants to generate the experiments, the runbooks, and the resilience code for a concrete stack, instead of hand-writing YAML from memory.

What You’ll Walk Away With

A repeatable workflow for injecting latency, dependency failures, and resource pressure with Cursor, Claude Code, and Codex
Real, runnable artifacts: a Toxiproxy setup for local fault injection and a Chaos Mesh manifest for Kubernetes
Copy-paste prompts that name your stack and produce experiments plus rollback-aware runbooks, not generic advice
A test that proves your timeouts, retries, and circuit breakers actually fire under failure
A “When This Breaks” checklist for the failure modes of chaos experiments themselves

The Two Layers of Fault Injection

Fault injection happens at two layers, and you want AI help at both.

Application / local injection

Inject faults in-process or at a local proxy (Toxiproxy, nock, fault-injecting HTTP clients). Fast, deterministic, runs in CI. Best for proving a single service’s timeout, retry, and fallback logic.

System / chaos injection

Inject faults at the infrastructure layer (Chaos Mesh, LitmusChaos) against a running cluster: pod kills, network delay, partitions, resource stress. Best for verifying recovery, blast radius, and steady-state behavior across services.

Start at the application layer because it is cheap and deterministic, then graduate to system-level chaos once individual services are proven resilient.
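
As a minimal sketch of in-process injection without any proxy, the wrapper below adds latency and failures around an async dependency call. The names FaultConfig, planFault, and withFaults are hypothetical and do not come from any library; the random source is injectable so a test can force a specific outcome.

```typescript
// Hypothetical in-process fault injector; none of these names come from a library.
interface FaultConfig {
  latencyMs: number;   // delay added before every call
  failureRate: number; // probability in [0, 1] that a call fails
}

interface FaultPlan {
  delayMs: number;
  fail: boolean;
}

// Pure decision step: given a config and a random draw in [0, 1), pick the fault.
function planFault(config: FaultConfig, draw: number): FaultPlan {
  return {
    delayMs: Math.max(0, config.latencyMs),
    fail: draw < config.failureRate,
  };
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Wraps an async dependency call so each invocation is delayed and may fail.
function withFaults<A extends unknown[], R>(
  fn: (...args: A) => Promise<R>,
  config: FaultConfig,
  random: () => number = Math.random,
): (...args: A) => Promise<R> {
  return async (...args: A): Promise<R> => {
    const plan = planFault(config, random());
    if (plan.delayMs > 0) await sleep(plan.delayMs);
    if (plan.fail) throw new Error("injected fault: dependency unavailable");
    return fn(...args);
  };
}
```

Wrapping a client method as withFaults(client.fetch, { latencyMs: 4000, failureRate: 0 }) gives a slow dependency with no infrastructure at all. The trade-off is that it only exercises code paths above the wrapper, which is why a TCP-level proxy is still worth having for driver and pool behavior.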

Workflow 1: Inject Dependency Latency Locally with Toxiproxy

The fastest way to prove a service handles a slow dependency is to put a controllable proxy in front of it. Toxiproxy sits between your app and PostgreSQL (or Redis, or any TCP service) and lets you add latency, cut bandwidth, or sever the connection on demand.

A minimal real setup for a Node/Express service talking to PostgreSQL:

import { Toxiproxy } from 'toxiproxy-node-client';
import { Pool } from 'pg';
// Assumed export and path; adjust to wherever createOrder lives in your service.
import { createOrder } from '../../src/services/order.service.js';

const toxiproxy = new Toxiproxy('http://localhost:8474');

// App connects to PG *through* the proxy on 5433, not directly on 5432.
const pool = new Pool({ host: 'localhost', port: 5433, database: 'app' });

test('order creation fails fast when the DB is slow, instead of hanging', async () => {
  const proxy = await toxiproxy.get('postgres');
  // Delay every response from Postgres by 4s (toxics apply downstream by default).
  const toxic = await proxy.addToxic({ type: 'latency', attributes: { latency: 4000 } });

  try {
    const start = Date.now();
    // A client-side limit (pg's query_timeout or connectionTimeoutMillis) should
    // trip well before 4s. The server-side statement_timeout generally will not,
    // because the query itself still finishes quickly on the server.
    await expect(createOrder(pool, { sku: 'ABC', qty: 1 })).rejects.toThrow(/timeout/i);
    expect(Date.now() - start).toBeLessThan(3000);
  } finally {
    await toxic.remove(); // clean up even when an assertion fails
  }
});

The point is not the proxy plumbing — it is that you now have a deterministic way to assert “this code fails fast.” Let AI generate the experiment matrix and the assertions for your real service.
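
The test assumes a proxy named postgres already exists. As a rough sketch, these are the JSON bodies Toxiproxy's admin API accepts for creating that proxy (POST /proxies) and attaching a latency toxic (POST /proxies/postgres/toxics). The helper names, the injected post function, and the upstream address are assumptions for illustration; only the endpoint paths and field names come from Toxiproxy's HTTP API.

```typescript
// Request bodies for Toxiproxy's admin API (default port 8474).
interface ProxySpec {
  name: string;
  listen: string;   // address the proxy binds to
  upstream: string; // real dependency behind the proxy
  enabled: boolean;
}

interface ToxicSpec {
  name: string;
  type: string;
  stream: "upstream" | "downstream";
  toxicity: number; // fraction of connections affected, 0 to 1
  attributes: Record<string, number>;
}

// Body for POST /proxies. `upstream` is whatever address reaches Postgres.
function postgresProxySpec(upstream: string): ProxySpec {
  return { name: "postgres", listen: "0.0.0.0:5433", upstream, enabled: true };
}

// Body for POST /proxies/postgres/toxics.
function latencyToxicSpec(latencyMs: number, jitterMs: number = 0): ToxicSpec {
  return {
    name: "latency_downstream",
    type: "latency",
    stream: "downstream",
    toxicity: 1.0,
    attributes: { latency: latencyMs, jitter: jitterMs },
  };
}

// Hypothetical transport, injected so the sketch has no HTTP dependency.
type PostJson = (url: string, body: string) => Promise<{ ok: boolean; status: number }>;

async function setUpSlowPostgres(
  post: PostJson,
  adminUrl: string,
  upstream: string,
  latencyMs: number,
): Promise<void> {
  const steps: Array<[string, unknown]> = [
    [`${adminUrl}/proxies`, postgresProxySpec(upstream)],
    [`${adminUrl}/proxies/postgres/toxics`, latencyToxicSpec(latencyMs)],
  ];
  for (const [url, body] of steps) {
    const res = await post(url, JSON.stringify(body));
    if (!res.ok) throw new Error(`Toxiproxy admin call failed: ${url} -> ${res.status}`);
  }
}
```

On Node 18 or later, post can be as small as (url, body) => fetch(url, { method: "POST", body }). Binding the listener to 0.0.0.0 matters when Toxiproxy runs in a container, since a listener bound to the container's loopback is not reachable through a published port.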

Open the service under test plus its data-access layer, then use Agent mode so Cursor can see your real timeout configuration and pool settings.

Copy-paste prompt for generating a Toxiproxy resilience suite:

@src/services/order.service.ts @src/db/pool.ts

Write a Jest resilience test suite using toxiproxy-node-client that routes
PostgreSQL traffic through a Toxiproxy proxy named "postgres" (app connects on
port 5433). Cover these scenarios, each as its own test that adds the toxic,
asserts behavior, then removes the toxic in afterEach:

1. latency toxic (4000ms): createOrder must reject with a timeout error in
   under 3s, proving the client-side timeouts configured in pool.ts (for pg:
   query_timeout / connectionTimeoutMillis) fire.
2. timeout toxic (1000ms then connection cut): the pool must surface a
   connection error, not hang, and must not leak a checked-out client.
3. bandwidth toxic (rate 1KB/s) during a large read: the request must respect
   the 5s request timeout and return a 503, not a partial response.
4. reset_peer toxic: verify the retry policy retries exactly twice with
   backoff, then gives up.

Use the existing timeout values from pool.ts in the assertions - do not invent
new numbers. Follow Arrange-Act-Assert and use descriptive test names.

Claude Code can generate the suite, run it against a local Toxiproxy container, and iterate until the assertions pass — which surfaces missing timeouts immediately.

Copy-paste prompt for generate-run-fix in one pass:

claude "Read src/services/order.service.ts and src/db/pool.ts.

Generate a Jest resilience suite at test/resilience/db-faults.test.js using
toxiproxy-node-client. Route PostgreSQL through a Toxiproxy proxy 'postgres'
on port 5433. Cover latency (4s), connection reset, and bandwidth throttling,
asserting the service fails fast within its configured timeouts and never
leaks a pooled client.

Then start Toxiproxy with:
  docker run -d -p 8474:8474 -p 5433:5433 ghcr.io/shopify/toxiproxy
and create the 'postgres' proxy via the admin API on :8474, listening on
0.0.0.0:5433 and pointing at Postgres on port 5432. Inside the container,
'localhost' is the container itself, so use an address the container can
reach (for example host.docker.internal:5432 on Docker Desktop).

Run: npx jest test/resilience/db-faults.test.js
If a test fails because a timeout is missing in the code, add the missing
query_timeout / connectionTimeoutMillis to pool.ts, then re-run until green."

Codex’s strength here is multi-surface: kick off a cloud task that opens a PR with the resilience suite plus any timeout fixes it had to make, then review the diff before merging. The same prompt runs from the Codex CLI (codex "...") or the IDE extension if you prefer to stay local.

Copy-paste prompt for a Codex cloud task that ships a PR:

Add a fault-injection resilience suite for the order service.

1. Read src/services/order.service.ts and src/db/pool.ts to learn the current
   timeout and retry configuration.
2. Add test/resilience/db-faults.test.js using toxiproxy-node-client. Route
   Postgres through a Toxiproxy proxy 'postgres' on port 5433. Inject latency
   (4s), connection reset, and bandwidth limits; assert fail-fast within the
   existing timeouts and no leaked pool clients.
3. Add a docker-compose.test.yml service for Toxiproxy and wire it into the
   existing CI test job.
4. If any required timeout is missing in pool.ts, add it.

Open a PR titled "test: dependency fault injection for order service" with a
summary of every timeout you added or relied on.
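
Each prompt above asks the agent to add any timeout that is missing from pool.ts. A rough sketch of that audit follows, assuming the service uses node-postgres: the three option names are real pg settings, declared locally here so the snippet has no dependencies, while missingTimeouts and the example values are hypothetical.

```typescript
// Subset of node-postgres Pool options relevant to failing fast.
interface PoolTimeouts {
  connectionTimeoutMillis?: number; // max wait for a connection; 0 or unset waits forever
  query_timeout?: number;           // client-side limit on a single query call
  statement_timeout?: number;       // server-side limit on statement execution
}

const REQUIRED_TIMEOUTS: Array<keyof PoolTimeouts> = [
  "connectionTimeoutMillis",
  "query_timeout",
  "statement_timeout",
];

// Returns the timeout options that are unset or disabled (<= 0).
function missingTimeouts(config: PoolTimeouts): string[] {
  return REQUIRED_TIMEOUTS.filter((key) => {
    const value = config[key];
    return value === undefined || value <= 0;
  });
}

// Illustrative values only; pick numbers that fit your request budget.
const exampleConfig: PoolTimeouts = {
  connectionTimeoutMillis: 2000,
  query_timeout: 2500,
  statement_timeout: 2000,
};
```

The client-side and server-side limits cover different faults. When a proxy delays the network, the server still finishes the statement quickly, so it is query_timeout that ends the wait; statement_timeout protects against queries that are slow on the server itself.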

Workflow 2: System-Level Chaos with Chaos Mesh

Once individual services fail fast, verify the cluster recovers. Chaos Mesh is a Kubernetes-native chaos platform whose experiments are plain CRDs — which means AI can generate them accurately and you can review them like any other manifest. A single NetworkChaos that delays traffic to your database for ten minutes:

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: postgres-latency
  namespace: staging
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: postgresql
  delay:
    latency: '200ms'
    jitter: '50ms'
  duration: '10m'

Apply it with kubectl apply -f chaos/postgres-latency.yaml, watch your SLO dashboards, and remove it with kubectl delete -f chaos/postgres-latency.yaml (or let duration expire). Use AI to expand a single fault into a staged experiment with safety controls.

Reference your existing Kubernetes manifests so Cursor uses your real namespaces, labels, and service names instead of placeholders.

Copy-paste prompt for a staged Chaos Mesh workflow:

@k8s/staging/

Generate a Chaos Mesh Schedule + Workflow CRD (chaos/progressive.yaml) for the
staging namespace that runs four stages in sequence, each followed by a 5-minute
recovery gap:

1. NetworkChaos delay (200ms +/- 50ms) on app=postgresql for 10m.
2. PodChaos pod-kill, mode fixed-percent with value '50', on
   app=payment-service, followed by a 5m observation window (pod-kill is a
   one-shot action) - verify the deployment self-heals.
3. StressChaos memory stressor (4 workers, 256MB) on app=order-service for 8m.
4. NetworkChaos partition (direction both) on tier=backend for 3m.

Use the exact label selectors from the manifests I referenced. Add a top
comment documenting the steady-state SLO each stage validates (p99 latency,
error rate) and the kubectl command to abort the whole workflow.

Claude Code can generate the manifests and the runbook, then validate them with a kubectl apply dry run, so you catch schema errors before anything is actually applied to a cluster.

Copy-paste prompt for chaos manifests plus a rollback runbook:

claude "Read k8s/staging/ to learn our service labels and namespaces.

Generate chaos/ manifests for Chaos Mesh covering: Postgres network delay,
payment-service pod-kill (50%), order-service memory stress, and a backend
network partition - all scoped to namespace 'staging' with explicit durations.

Then write chaos/RUNBOOK.md with: the steady-state baseline to capture before
each experiment, the SLO thresholds that should trigger an abort (p99 > 2s or
error rate > 1%), the exact 'kubectl delete' abort commands, and how to confirm
full recovery afterward.

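Validate every manifest with: kubectl apply --dry-run=server -f chaos/
(a server-side dry run sends each manifest through API-server validation and
admission webhooks without persisting it; it needs a cluster with the Chaos
Mesh CRDs installed). Fix any schema errors and re-run until all manifests
validate."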

Use a Codex cloud task to land the chaos manifests, the runbook, and the CI wiring as one reviewable PR. Because Codex runs across ChatGPT desktop, CLI, IDE, and Cloud, you can start the task from Slack or the web and review the diff in the IDE.

Copy-paste prompt for a Codex PR with chaos manifests:

Read k8s/staging/ for our service labels and namespaces.

Create a chaos/ directory containing Chaos Mesh manifests for: Postgres network
delay (200ms), payment-service pod-kill (50%), order-service memory stress, and
a backend network partition - all in the staging namespace with bounded
durations.

Add chaos/RUNBOOK.md documenting the steady-state baseline, abort SLO
thresholds, and rollback commands. Add a manual-dispatch GitHub Actions job
(.github/workflows/chaos.yml) that applies a single named experiment against
staging only and auto-deletes it after the duration.

Open a PR titled "chaos: staging fault-injection experiments + runbook".

Verify the Resilience Patterns Fire

Injecting a fault is only half the test. The other half is asserting that your timeout, retry, and circuit-breaker logic actually engaged. A real circuit-breaker test using a fault-injecting mock for an external API:

import nock from 'nock';
import { getInventory } from '../../src/clients/inventory.client.js';

afterEach(() => nock.cleanAll()); // drop unused interceptors between tests

test('circuit opens after 5 consecutive failures, then serves the fallback', async () => {
  // Make the external API fail every call, and count the requests that reach it.
  let networkCalls = 0;
  nock('https://inventory.internal')
    .get(/.*/)
    .times(10)
    .reply(() => {
      networkCalls += 1;
      return [500, 'injected failure'];
    });

  // Drive enough failures to trip the breaker (threshold = 5).
  for (let i = 0; i < 5; i++) {
    await expect(getInventory('SKU-1')).rejects.toThrow();
  }

  // 6th call should short-circuit and return the cached fallback,
  // NOT hit the network again.
  const result = await getInventory('SKU-1');
  expect(result.source).toBe('fallback-cache');
  // With 10 interceptors registered, nock.isDone() is false whether or not the
  // 6th call reached the network, so assert on the request count instead.
  expect(networkCalls).toBe(5);
});
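
For reference, the behavior that test expects can be sketched as a small state machine. This is a simplified consecutive-failure breaker with an injected clock, written for illustration; it is not the implementation of opossum or any other library, and the class and method names are hypothetical.

```typescript
type BreakerState = "closed" | "open" | "halfOpen";

// Simplified consecutive-failure circuit breaker. Time is passed in as `now`
// (milliseconds) so the transitions can be tested without real timers.
class CircuitBreaker {
  private state: BreakerState = "closed";
  private consecutiveFailures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;

  constructor(failureThreshold: number, resetTimeoutMs: number) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
  }

  // True if a request may go to the network right now.
  allowRequest(now: number): boolean {
    if (this.state === "open" && now - this.openedAt >= this.resetTimeoutMs) {
      this.state = "halfOpen"; // let exactly one probe through
      return true;
    }
    return this.state === "closed";
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = "closed";
  }

  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold.
    if (this.state === "halfOpen" || this.consecutiveFailures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = now;
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

A caller checks allowRequest before touching the network and serves the fallback when it returns false, which is the short-circuit the sixth call in the test relies on.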

Copy-paste prompt for circuit-breaker verification:

@src/clients/inventory.client.ts

This client wraps the inventory API with a circuit breaker (opossum). Write a
Jest test using nock that proves the breaker behaves correctly:

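1. Once failures cross the breaker's configured threshold (read
   errorThresholdPercentage and volumeThreshold from the breaker config, do not
   hardcode), the breaker opens and the next call returns the fallback WITHOUT
   a network request - assert by counting the requests nock intercepts, not
   with nock.isDone(), which stays false while spare interceptors remain.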
2. After the resetTimeout, the breaker goes half-open: one probe request is
   allowed; on success it closes, on failure it re-opens.
3. Slow responses (use nock delay > the breaker timeout) count as failures.

Assert on the breaker's emitted events ('open', 'halfOpen', 'close') rather
than on internal state.

CI/CD Integration: Continuous Fault Injection

Run the cheap application-layer fault tests on every PR, and schedule heavier chaos experiments against staging on a cadence. A scheduled workflow that runs the resilience suite and can also fire a single named chaos experiment on demand:

name: Continuous Fault Injection

on:
  schedule:
    - cron: '0 */6 * * *' # every 6 hours
  workflow_dispatch:
    inputs:
      experiment:
        description: 'Chaos Mesh manifest to apply (staging only)'
        required: false
        default: 'postgres-latency'

jobs:
  resilience-tests:
    runs-on: ubuntu-latest
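    services:
      # Assumed database service; image tag and credentials are placeholders.
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: postgres
          POSTGRES_DB: app
      toxiproxy:
        image: ghcr.io/shopify/toxiproxy
        ports:
          - 8474:8474
          - 5433:5433
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      # Toxiproxy starts with no proxies, so create the one the tests expect.
      - name: Create the 'postgres' proxy
        run: |
          curl -sf -X POST http://localhost:8474/proxies \
            -d '{"name":"postgres","listen":"0.0.0.0:5433","upstream":"postgres:5432"}'
      - name: Run fault-injection resilience suite
        run: npx jest test/resilience/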

  staging-chaos:
    if: github.event_name == 'workflow_dispatch'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - name: Pre-flight - abort if staging is unhealthy
        run: ./scripts/preflight.sh --slo-status green --active-incidents none
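      - name: Apply chaos experiment (auto-expires via spec.duration)
        # Pass the input through env so it is never interpolated into the script.
        env:
          EXPERIMENT: ${{ github.event.inputs.experiment }}
        run: kubectl apply -n staging -f "chaos/${EXPERIMENT}.yaml"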
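
The pre-flight step above calls a script whose contents are not shown. As a hedged sketch of the decision it might make, the function below applies the abort thresholds from the runbook prompt (p99 above 2s or error rate above 1%) plus an open-incident check. SloSnapshot, violations, and the way metrics are obtained are all assumptions; the real script may work differently.

```typescript
// Hypothetical pre-flight gate; how the snapshot is collected is out of scope.
interface SloSnapshot {
  p99LatencyMs: number;
  errorRate: number;       // fraction of failed requests, e.g. 0.004 = 0.4%
  activeIncidents: number;
}

interface AbortThresholds {
  maxP99LatencyMs: number;
  maxErrorRate: number;
}

// Thresholds taken from the runbook prompt: p99 > 2s or error rate > 1%.
const RUNBOOK_THRESHOLDS: AbortThresholds = { maxP99LatencyMs: 2000, maxErrorRate: 0.01 };

// Returns the reasons an experiment must not start (or must be aborted).
// An empty array means the system is healthy enough to proceed.
function violations(
  snapshot: SloSnapshot,
  thresholds: AbortThresholds = RUNBOOK_THRESHOLDS,
): string[] {
  const reasons: string[] = [];
  if (snapshot.p99LatencyMs > thresholds.maxP99LatencyMs) {
    reasons.push(`p99 ${snapshot.p99LatencyMs}ms exceeds ${thresholds.maxP99LatencyMs}ms`);
  }
  if (snapshot.errorRate > thresholds.maxErrorRate) {
    reasons.push(`error rate ${snapshot.errorRate} exceeds ${thresholds.maxErrorRate}`);
  }
  if (snapshot.activeIncidents > 0) {
    reasons.push(`${snapshot.activeIncidents} active incident(s)`);
  }
  return reasons;
}
```

Returning reasons rather than a boolean keeps the abort decision auditable: the same list can be printed in the job log before the step exits non-zero.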

When This Breaks

Chaos experiments have their own failure modes. These are the ones that bite teams hardest.

The experiment will not roll back. A PodChaos or NetworkChaos that loses its controller can outlive its intended window. Always set spec.duration so faults auto-expire, and keep the abort command in your runbook: kubectl delete networkchaos <name> -n staging. For Toxiproxy, remove toxics in an afterEach so a failed assertion never leaves the proxy poisoned for the next test.

The blast radius escaped staging. A label selector that is too broad (or a missing namespaces filter) can match production pods. Always scope selector.namespaces explicitly and run experiments from a context that has no production credentials. Review every generated manifest’s selector before applying.
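
That review can be partly automated. The sketch below is a hypothetical lint for a parsed manifest (parsing the YAML is left out); the field names follow the NetworkChaos example earlier in this guide, while lintChaosManifest and the allow-list are assumptions.

```typescript
// Minimal shape of a parsed Chaos Mesh manifest, limited to the fields checked here.
interface ChaosManifest {
  kind: string;
  metadata: { name: string; namespace?: string };
  spec: {
    duration?: string;
    selector?: {
      namespaces?: string[];
      labelSelectors?: Record<string, string>;
    };
  };
}

// Returns human-readable problems; an empty array means the manifest passes.
function lintChaosManifest(manifest: ChaosManifest, allowedNamespaces: string[]): string[] {
  const problems: string[] = [];
  const selector = manifest.spec.selector;
  const namespaces = selector?.namespaces ?? [];

  if (namespaces.length === 0) {
    problems.push("selector.namespaces is not set explicitly");
  }
  for (const ns of namespaces) {
    if (allowedNamespaces.indexOf(ns) === -1) {
      problems.push(`namespace "${ns}" is outside the allow-list`);
    }
  }
  if (Object.keys(selector?.labelSelectors ?? {}).length === 0) {
    problems.push("selector.labelSelectors is empty, so every pod in scope matches");
  }
  if (!manifest.spec.duration) {
    problems.push("spec.duration is not set, so the fault will not auto-expire");
  }
  return problems;
}
```

Running this in CI against every file under chaos/ with an allow-list of ["staging"] turns the selector review into a gate rather than a habit.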

The steady-state baseline drifted. If you compare against a stale baseline, a passing experiment can hide a real regression. Capture the baseline (p99 latency, error rate) immediately before each run, not from last week’s dashboard.
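
A small sketch of that comparison, assuming you can export raw latency samples for the minutes before and during the run. It uses the nearest-rank percentile definition; the function names and the 20% tolerance in the usage note are illustrative, not a recommendation.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// True when the p99 measured during the experiment exceeds the fresh baseline
// p99 by more than `tolerance` (a fraction, e.g. 0.2 for 20%).
function p99Regressed(baseline: number[], during: number[], tolerance: number): boolean {
  const before = percentile(baseline, 99);
  const after = percentile(during, 99);
  return after > before * (1 + tolerance);
}
```

Calling p99Regressed(samplesBefore, samplesDuring, 0.2) flags a run whose p99 rose more than 20% over the baseline captured just before it, instead of comparing against a figure remembered from an old dashboard.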

Timeouts “pass” because nothing was actually injected. A Toxiproxy test where the app connects directly to 5432 instead of the proxy on 5433 will pass while testing nothing. Assert that the fault had an observable effect (elapsed time, error type) so a no-op proxy fails loudly.

The fault is real but the assertion is vacuous. rejects.toThrow() passes for any error, including a typo. Assert on the specific error type or message you expect under failure, and add the elapsed-time bound so “failed fast” is actually verified.
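
The last two failure modes can be caught by the same check. The sketch below classifies a measured outcome against an expected error pattern and a time window with both a lower and an upper bound; all names are hypothetical, and the bounds in the usage note are illustrative.

```typescript
interface Outcome {
  elapsedMs: number;
  errorMessage: string | null; // null means the call succeeded
}

interface Expectation {
  minElapsedMs: number; // below this, the timeout cannot have been what fired
  maxElapsedMs: number; // above this, the call did not fail fast
  errorPattern: RegExp;
}

type Verdict = "ok" | "no-error" | "wrong-error" | "too-slow" | "fault-not-injected";

// Classifies a fault-injection result so each vacuous pass gets its own name.
function judge(outcome: Outcome, expected: Expectation): Verdict {
  if (outcome.errorMessage === null) return "no-error";
  if (!expected.errorPattern.test(outcome.errorMessage)) return "wrong-error";
  if (outcome.elapsedMs > expected.maxElapsedMs) return "too-slow";
  if (outcome.elapsedMs < expected.minElapsedMs) return "fault-not-injected";
  return "ok";
}
```

With a 2s client timeout and a 4s latency toxic, an expectation of { minElapsedMs: 1800, maxElapsedMs: 3000, errorPattern: /timeout/i } rejects a test that errored after a few milliseconds for an unrelated reason, as well as one that threw the wrong error entirely.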

Essential Tools and Resources

Chaos Mesh: Kubernetes-native chaos platform; experiments are plain CRDs, easy for AI to generate and for you to review.

LitmusChaos: CNCF chaos engineering framework with a large experiment hub and GitOps-friendly workflows.

Toxiproxy: Shopify's TCP proxy for deterministic latency, bandwidth, and connection-failure injection in tests and CI.

Principles of Chaos Engineering: The foundational definition of steady state, hypothesis, blast radius, and minimizing harm.

Your Fault-Injection Journey

Map dependencies and define steady state. List every external call (DB, cache, queue, third-party API) and the SLO each must hold under stress.
Inject at the application layer first. Use Toxiproxy and nock to prove each service’s timeouts, retries, and fallbacks fire deterministically in CI.
Graduate to system-level chaos. Once services fail fast, run Chaos Mesh experiments against staging to verify recovery and blast radius.
Codify safety. Bound every experiment with duration, scope selectors to staging, and keep abort commands in a runbook.
Automate the cadence. Run cheap fault tests on every PR; schedule heavier chaos against staging and gate it behind pre-flight health checks.

What’s Next

Integration Testing: Test the service interactions and database state that fault injection then stresses.

E2E Testing: Verify user journeys survive the degraded states you inject.

Performance Testing: Load and benchmark services so chaos experiments run against realistic traffic.

Security Testing: Probe failure modes for the security regressions they can expose.