Testing Integration

Your checkout flow passes every unit test, the coverage badge says 94%, and it still 500s in production because nothing ever exercised the path where the payments provider times out mid-request. The tests were green. They were also testing the wrong thing — mostly your mocks asserting that your mocks were called. What you need is a suite that catches integration bugs, not a wall of green checkmarks that lie to you.

This is where an AI coding agent earns its keep. Not by generating a hundred trivial expect(sum(1,2)).toBe(3) tests, but by reading your real code, finding the failure modes you skipped, and writing tests that fail for the right reasons. This article shows the workflow across Cursor, Claude Code, and Codex.

What You’ll Walk Away With

A TDD loop where the agent writes a failing test first, then the implementation — so you know the test can actually fail
A reusable prompt that generates integration tests for an Express + Drizzle route, including the DB-failure path that asserts a real 503
A Playwright E2E prompt that survives in CI instead of flaking on the third run
A repeatable way to fix flaky tests by attacking the root cause, not papering over it with sleep()
A headless claude -p test command and a .claude/commands recipe you can drop into a real repo today

The Running Example

We will test one real endpoint throughout: an Express route that creates an order, backed by Drizzle ORM on Postgres. Nothing here is a toy — it has the two things that break in production: an external call (the payment provider) and a database write that can fail.

import { Router } from 'express';
import { db } from '../db';
import { orders } from '../db/schema';
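// PaymentError is assumed to be exported alongside charge; the handler below needs it.
import { charge, PaymentError } from '../lib/payments';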

export const ordersRouter = Router();

ordersRouter.post('/orders', async (req, res) => {
  const { userId, amountCents, idempotencyKey } = req.body;
  if (!userId || !amountCents) {
    return res.status(400).json({ error: 'userId and amountCents required' });
  }

  try {
    const payment = await charge({ amountCents, idempotencyKey });
    const [order] = await db
      .insert(orders)
      .values({ userId, amountCents, paymentId: payment.id, status: 'paid' })
      .returning();
    return res.status(201).json(order);
  } catch (err) {
    if (err instanceof PaymentError) return res.status(402).json({ error: 'payment_failed' });
    // DB write failed after a successful charge -- the dangerous case
    return res.status(503).json({ error: 'order_persist_failed', retryable: true });
  }
});

The interesting test is not “201 on the happy path.” It is: the charge succeeded but the DB insert threw — do we return a retryable 503, and do we avoid double-charging on retry? That is the bug that pages you at 2am.
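The double-charge half of that question comes down to whether something honors the idempotency key. As a rough illustration only, here is an in-memory sketch of the idea. `withIdempotency` and `makeFakeProvider` are hypothetical names, not part of the route above, and a real system would persist keys or rely on the payment provider's own idempotency support.

```typescript
import { strict as assert } from "node:assert";

type Payment = { id: string };
type ChargeArgs = { amountCents: number; idempotencyKey: string };
type ChargeFn = (args: ChargeArgs) => Promise<Payment>;

// Hypothetical wrapper: remembers the in-flight or completed charge per key,
// so a retry with the same key reuses the first result instead of charging again.
function withIdempotency(charge: ChargeFn): ChargeFn {
  const seen = new Map<string, Promise<Payment>>();
  return (args) => {
    const existing = seen.get(args.idempotencyKey);
    if (existing) return existing;
    const pending = charge(args);
    seen.set(args.idempotencyKey, pending);
    // Forget failed charges so the client is allowed to retry them.
    pending.catch(() => seen.delete(args.idempotencyKey));
    return pending;
  };
}

// Fake provider (an assumption for the demo) that counts real charge calls.
function makeFakeProvider() {
  let calls = 0;
  const charge: ChargeFn = async () => ({ id: `pay_${++calls}` });
  return { charge, callCount: () => calls };
}

async function demo(): Promise<void> {
  const provider = makeFakeProvider();
  const charge = withIdempotency(provider.charge);
  const first = await charge({ amountCents: 4999, idempotencyKey: "k1" });
  const retry = await charge({ amountCents: 4999, idempotencyKey: "k1" });
  assert.equal(retry.id, first.id); // the retry sees the original payment
  assert.equal(provider.callCount(), 1); // the provider was charged once
}
demo();
```

The assertion on `callCount()` is the same shape as the call-count assertion an integration test would make against the mocked `charge`.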

The TDD Loop That Actually Catches Bugs

The discipline that makes AI-generated tests trustworthy is simple: make the test fail before you let the agent implement anything. A test that has never been red is not a test, it is a comment. Drive the agent through red → green → refactor explicitly.

1. Write the failing test first. Tell the agent to write the test for a behavior that does not exist yet, and to stop before implementing.
2. Run it and confirm it fails for the right reason. Not a typo, not a missing import, but a genuine assertion failure or a 404 on a route you have not built.
3. Implement to green. Let the agent write the minimum code to pass, with an explicit instruction not to touch the test.
4. Refactor under a green bar. Now the suite is your safety net for cleanup.

The mechanics of running that loop differ per tool. The prompt is nearly identical; how you keep the agent honest is not.

In Cursor, use Agent mode and lean on checkpoints. The failing test is a natural checkpoint before the implement step: if the agent “fixes” the test instead of the code (a classic failure), restore to that checkpoint and re-prompt.

In Composer, with orders.ts and the empty orders.test.ts in context:

Write a Vitest integration test for POST /orders covering the case where
charge() resolves but the Drizzle insert rejects. Assert a 503 with
{ retryable: true }. Do NOT implement the route yet -- the test must fail.

Run npm run test in Cursor’s terminal, watch it go red, then start a new prompt: “Now make this pass without editing the test file.” Keep “Iterate on lints” on so type errors get fixed in the same turn.

Claude Code shines here because you can gate the loop with permissions and run it in CI. Restrict the first step to writing tests only:

claude -p "Write a Vitest integration test for POST /orders: charge()
succeeds but the Drizzle insert rejects, expect 503 with retryable:true.
Do not implement the route." \
  --allowedTools "Read" "Write" "Bash(npm run test*)"

Because Edit is not in --allowedTools, the agent cannot patch the existing route in place to make a half-baked test pass. Write can still create or overwrite files, though, so check the diff to confirm that only the test file changed. Then drop the gate and run the implement step interactively.

In the Codex TUI, make the writable sandbox and interactive approval policy explicit:

codex --sandbox workspace-write -c approval_policy=on-request

Prompt it to write the failing test, let it run the suite (Codex executes the command in its sandbox), and review the red output before approving the implementation diff. workspace-write permits workspace edits, while on-request still asks when the policy requires approval.

This is the prompt that makes the whole loop work. It is opinionated on purpose — it names the framework, the exact failure mode, and forbids the agent from cheating.

Write ONE Vitest integration test for the Express route POST /orders.
Scenario: charge() resolves successfully, but db.insert(...).returning()
rejects with a thrown error (simulate a lost DB connection by mocking the
Drizzle call to throw). Assert the response is HTTP 503 and the JSON body
is { error: 'order_persist_failed', retryable: true }.

Constraints:
- Use supertest against the Express app, not a unit test of the handler.
- Mock ONLY the payment provider and the db insert; do not mock express.
- Do NOT implement or modify the route. The test must fail when run now.
Then run `npm run test` and show me the failure output.

Integration Tests: Test the Seams, Not the Mocks

The failure in the opening scenario happened at a seam — the boundary between your code and an external service. Over-mocked suites pass precisely because they never touch those seams. The fix is to mock at the edges (the HTTP boundary of the payment provider, the failure behavior of the DB) and run everything in between for real.

A strong integration-test prompt is specific about three things: what to mock, what to run for real, and which error paths are mandatory.

Generate a Vitest + supertest integration test file for POST /orders.
Run the real Express middleware stack and the real route handler. Mock
only two things:
1. The payment provider (`charge`) — give it a success case and a case
   that throws PaymentError.
2. The Drizzle insert — success, and a thrown connection error.

Cover exactly these cases, one `it()` each:
- 201 + persisted order on the happy path
- 400 when amountCents is missing
- 402 { error: 'payment_failed' } when charge throws PaymentError
- 503 { retryable: true } when the insert throws AFTER a successful charge
- idempotency: two POSTs with the same idempotencyKey charge once

Use beforeEach to reset mocks. No snapshot assertions — assert status
codes and specific body fields. Then run the suite and report results.

Here is the shape of what a good agent produces for the dangerous case — note it asserts behavior (status, body, call count), never implementation details:

import request from 'supertest';
import { it, expect, vi, beforeEach } from 'vitest';
import { app } from '../app';
import * as payments from '../lib/payments';
import { db } from '../db';

beforeEach(() => vi.restoreAllMocks());

it('returns retryable 503 when the DB write fails after a charge', async () => {
  vi.spyOn(payments, 'charge').mockResolvedValue({ id: 'pay_123' });
  vi.spyOn(db, 'insert').mockImplementation(() => {
    throw new Error('connection terminated');
  });

  const res = await request(app)
    .post('/orders')
    .send({ userId: 'u1', amountCents: 4999, idempotencyKey: 'k1' });

  expect(res.status).toBe(503);
  expect(res.body).toMatchObject({ retryable: true });
  expect(payments.charge).toHaveBeenCalledTimes(1);
});
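If spying on module exports gets awkward, one alternative is to pass the two seams in explicitly and test the handler logic directly. The sketch below is a simplified restatement under assumptions, not the route above: `createOrder`, `Deps`, and the local `PaymentError` class are hypothetical names, and the HTTP layer is omitted so the block runs on its own.

```typescript
import { strict as assert } from "node:assert";

class PaymentError extends Error {}

type Order = { id: number; userId: string; amountCents: number; paymentId: string; status: "paid" };
type Deps = {
  charge: (args: { amountCents: number; idempotencyKey?: string }) => Promise<{ id: string }>;
  insertOrder: (order: Omit<Order, "id">) => Promise<Order>;
};
type Input = { userId?: string; amountCents?: number; idempotencyKey?: string };
type Result = { status: number; body: unknown };

// Same branching as the route, with both seams injected instead of imported.
async function createOrder(deps: Deps, input: Input): Promise<Result> {
  const { userId, amountCents, idempotencyKey } = input;
  if (!userId || !amountCents) {
    return { status: 400, body: { error: "userId and amountCents required" } };
  }
  try {
    const payment = await deps.charge({ amountCents, idempotencyKey });
    const order = await deps.insertOrder({ userId, amountCents, paymentId: payment.id, status: "paid" });
    return { status: 201, body: order };
  } catch (err) {
    if (err instanceof PaymentError) return { status: 402, body: { error: "payment_failed" } };
    // Insert failed after a successful charge: the dangerous case.
    return { status: 503, body: { error: "order_persist_failed", retryable: true } };
  }
}

async function demo(): Promise<void> {
  let charges = 0;
  const res = await createOrder(
    {
      charge: async () => { charges++; return { id: "pay_123" }; },
      insertOrder: async () => { throw new Error("connection terminated"); },
    },
    { userId: "u1", amountCents: 4999, idempotencyKey: "k1" },
  );
  assert.equal(res.status, 503);
  assert.deepEqual(res.body, { error: "order_persist_failed", retryable: true });
  assert.equal(charges, 1);
}
demo();
```

The trade-off is that this skips the Express middleware stack, so it complements the supertest case rather than replacing it.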

Where MCP servers change the integration story

If you spin up a real Postgres for integration tests instead of mocking the DB, the Postgres MCP server (@modelcontextprotocol/server-postgres) lets the agent inspect your live schema and write tests that match real column constraints, not its guess at your schema. Connect it once and the prompt changes from “assume a schema” to “read the orders table and assert against its real NOT NULL constraints.” For browser-driven E2E, the Playwright MCP (@playwright/mcp) lets the agent drive a real page and read the DOM while it writes the test, instead of inventing selectors.

End-to-End: The Tests That Flake in CI

E2E tests fail in CI for one reason more than any other: the test races the application. The agent clicked before the button was interactive, asserted before the network call resolved, or relied on a fixed waitForTimeout. The cure is to ban arbitrary waits and force web-first assertions and accessible locators.
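To see why a fixed wait is a race while a retrying assertion is not, here is a framework-free sketch. `expectEventually` and `makeButton` are hypothetical helpers, not Playwright APIs. They only imitate the idea behind web-first assertions, which re-check a condition until it holds or a timeout expires.

```typescript
import { strict as assert } from "node:assert";

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: re-check a condition until it holds or the deadline passes.
async function expectEventually(check: () => boolean, timeoutMs = 2000, intervalMs = 10): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!check()) {
    if (Date.now() > deadline) throw new Error(`condition not met within ${timeoutMs}ms`);
    await sleep(intervalMs);
  }
}

// Simulated button that becomes interactive after a delay,
// the way a page does on a slow CI runner.
function makeButton(readyAfterMs: number) {
  let enabled = false;
  setTimeout(() => { enabled = true; }, readyAfterMs);
  return { isEnabled: () => enabled };
}

async function demo(): Promise<void> {
  // Fixed wait: only correct while the app happens to be faster than the sleep.
  const slow = makeButton(80);
  await sleep(20);
  assert.equal(slow.isEnabled(), false); // a click here would hit a dead button

  // Retrying check: waits as long as the app needs, up to the timeout.
  const button = makeButton(80);
  await expectEventually(() => button.isEnabled());
  assert.equal(button.isEnabled(), true);
}
demo();
```

The retrying version does not care whether the button takes 8 ms or 800 ms, which is the property you want when the same test runs on a laptop and on a shared CI runner.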

Write a Playwright test for the checkout flow: browse to /products, add
the first item to the cart, apply coupon SAVE10, complete checkout with
the test card 4242 4242 4242 4242, and assert the order-confirmation page
shows an order number.

Hard rules:
- NEVER use page.waitForTimeout. Use web-first assertions
  (await expect(locator).toBeVisible()) and auto-waiting actions.
- Locate elements by role and accessible name (getByRole) or data-testid,
  never by brittle CSS like nth-child.
- Use await expect(page).toHaveURL(/\/order\//) to wait for navigation.
- Add a test.step() around each phase so failures point to the phase.
Run it 3 times in a row to prove it is not flaky.

The “run it 3 times” instruction at the end is doing real work: it turns a one-shot generation into a stability check before the test ever reaches your pipeline.

Wiring Tests Into CI

The payoff of headless agents is that test generation and triage can run in CI, not just on your laptop. The three tools take different routes.

Cursor is IDE-first, so the CI half is your normal test runner — Cursor’s value is authoring. A realistic GitHub Actions job that runs the Vitest suite Cursor helped you write:

name: test
on: [pull_request]
jobs:
  vitest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run test -- --coverage

Use Cursor’s Background Agent to draft tests for a new route on a branch while you keep working, then review the diff before it hits this pipeline.

Run Claude Code headlessly to triage a failing suite on every PR and save a structured summary to triage.json. --output-format json makes the result machine-readable:

name: test-triage
on: [pull_request]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: 22
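      - run: npm ci
      # The runner has no claude binary by default; install the CLI first.
      - run: npm install -g @anthropic-ai/claude-code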
      - run: |
          claude -p "Run npm run test. If anything fails, name the failing
          test, the likely root cause, and the one-line fix. Do not edit files." \
            --allowedTools "Bash(npm run test*)" "Read" \
            --output-format json > triage.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Because only Bash(npm run test*) and Read are allowed, the triage run can diagnose but never silently rewrite your tests.
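A later pipeline step can read that file and decide what to do with it. The sketch below assumes the JSON envelope carries a string `result` field and a boolean `is_error` field. Treat both field names as assumptions and check them against the output of your installed CLI version. `readTriage` is a hypothetical helper.

```typescript
import { strict as assert } from "node:assert";

// Assumed envelope: only the fields this sketch reads.
type TriageEnvelope = { is_error?: boolean; result?: string };

// Hypothetical helper: turn the raw file contents into a pass/fail summary.
function readTriage(raw: string): { ok: boolean; summary: string } {
  let parsed: TriageEnvelope | null;
  try {
    parsed = JSON.parse(raw) as TriageEnvelope | null;
  } catch {
    return { ok: false, summary: "triage output was not valid JSON" };
  }
  if (!parsed || parsed.is_error || typeof parsed.result !== "string") {
    return { ok: false, summary: "triage run reported an error or no result" };
  }
  return { ok: true, summary: parsed.result.trim() };
}

// Illustrative payload, not captured from a real run.
const sample = JSON.stringify({ is_error: false, result: "All tests passed.\n" });
assert.deepEqual(readTriage(sample), { ok: true, summary: "All tests passed." });
assert.equal(readTriage("not json").ok, false);
```

In a workflow you would pass it the file contents, for example `readTriage(readFileSync("triage.json", "utf8"))` with `readFileSync` imported from `node:fs`, and attach the summary to the PR however your pipeline already does that.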

Codex Cloud runs the task in an isolated cloud environment, so you can hand off “write the missing tests for this route” without tying up your machine. From the terminal:

# Submit a Cloud task that writes tests in an isolated environment
codex cloud exec --env my-ci-env \
  "Add Vitest integration tests for POST /orders covering the 402, 503,
   and idempotency cases. Run the suite and make it green."

# When it finishes, pull the diff into your local tree to review
codex apply <TASK_ID>

For a non-interactive run, codex exec "..." streams results to stdout. Keep local or supervised runs on --sandbox workspace-write -c approval_policy=on-request. Only in trusted unattended CI, on an isolated checkout with review gates, use codex exec --sandbox workspace-write -c approval_policy=never "...".

A Reusable Slash Command for Test Generation

The .claude/commands/ directory turns a good prompt into a one-word command. The trick the generic version misses: name the stack and the mandatory error paths inside the command body, so every invocation produces a real recipe instead of vague “comprehensive tests.”

Write Vitest + supertest integration tests for the Express route: $ARGUMENTS

Stack: Express, Drizzle ORM (Postgres), Vitest, supertest.
Run the real middleware + handler. Mock only external services and the DB.

Always include these cases, one it() each:
- happy path (correct status + persisted row)
- input validation (400 on missing required fields)
- external dependency throws (assert the mapped error status, e.g. 402)
- DB write fails AFTER an external side effect (assert a retryable 5xx)
- idempotency: a repeated request with the same key has no double effect

Assert status codes and specific body fields. No snapshot tests.
Reset mocks in beforeEach. Run the suite and report pass/fail per case.

Invoke it in a Claude Code session with /test-route POST /orders. The same prompt body works as a Cursor saved prompt or a Codex prompt — the recipe is portable; only the invocation differs.

When This Breaks

The suite is green but production still breaks. You are testing mocks, not seams. Mocking the function under test means the assertion is circular. Re-run the integration prompt above and force real execution of everything except the true external boundary (the payment HTTP call, the DB driver). If your mock and the real API drift, add a contract test that hits a sandbox endpoint nightly.

The agent “fixed” a failing test by weakening the assertion. This is the most common AI testing failure: asked to make tests pass, it edits the test instead of the code. Prevent it structurally — in Claude Code, omit Edit on the test path from --allowedTools during the implement step; in Cursor, restore to the pre-implement checkpoint and re-prompt with “without editing the test file.”

E2E passes locally, flakes in CI. Almost always a race. Grep the generated test for waitForTimeout and delete every hit, then ask the agent to replace each with a web-first assertion or toHaveURL. CI is slower than your laptop, so any fixed wait that “works” locally is a time bomb.

Tests assert implementation, not behavior. If renaming a private method breaks twenty tests, the agent over-coupled them. Prompt: “These tests break on safe refactors. Rewrite them to assert observable behavior — inputs, outputs, status codes, persisted state — never private method names or call order unless ordering is the contract.”

Coverage is high but bugs still ship. Coverage measures lines executed, not assertions made. A test can run a line and assert nothing. Ask the agent to mutation test a critical module: “Introduce three plausible bugs in src/routes/orders.ts one at a time and tell me which tests catch each. Any bug that nothing catches reveals a missing assertion.”
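The gap between executed lines and checked behavior can be shown with a tiny hand-rolled version of that exercise. Everything here is illustrative: `original` mirrors only the validation branch of the route, the mutants are written by hand rather than generated by a mutation-testing tool, and a "suite" is just a function that throws on failure.

```typescript
import { strict as assert } from "node:assert";

type Validate = (body: { userId?: string; amountCents?: number }) => number;

// Original behavior: 400 when a required field is missing, otherwise 201.
const original: Validate = (b) => (!b.userId || !b.amountCents ? 400 : 201);

// Hand-written mutants, each a plausible bug.
const mutants: Record<string, Validate> = {
  "&& instead of ||": (b) => (!b.userId && !b.amountCents ? 400 : 201),
  "skips amount check": (b) => (!b.userId ? 400 : 201),
  "always 201": () => 201,
};

// A suite is a function that throws when the implementation is wrong.
type Suite = (impl: Validate) => void;

// Executes the code (full line coverage) but asserts nothing.
const weakSuite: Suite = (impl) => {
  impl({ userId: "u1", amountCents: 4999 });
};

// Asserts the observable result for each branch.
const strongSuite: Suite = (impl) => {
  assert.equal(impl({ userId: "u1", amountCents: 4999 }), 201);
  assert.equal(impl({ userId: "u1" }), 400);
  assert.equal(impl({ amountCents: 4999 }), 400);
};

// Returns the names of mutants the suite fails to detect.
function survivors(suite: Suite): string[] {
  return Object.entries(mutants)
    .filter(([, mutant]) => {
      try {
        suite(mutant);
        return true; // suite passed against a buggy implementation
      } catch {
        return false; // suite caught the bug
      }
    })
    .map(([name]) => name);
}

strongSuite(original); // sanity check: the real implementation passes
assert.equal(survivors(weakSuite).length, 3); // every bug slips through
assert.equal(survivors(strongSuite).length, 0); // every bug is caught
```

Both suites execute the same line of `original`, so a coverage report cannot tell them apart. Only the survivor count can.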

What’s Next

Debugging Workflows: turn a failing test into a root-cause fix

MCP Setup: wire up Postgres and Playwright MCP for real-data tests

Custom Commands: build a library of slash-command recipes

CI/CD Integration: run agents headlessly in your pipeline