Testing Integration
Your checkout flow passes every unit test, the coverage badge says 94%, and it still 500s in production because nothing ever exercised the path where the payments provider times out mid-request. The tests were green. They were also testing the wrong thing — mostly your mocks asserting that your mocks were called. What you need is a suite that catches integration bugs, not a wall of green checkmarks that lie to you.
This is where an AI coding agent earns its keep. Not by generating a hundred trivial expect(sum(1,2)).toBe(3) tests, but by reading your real code, finding the failure modes you skipped, and writing tests that fail for the right reasons. This article shows the workflow across Cursor, Claude Code, and Codex.
What You’ll Walk Away With
- A TDD loop where the agent writes a failing test first, then the implementation, so you know the test can actually fail
- A reusable prompt that generates integration tests for an Express + Drizzle route, including the DB-failure path that asserts a real 503
- A Playwright E2E prompt that survives in CI instead of flaking on the third run
- A repeatable way to fix flaky tests by attacking the root cause, not papering over it with sleep()
- A headless claude -p test command and a .claude/commands recipe you can drop into a real repo today
The Running Example
We will test one real endpoint throughout: an Express route that creates an order, backed by Drizzle ORM on Postgres. Nothing here is a toy: it has the two things that break in production, an external call (the payment provider) and a database write that can fail.
import { Router } from 'express';
import { db } from '../db';
import { orders } from '../db/schema';
// Assumption: PaymentError is exported from the same module as charge()
import { charge, PaymentError } from '../lib/payments';

export const ordersRouter = Router();

ordersRouter.post('/orders', async (req, res) => {
  const { userId, amountCents, idempotencyKey } = req.body;
  if (!userId || !amountCents) {
    return res.status(400).json({ error: 'userId and amountCents required' });
  }

  try {
    const payment = await charge({ amountCents, idempotencyKey });
    const [order] = await db
      .insert(orders)
      .values({ userId, amountCents, paymentId: payment.id, status: 'paid' })
      .returning();
    return res.status(201).json(order);
  } catch (err) {
    if (err instanceof PaymentError) return res.status(402).json({ error: 'payment_failed' });
    // DB write failed after a successful charge -- the dangerous case
    return res.status(503).json({ error: 'order_persist_failed', retryable: true });
  }
});

The interesting test is not “201 on the happy path.” It is: the charge succeeded but the DB insert threw. Do we return a retryable 503, and do we avoid double-charging on retry? That is the bug that pages you at 2am.
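To make the double-charge question concrete, here is a minimal sketch in plain TypeScript of a retry after a failed persist. Everything in it is hypothetical: FakeProvider stands in for the payment provider and is assumed to deduplicate on the idempotency key, and createOrder is a stripped-down stand-in for the route logic, not the Express handler itself.

```typescript
// Hypothetical in-memory stand-in for a payment provider that deduplicates
// charges by idempotency key (an assumption about the provider's behavior).
type Payment = { id: string; amountCents: number };

class FakeProvider {
  private byKey = new Map<string, Payment>();
  chargesMade = 0;

  charge(amountCents: number, idempotencyKey: string): Payment {
    const existing = this.byKey.get(idempotencyKey);
    if (existing) return existing; // same key: reuse the earlier charge, take no money
    this.chargesMade += 1;
    const payment: Payment = { id: `pay_${this.chargesMade}`, amountCents };
    this.byKey.set(idempotencyKey, payment);
    return payment;
  }
}

type OrderResult =
  | { status: 201; paymentId: string }
  | { status: 503; retryable: true };

// Stripped-down version of the route logic: charge first, then persist.
function createOrder(
  provider: FakeProvider,
  persist: (paymentId: string) => void,
  amountCents: number,
  idempotencyKey: string,
): OrderResult {
  const payment = provider.charge(amountCents, idempotencyKey);
  try {
    persist(payment.id);
    return { status: 201, paymentId: payment.id };
  } catch {
    // Charge succeeded, write failed: tell the client it is safe to retry.
    return { status: 503, retryable: true };
  }
}

// First attempt: the DB write fails. Second attempt with the same key: it succeeds.
const provider = new FakeProvider();
let dbIsDown = true;
const persist = (_paymentId: string): void => {
  if (dbIsDown) throw new Error('connection terminated');
};

const firstAttempt = createOrder(provider, persist, 4999, 'k1');
dbIsDown = false;
const retryAttempt = createOrder(provider, persist, 4999, 'k1');
```

After the retry, provider.chargesMade is still 1 even though charge() was reached twice. A real suite would make the same assertion against the provider's sandbox or a spy, since the deduplication here is only simulated.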
The TDD Loop That Actually Catches Bugs
The discipline that makes AI-generated tests trustworthy is simple: make the test fail before you let the agent implement anything. A test that has never been red is not a test, it is a comment. Drive the agent through red → green → refactor explicitly.
1. Write the failing test first. Tell the agent to write the test for a behavior that does not exist yet, and to stop before implementing.
2. Run it and confirm it fails for the right reason. Not a typo, not a missing import, but a genuine assertion failure or a 404 on a route you have not built.
3. Implement to green. Let the agent write the minimum code to pass, with an explicit instruction not to touch the test.
4. Refactor under a green bar. Now the suite is your safety net for cleanup.
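One way to make "fails for the right reason" checkable is to look at what kind of error the red run produced. The helper below is a hypothetical sketch, not part of Vitest. It assumes assertion failures surface as errors named AssertionError (as Node's assert module and Chai-style expect do) and treats syntax, reference, and module-resolution errors as the wrong kind of red.

```typescript
type Verdict = 'honest-red' | 'broken-test' | 'unknown';

// Hypothetical helper: decide whether a thrown error is a genuine assertion
// failure or a sign that the test itself is broken.
function classifyFailure(err: unknown): Verdict {
  if (!(err instanceof Error)) return 'unknown';
  if (err.name === 'AssertionError') return 'honest-red';
  if (err instanceof SyntaxError || err instanceof ReferenceError) return 'broken-test';
  // Node reports unresolved imports with these error codes.
  const code = (err as { code?: unknown }).code;
  if (code === 'ERR_MODULE_NOT_FOUND' || code === 'MODULE_NOT_FOUND') return 'broken-test';
  return 'unknown';
}

// Minimal assertion error so the sketch needs no test framework.
class AssertionError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'AssertionError';
  }
}

const assertionFailure = classifyFailure(new AssertionError('expected 404 to be 503'));
const typoFailure = classifyFailure(new ReferenceError('reqest is not defined'));
const missingImport = classifyFailure(
  Object.assign(new Error("Cannot find module '../app'"), { code: 'ERR_MODULE_NOT_FOUND' }),
);
```

Only the first case is a red bar worth implementing against. The other two mean the test needs fixing before the loop continues.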
The mechanics of running that loop differ per tool. The prompt is nearly identical; how you keep the agent honest is not.
In Cursor, use Agent mode and lean on checkpoints. Before the implement step, the failing test is a natural checkpoint: if the agent “fixes” the test instead of the code (a classic failure), restore to that checkpoint and re-prompt.
In Composer, with orders.ts and the empty orders.test.ts in context:

Write a Vitest integration test for POST /orders covering the case where
charge() resolves but the Drizzle insert rejects. Assert a 503 with
{ retryable: true }. Do NOT implement the route yet -- the test must fail.

Run npm run test in Cursor’s terminal, watch it go red, then start a new prompt: “Now make this pass without editing the test file.” Keep “Iterate on lints” on so type errors get fixed in the same turn.
Claude Code shines here because you can gate the loop with permissions and run it in CI. Restrict the first step to writing tests only:
claude -p "Write a Vitest integration test for POST /orders: charge()
succeeds but the Drizzle insert rejects, expect 503 with retryable:true.
Do not implement the route." \
  --allowedTools "Read" "Write" "Bash(npm run test*)"

Because Edit on src/routes/ is not in --allowedTools, the agent cannot patch the existing route in place to make a half-baked test pass; it has to write a test that fails honestly. Write can still create or overwrite files, so confirm that the resulting diff touches only the test file. Then drop the gate and run the implement step interactively.
In the Codex TUI, set approvals so the test run is automatic but edits are visible:
codex --ask-for-approval on-request

Prompt it to write the failing test, let it run the suite (Codex executes the command in its sandbox), and review the red output before approving the implementation diff. For a hands-off local loop, codex --full-auto sets on-request approval with a workspace-write sandbox, which is a good fit once you trust the prompt.
Integration Tests: Test the Seams, Not the Mocks
The failure in the opening scenario happened at a seam: the boundary between your code and an external service. Over-mocked suites pass precisely because they never touch those seams. The fix is to mock at the edges (the HTTP boundary of the payment provider, the failure behavior of the DB) and run everything in between for real.
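As a rough illustration of mocking only at the edges, the sketch below pulls the two seams out as injected functions and runs the handler logic for real. It is plain TypeScript with no Express or Drizzle. handleCreateOrder, Deps, HttpResult, and this local PaymentError class are hypothetical names for the example; the real route keeps its own wiring.

```typescript
// Hypothetical stand-ins for the two seams: the payment call and the DB write.
class PaymentError extends Error {}

type Deps = {
  charge: (amountCents: number) => Promise<{ id: string }>;
  insertOrder: (row: { userId: string; paymentId: string }) => Promise<{ id: number }>;
};

type HttpResult = { status: number; body: Record<string, unknown> };

// Logic under test: validation, ordering, and error mapping all run unmocked.
async function handleCreateOrder(
  deps: Deps,
  input: { userId?: string; amountCents?: number },
): Promise<HttpResult> {
  if (!input.userId || !input.amountCents) {
    return { status: 400, body: { error: 'userId and amountCents required' } };
  }
  try {
    const payment = await deps.charge(input.amountCents);
    const order = await deps.insertOrder({ userId: input.userId, paymentId: payment.id });
    return { status: 201, body: { orderId: order.id } };
  } catch (err) {
    if (err instanceof PaymentError) return { status: 402, body: { error: 'payment_failed' } };
    return { status: 503, body: { error: 'order_persist_failed', retryable: true } };
  }
}

// Edge fakes: each one simulates a single behavior of a boundary, nothing more.
const okCharge = async (): Promise<{ id: string }> => ({ id: 'pay_123' });
const declinedCharge = async (): Promise<{ id: string }> => {
  throw new PaymentError('card_declined');
};
const okInsert = async (): Promise<{ id: number }> => ({ id: 1 });
const brokenInsert = async (): Promise<{ id: number }> => {
  throw new Error('connection terminated');
};

const validInput = { userId: 'u1', amountCents: 4999 };
const scenarios = Promise.all([
  handleCreateOrder({ charge: okCharge, insertOrder: okInsert }, validInput),
  handleCreateOrder({ charge: declinedCharge, insertOrder: okInsert }, validInput),
  handleCreateOrder({ charge: okCharge, insertOrder: brokenInsert }, validInput),
  handleCreateOrder({ charge: okCharge, insertOrder: okInsert }, { userId: 'u1' }),
]);
```

The four scenarios resolve to 201, 402, 503, and 400 in that order. None of the fakes replaces handleCreateOrder itself, which is the point: the assertions are about what the real logic does when a boundary misbehaves.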
A strong integration-test prompt is specific about three things: what to mock, what to run for real, and which error paths are mandatory.
Here is the shape of what a good agent produces for the dangerous case. Note that it asserts behavior (status, body, call count), never implementation details:

import request from 'supertest';
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { app } from '../app';
import * as payments from '../lib/payments';
import { db } from '../db';

beforeEach(() => vi.restoreAllMocks());

it('returns retryable 503 when the DB write fails after a charge', async () => {
  vi.spyOn(payments, 'charge').mockResolvedValue({ id: 'pay_123' });
  vi.spyOn(db, 'insert').mockImplementation(() => {
    throw new Error('connection terminated');
  });

  const res = await request(app)
    .post('/orders')
    .send({ userId: 'u1', amountCents: 4999, idempotencyKey: 'k1' });

  expect(res.status).toBe(503);
  expect(res.body).toMatchObject({ retryable: true });
  expect(payments.charge).toHaveBeenCalledTimes(1);
});

Where MCP servers change the integration story
If you spin up a real Postgres for integration tests instead of mocking the DB, the Postgres MCP server (@modelcontextprotocol/server-postgres) lets the agent inspect your live schema and write tests that match real column constraints, not its guess at your schema. Connect it once and the prompt changes from “assume a schema” to “read the orders table and assert against its real NOT NULL constraints.” For browser-driven E2E, the Playwright MCP (@playwright/mcp) lets the agent drive a real page and read the DOM while it writes the test, instead of inventing selectors.
End-to-End: The Tests That Flake in CI
E2E tests fail in CI for one reason more than any other: the test races the application. The agent clicked before the button was interactive, asserted before the network call resolved, or relied on a fixed waitForTimeout. The cure is to ban arbitrary waits and force web-first assertions and accessible locators.
The “run it 3 times” instruction at the end is doing real work: it turns a one-shot generation into a stability check before the test ever reaches your pipeline.
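The difference between a fixed wait and a retrying assertion is easy to show without a browser. The sketch below is plain TypeScript and only imitates the idea behind Playwright's web-first assertions: poll until the condition holds or a deadline passes. expectEventually, renderButton, and stabilityRun are hypothetical names, not Playwright APIs. The last function repeats the check three times with different delays, in the spirit of the stability run.

```typescript
// Hypothetical helper: retry a condition until it holds or the deadline passes.
async function expectEventually(
  condition: () => boolean,
  opts: { timeoutMs: number; intervalMs: number },
): Promise<void> {
  const deadline = Date.now() + opts.timeoutMs;
  while (!condition()) {
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${opts.timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, opts.intervalMs));
  }
}

// Simulated UI: the button becomes enabled after a delay the test cannot know,
// the way a slow CI runner makes hydration or network timing unpredictable.
function renderButton(delayMs: number): { enabled: boolean } {
  const button = { enabled: false };
  setTimeout(() => {
    button.enabled = true;
  }, delayMs);
  return button;
}

// Repeat the same check with different delays. A fixed sleep tuned for the
// fastest case would fail on the slowest one; polling passes on all of them.
async function stabilityRun(delaysMs: number[]): Promise<number> {
  let passes = 0;
  for (const delayMs of delaysMs) {
    const button = renderButton(delayMs);
    await expectEventually(() => button.enabled, { timeoutMs: 1000, intervalMs: 5 });
    passes += 1;
  }
  return passes;
}

const stability = stabilityRun([5, 40, 120]);
```

The timeout is an upper bound, not a wait: a fast run finishes as soon as the condition holds, and a slow run still passes as long as it stays under the limit.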
Wiring Tests Into CI
The payoff of headless agents is that test generation and triage can run in CI, not just on your laptop. The three tools take different routes.
Cursor is IDE-first, so the CI half is your normal test runner — Cursor’s value is authoring. A realistic GitHub Actions job that runs the Vitest suite Cursor helped you write:
name: test
on: [pull_request]
jobs:
  vitest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run test -- --coverage

Use Cursor’s Background Agent to draft tests for a new route on a branch while you keep working, then review the diff before it hits this pipeline.
Run Claude Code headlessly to triage a failing suite on every PR and post a structured summary. --output-format json makes the result machine-readable:
name: test-triage
on: [pull_request]
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: 22
      - run: npm ci
      # The claude CLI is not preinstalled on hosted runners
      - run: npm install -g @anthropic-ai/claude-code
      - run: |
          claude -p "Run npm run test. If anything fails, name the failing test, the likely root cause, and the one-line fix. Do not edit files." \
            --allowedTools "Bash(npm run test*)" "Read" \
            --output-format json > triage.json
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Because only Bash(npm run test*) and Read are allowed, the triage run can diagnose but never silently rewrite your tests.
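A later step can turn triage.json into a PR comment. The sketch below assumes the JSON payload contains a result string and an is_error boolean; treat those field names as an assumption and check the output of your CLI version before relying on them. formatTriage and the sample payload are hypothetical.

```typescript
// Assumed shape of the headless JSON output; verify against your CLI version.
type TriagePayload = { result?: unknown; is_error?: unknown };

// Hypothetical formatter: build a Markdown comment body from the raw JSON text.
function formatTriage(rawJson: string): string {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawJson);
  } catch {
    return 'Test triage: could not parse triage.json.';
  }
  if (typeof parsed !== 'object' || parsed === null) {
    return 'Test triage: unexpected output format.';
  }
  const payload = parsed as TriagePayload;
  if (payload.is_error === true) {
    return 'Test triage: the agent run reported an error.';
  }
  if (typeof payload.result !== 'string' || payload.result.trim() === '') {
    return 'Test triage: no summary was produced.';
  }
  return `### Test triage\n\n${payload.result.trim()}`;
}

// Sample payload for illustration only; real output carries more fields.
const samplePayload = JSON.stringify({ is_error: false, result: ' All tests passed. ' });
const commentBody = formatTriage(samplePayload);
```

In the workflow, the raw text would come from reading triage.json off disk, and the returned string would be posted with whatever commenting step the pipeline already uses. Every fallback branch returns a short message so a malformed file never fails the job by itself.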
Codex Cloud runs the task in an isolated worktree, so you can hand off “write the missing tests for this route” without tying up your machine. From the terminal:
# Submit a Cloud task that writes tests in an isolated environment
codex cloud exec --env my-ci-env \
  "Add Vitest integration tests for POST /orders covering the 402, 503, and idempotency cases. Run the suite and make it green."

# When it finishes, pull the diff into your local tree to review
codex apply <TASK_ID>

For a local non-interactive run inside CI, codex exec "..." streams results to stdout (add --full-auto for an autonomous workspace-write sandbox).
A Reusable Slash Command for Test Generation
The .claude/commands/ directory turns a good prompt into a one-word command. The trick the generic version misses: name the stack and the mandatory error paths inside the command body, so every invocation produces a real recipe instead of vague “comprehensive tests.”
Write Vitest + supertest integration tests for the Express route: $ARGUMENTS

Stack: Express, Drizzle ORM (Postgres), Vitest, supertest.
Run the real middleware + handler. Mock only external services and the DB.

Always include these cases, one it() each:
- happy path (correct status + persisted row)
- input validation (400 on missing required fields)
- external dependency throws (assert the mapped error status, e.g. 402)
- DB write fails AFTER an external side effect (assert a retryable 5xx)
- idempotency: a repeated request with the same key has no double effect

Assert status codes and specific body fields. No snapshot tests.
Reset mocks in beforeEach. Run the suite and report pass/fail per case.

Saved as .claude/commands/test-route.md, this is invoked in a Claude Code session with /test-route POST /orders. The same prompt body works as a Cursor saved prompt or a Codex prompt: the recipe is portable, and only the invocation differs.
When This Breaks
The suite is green but production still breaks. You are testing mocks, not seams. Mocking the function under test means the assertion is circular. Re-run the integration prompt above and force real execution of everything except the true external boundary (the payment HTTP call, the DB driver). If your mock and the real API drift, add a contract test that hits a sandbox endpoint nightly.
The agent “fixed” a failing test by weakening the assertion. This is the most common AI testing failure: asked to make tests pass, it edits the test instead of the code. Prevent it structurally — in Claude Code, omit Edit on the test path from --allowedTools during the implement step; in Cursor, restore to the pre-implement checkpoint and re-prompt with “without editing the test file.”
E2E passes locally, flakes in CI. Almost always a race. Grep the generated test for waitForTimeout and delete every hit, then ask the agent to replace each with a web-first assertion or toHaveURL. CI is slower than your laptop, so any fixed wait that “works” locally is a time bomb.
Tests assert implementation, not behavior. If renaming a private method breaks twenty tests, the agent over-coupled them. Prompt: “These tests break on safe refactors. Rewrite them to assert observable behavior — inputs, outputs, status codes, persisted state — never private method names or call order unless ordering is the contract.”
Coverage is high but bugs still ship. Coverage measures lines executed, not assertions made. A test can run a line and assert nothing. Ask the agent to mutation test a critical module: “Introduce three plausible bugs in src/routes/orders.ts one at a time and tell me which tests catch each. Any bug that nothing catches reveals a missing assertion.”
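The idea behind that prompt can be sketched by hand. Below, a small validation function (hypothetical, loosely modeled on the route's input check) is swapped for three deliberately buggy variants, and a tiny suite is run against each. The weak suite executes the code yet lets every mutant survive; the stronger one kills them all. This is a toy illustration under those assumptions, not a mutation-testing tool.

```typescript
type Validate = (input: { userId?: string; amountCents?: number }) => number;

// Original behavior: 400 when a field is missing or the amount is not positive.
const original: Validate = (i) =>
  !i.userId || !i.amountCents || i.amountCents <= 0 ? 400 : 201;

// Three plausible bugs, introduced one at a time.
const mutants: Record<string, Validate> = {
  'ignores missing userId': (i) => (!i.amountCents || i.amountCents <= 0 ? 400 : 201),
  'accepts negative amounts': (i) => (!i.userId || !i.amountCents ? 400 : 201),
  'always returns 201': () => 201,
};

// A suite returns true when every one of its assertions passes.
type Suite = (validate: Validate) => boolean;

// Weak suite: runs the code but asserts only the happy path.
const weakSuite: Suite = (v) => v({ userId: 'u1', amountCents: 4999 }) === 201;

// Stronger suite: one assertion per rule.
const strongSuite: Suite = (v) =>
  v({ userId: 'u1', amountCents: 4999 }) === 201 &&
  v({ amountCents: 4999 }) === 400 &&
  v({ userId: 'u1', amountCents: -1 }) === 400;

// A mutant survives when the suite still passes against the buggy version.
function survivors(suite: Suite): string[] {
  return Object.entries(mutants)
    .filter(([, mutant]) => suite(mutant))
    .map(([name]) => name);
}

const weakSurvivors = survivors(weakSuite);
const strongSurvivors = survivors(strongSuite);
```

Both suites pass against the original, so a coverage report would not tell them apart. The list of survivors does: each surviving mutant names a rule that no assertion currently protects.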