Testing Integration

Your checkout flow passes every unit test, the coverage badge says 94%, and it still 500s in production because nothing ever exercised the path where the payments provider times out mid-request. The tests were green. They were also testing the wrong thing — mostly your mocks asserting that your mocks were called. What you need is a suite that catches integration bugs, not a wall of green checkmarks that lie to you.

This is where an AI coding agent earns its keep. Not by generating a hundred trivial expect(sum(1,2)).toBe(3) tests, but by reading your real code, finding the failure modes you skipped, and writing tests that fail for the right reasons. This article shows the workflow across Cursor, Claude Code, and Codex.

  • A TDD loop where the agent writes a failing test first, then the implementation — so you know the test can actually fail
  • A reusable prompt that generates integration tests for an Express + Drizzle route, including the DB-failure path that asserts a real 503
  • A Playwright E2E prompt that survives in CI instead of flaking on the third run
  • A repeatable way to fix flaky tests by attacking the root cause, not papering over it with sleep()
  • A headless claude -p test command and a .claude/commands recipe you can drop into a real repo today

We will test one real endpoint throughout: an Express route that creates an order, backed by Drizzle ORM on Postgres. Nothing here is a toy — it has the two things that break in production: an external call (the payment provider) and a database write that can fail.

src/routes/orders.ts
import { Router } from 'express';
import { db } from '../db';
import { orders } from '../db/schema';
// PaymentError is assumed to be exported by ../lib/payments alongside charge()
import { charge, PaymentError } from '../lib/payments';

export const ordersRouter = Router();

ordersRouter.post('/orders', async (req, res) => {
  const { userId, amountCents, idempotencyKey } = req.body;
  if (!userId || !amountCents) {
    return res.status(400).json({ error: 'userId and amountCents required' });
  }
  try {
    // External side effect first, then the database write
    const payment = await charge({ amountCents, idempotencyKey });
    const [order] = await db
      .insert(orders)
      .values({ userId, amountCents, paymentId: payment.id, status: 'paid' })
      .returning();
    return res.status(201).json(order);
  } catch (err) {
    if (err instanceof PaymentError) return res.status(402).json({ error: 'payment_failed' });
    // DB write failed after a successful charge -- the dangerous case
    return res.status(503).json({ error: 'order_persist_failed', retryable: true });
  }
});

The interesting test is not “201 on the happy path.” It is: the charge succeeded but the DB insert threw — do we return a retryable 503, and do we avoid double-charging on retry? That is the bug that pages you at 2am.
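
To make that 2am scenario concrete, here is a minimal, framework-free sketch of the retry path. Everything in it is hypothetical scaffolding for illustration: `createOrder`, the `Deps` interface, and `fakeProvider` are not part of the route above, and the fake assumes the payment provider deduplicates on `idempotencyKey`, which you should confirm against your real provider. The 402 branch is left out for brevity.

```typescript
type Payment = { id: string };
type OrderInput = { userId: string; amountCents: number; idempotencyKey: string };
type Result = { status: number; body: unknown };

// Hypothetical seams: the two edges the route depends on.
interface Deps {
  charge(input: { amountCents: number; idempotencyKey: string }): Promise<Payment>;
  insertOrder(row: { userId: string; amountCents: number; paymentId: string }): Promise<unknown>;
}

// In-memory stand-in for a provider that honors idempotency keys (an assumption):
// a repeated key returns the original payment instead of charging again.
function fakeProvider() {
  const byKey: Record<string, Payment> = {};
  let charges = 0;
  return {
    chargeCount: () => charges,
    charge: async (input: { amountCents: number; idempotencyKey: string }): Promise<Payment> => {
      const prior = byKey[input.idempotencyKey];
      if (prior) return prior;
      charges += 1;
      const payment = { id: `pay_${charges}` };
      byKey[input.idempotencyKey] = payment;
      return payment;
    },
  };
}

// Same control flow as the route's try/catch, minus Express and the 402 branch.
async function createOrder(deps: Deps, input: OrderInput): Promise<Result> {
  try {
    const payment = await deps.charge(input);
    const row = { userId: input.userId, amountCents: input.amountCents, paymentId: payment.id };
    return { status: 201, body: await deps.insertOrder(row) };
  } catch {
    return { status: 503, body: { error: 'order_persist_failed', retryable: true } };
  }
}

// First attempt: the insert fails after the charge. Retry with the same key: the insert succeeds.
async function retryScenario() {
  const provider = fakeProvider();
  let insertShouldFail = true;
  const deps: Deps = {
    charge: provider.charge,
    insertOrder: async (row) => {
      if (insertShouldFail) {
        insertShouldFail = false;
        throw new Error('connection terminated');
      }
      return row;
    },
  };
  const input = { userId: 'u1', amountCents: 4999, idempotencyKey: 'k1' };
  const first = await createOrder(deps, input);
  const retry = await createOrder(deps, input);
  return { first: first.status, retry: retry.status, charges: provider.chargeCount() };
}
```

Under those assumptions, `retryScenario()` resolves to a 503 on the first attempt, a 201 on the retry, and a charge count of 1. A provider fake that ignored the key would report 2 charges, which is the double charge this test exists to catch.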

The discipline that makes AI-generated tests trustworthy is simple: make the test fail before you let the agent implement anything. A test that has never been red is not a test, it is a comment. Drive the agent through red → green → refactor explicitly.

  1. Write the failing test first. Tell the agent to write the test for a behavior that does not exist yet, and to stop before implementing.

  2. Run it and confirm it fails for the right reason. Not a typo, not a missing import — a genuine assertion failure or a 404 on a route you have not built.

  3. Implement to green. Let the agent write the minimum code to pass, with an explicit instruction not to touch the test.

  4. Refactor under a green bar. Now the suite is your safety net for cleanup.

The mechanics of running that loop differ per tool. The prompt is nearly identical; how you keep the agent honest is not.

In Cursor, use Agent mode and lean on checkpoints. The failing test is a natural checkpoint to set before the implement step: if the agent “fixes” the test instead of the code (a classic failure), restore to that checkpoint and re-prompt.

In Composer, with orders.ts and the empty orders.test.ts in context:

Write a Vitest integration test for POST /orders covering the case where
charge() resolves but the Drizzle insert rejects. Assert a 503 with
{ retryable: true }. Do NOT implement the route yet -- the test must fail.

Run npm run test in Cursor’s terminal, watch it go red, then start a new prompt: “Now make this pass without editing the test file.” Keep “Iterate on lints” on so type errors get fixed in the same turn.

Integration Tests: Test the Seams, Not the Mocks

The failure in the opening scenario happened at a seam — the boundary between your code and an external service. Over-mocked suites pass precisely because they never touch those seams. The fix is to mock at the edges (the HTTP boundary of the payment provider, the failure behavior of the DB) and run everything in between for real.

A strong integration-test prompt is specific about three things: what to mock, what to run for real, and which error paths are mandatory.

Here is the shape of what a good agent produces for the dangerous case — note it asserts behavior (status, body, call count), never implementation details:

import request from 'supertest';
import { describe, it, expect, vi, beforeEach } from 'vitest';
import { app } from '../app';
import * as payments from '../lib/payments';
import { db } from '../db';

beforeEach(() => vi.restoreAllMocks());

it('returns retryable 503 when the DB write fails after a charge', async () => {
  vi.spyOn(payments, 'charge').mockResolvedValue({ id: 'pay_123' });
  vi.spyOn(db, 'insert').mockImplementation(() => {
    throw new Error('connection terminated');
  });

  const res = await request(app)
    .post('/orders')
    .send({ userId: 'u1', amountCents: 4999, idempotencyKey: 'k1' });

  expect(res.status).toBe(503);
  expect(res.body).toMatchObject({ retryable: true });
  expect(payments.charge).toHaveBeenCalledTimes(1);
});

Where MCP servers change the integration story

If you spin up a real Postgres for integration tests instead of mocking the DB, the Postgres MCP server (@modelcontextprotocol/server-postgres) lets the agent inspect your live schema and write tests that match real column constraints, not its guess at your schema. Connect it once and the prompt changes from “assume a schema” to “read the orders table and assert against its real NOT NULL constraints.” For browser-driven E2E, the Playwright MCP (@playwright/mcp) lets the agent drive a real page and read the DOM while it writes the test, instead of inventing selectors.

E2E tests fail in CI for one reason more than any other: the test races the application. The agent clicked before the button was interactive, asserted before the network call resolved, or relied on a fixed waitForTimeout. The cure is to ban arbitrary waits and force web-first assertions and accessible locators.

The “run it 3 times” instruction at the end is doing real work: it turns a one-shot generation into a stability check before the test ever reaches your pipeline.
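
The idea behind web-first assertions is "retry the check until it passes or a deadline expires" rather than "sleep and hope." The sketch below shows that idea in plain TypeScript. It is an illustration of the principle, not Playwright's implementation: `waitUntil` and `fakeButton` are hypothetical names, and the timings are arbitrary.

```typescript
// Poll a condition until it holds, or fail once the deadline passes.
async function waitUntil(
  check: () => boolean | Promise<boolean>,
  timeoutMs = 2000,
  intervalMs = 20,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await check()) return;
    if (Date.now() >= deadline) {
      throw new Error(`condition not met within ${timeoutMs}ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical stand-in for a button that becomes interactive after some delay.
function fakeButton(readyAfterMs: number) {
  const readyAt = Date.now() + readyAfterMs;
  return { isEnabled: () => Date.now() >= readyAt };
}

// A fixed sleep has to guess how slow CI is; polling adapts to whatever delay it gets.
async function clickWhenReady(readyAfterMs: number): Promise<string> {
  const button = fakeButton(readyAfterMs);
  await waitUntil(() => button.isEnabled(), 1000, 5);
  return 'clicked';
}
```

In a real Playwright test you would not write this helper yourself: assertions such as `expect(locator).toBeVisible()` already retry, and `npx playwright test --repeat-each=3` is one way to run the stability check locally.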

The payoff of headless agents is that test generation and triage can run in CI, not just on your laptop. The three tools take different routes.

Cursor is IDE-first, so the CI half is your normal test runner — Cursor’s value is authoring. A realistic GitHub Actions job that runs the Vitest suite Cursor helped you write:

.github/workflows/test.yml
name: test
on: [pull_request]
jobs:
  vitest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v5
      - uses: actions/setup-node@v5
        with:
          node-version: 22
          cache: npm
      - run: npm ci
      - run: npm run test -- --coverage

Use Cursor’s Background Agent to draft tests for a new route on a branch while you keep working, then review the diff before it hits this pipeline.

A Reusable Slash Command for Test Generation

The .claude/commands/ directory turns a good prompt into a one-word command. The trick the generic version misses: name the stack and the mandatory error paths inside the command body, so every invocation produces a real recipe instead of vague “comprehensive tests.”

.claude/commands/test-route.md
Write Vitest + supertest integration tests for the Express route: $ARGUMENTS
Stack: Express, Drizzle ORM (Postgres), Vitest, supertest.
Run the real middleware + handler. Mock only external services and the DB.
Always include these cases, one it() each:
- happy path (correct status + persisted row)
- input validation (400 on missing required fields)
- external dependency throws (assert the mapped error status, e.g. 402)
- DB write fails AFTER an external side effect (assert a retryable 5xx)
- idempotency: a repeated request with the same key has no double effect
Assert status codes and specific body fields. No snapshot tests.
Reset mocks in beforeEach. Run the suite and report pass/fail per case.

Invoke it in a Claude Code session with /test-route POST /orders. The same prompt body works as a Cursor saved prompt or a Codex prompt — the recipe is portable; only the invocation differs.

The suite is green but production still breaks. You are testing mocks, not seams. Mocking the function under test means the assertion is circular. Re-run the integration prompt above and force real execution of everything except the true external boundary (the payment HTTP call, the DB driver). If your mock and the real API drift, add a contract test that hits a sandbox endpoint nightly.
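
A contract test can start very small: compare the shape your mock returns against a response recorded from the sandbox. The sketch below is one possible shape check, not a full contract-testing setup. `shapeDrift` is a hypothetical helper, it only compares top-level fields, and the sample payloads are made up.

```typescript
type Json = Record<string, unknown>;

// List every top-level field where the mock disagrees with a recorded real response.
function shapeDrift(mock: Json, real: Json): string[] {
  const problems: string[] = [];
  for (const key of Object.keys(mock)) {
    if (!(key in real)) {
      problems.push(`${key}: missing from real response`);
    } else if (typeof mock[key] !== typeof real[key]) {
      problems.push(`${key}: mock is ${typeof mock[key]}, real is ${typeof real[key]}`);
    }
  }
  return problems;
}

// The mock used in the integration test above.
const mockPayment: Json = { id: 'pay_123' };

// Made-up examples of what a nightly sandbox call might record.
const compatible: Json = { id: 'pay_9', status: 'succeeded' };
const renamedField: Json = { payment_id: 'pay_9' };
const changedType: Json = { id: 42 };
```

An empty result from `shapeDrift(mockPayment, recorded)` means the mock still matches. Extra fields on the real side are ignored on purpose, since the route only depends on what the mock provides.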

The agent “fixed” a failing test by weakening the assertion. This is the most common AI testing failure: asked to make tests pass, it edits the test instead of the code. Prevent it structurally — in Claude Code, omit Edit on the test path from --allowedTools during the implement step; in Cursor, restore to the pre-implement checkpoint and re-prompt with “without editing the test file.”
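
You can also enforce this outside any particular tool with a check that fails when the implement step touched a test file. A minimal sketch, assuming test files follow the `*.test.*` or `*.spec.*` naming convention: `editedTestFiles` is a hypothetical helper, and the list of changed paths is assumed to come from something like `git diff --name-only`.

```typescript
// Matches orders.test.ts, checkout.spec.tsx, util.test.mjs, and similar names.
const TEST_FILE = /\.(test|spec)\.[cm]?[jt]sx?$/;

// Return the test files among the paths changed during the implement step.
function editedTestFiles(changedPaths: string[]): string[] {
  return changedPaths.filter((path) => TEST_FILE.test(path));
}

// Example: the agent was asked to change only the route, but also edited the test.
const changed = ['src/routes/orders.ts', 'src/routes/orders.test.ts', 'README.md'];
const violations = editedTestFiles(changed);
```

Here `violations` contains only `src/routes/orders.test.ts`. A non-empty list is the signal to reject the change and re-prompt.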

E2E passes locally, flakes in CI. Almost always a race. Grep the generated test for waitForTimeout and delete every hit, then ask the agent to replace each with a web-first assertion or toHaveURL. CI is slower than your laptop, so any fixed wait that “works” locally is a time bomb.

Tests assert implementation, not behavior. If renaming a private method breaks twenty tests, the agent over-coupled them. Prompt: “These tests break on safe refactors. Rewrite them to assert observable behavior — inputs, outputs, status codes, persisted state — never private method names or call order unless ordering is the contract.”

Coverage is high but bugs still ship. Coverage measures lines executed, not assertions made. A test can run a line and assert nothing. Ask the agent to mutation test a critical module: “Introduce three plausible bugs in src/routes/orders.ts one at a time and tell me which tests catch each. Any bug that nothing catches reveals a missing assertion.”
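
To see why a surviving bug points at a missing assertion, here is a hand-rolled illustration of the idea, not a mutation-testing tool. It applies three plausible bugs to a validation guard modeled on the route's `!userId || !amountCents` check and runs two hypothetical "suites" against each. All names are made up for the example.

```typescript
type Body = { userId?: string; amountCents?: number };
type Validate = (body: Body) => boolean;

// Modeled on the route's guard: both fields are required.
const original: Validate = (b) => Boolean(b.userId) && Boolean(b.amountCents);

// Three plausible bugs, introduced one at a time.
const mutants: Array<{ name: string; fn: Validate }> = [
  { name: 'drops the userId check', fn: (b) => Boolean(b.amountCents) },
  { name: 'drops the amountCents check', fn: (b) => Boolean(b.userId) },
  { name: 'flips && to ||', fn: (b) => Boolean(b.userId) || Boolean(b.amountCents) },
];

// A check returns true when its assertion holds for the given implementation.
type Check = (validate: Validate) => boolean;

// Executes the guard (full line coverage) but asserts only the happy path.
const weakSuite: Check[] = [
  (v) => v({ userId: 'u1', amountCents: 4999 }) === true,
];

// Adds one assertion per missing field.
const strongSuite: Check[] = weakSuite.concat([
  (v) => v({ amountCents: 4999 }) === false,
  (v) => v({ userId: 'u1' }) === false,
]);

// A mutant survives when every check in the suite still passes against it.
function survivors(suite: Check[]): string[] {
  return mutants
    .filter((m) => suite.every((check) => check(m.fn)))
    .map((m) => m.name);
}
```

All three mutants survive `weakSuite` even though it executes every line of the guard, and none survive `strongSuite`. That gap is what the prompt asks the agent to find in your real tests.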