Test-Driven Development with AI

You ask the agent for a discount-calculation helper. It hands back code that looks right, the demo passes, you merge. Two days later finance reports that orders with expired coupons are still getting the discount — the AI wrote the implementation and the tests together, so the tests only ever asserted the behavior it had already coded. Tests written after the code mostly prove the code does what the code does.

Test-Driven Development flips that. You write the tests first, confirm they fail for the right reason, then let the AI implement against a target it cannot fudge. The red-green-refactor loop is exactly the kind of tight, verifiable cycle agents are good at — and the “confirm red” step is what stops the AI from writing tests that pass against nothing.

What You’ll Walk Away With

A repeatable red-green-refactor loop you drive with the agent instead of typing every line
Copy-paste prompts that pin the AI to one phase at a time (tests, then code, then refactor)
The Cursor / Claude Code / Codex variant of the loop, including how to let the test suite run unattended
The failure modes that quietly break AI-driven TDD — and how to catch the AI editing tests to force green

The Workflow

The classic cycle is “Red, Green, Refactor.” With an agent, each phase becomes a separate, narrow instruction. The discipline that makes it work: never let the same turn write both the failing test and the code that satisfies it.

Write the tests (Red). Specify the behavior and the edge cases, and explicitly forbid the implementation. You are describing a contract, not asking for a feature.
Confirm failure (Red). Have the agent run the suite and show you the failures. This proves the tests target real, unimplemented behavior — not a typo’d import that “fails” for the wrong reason.
Implement to pass (Green). Now give a single narrow instruction: make these tests pass, do not touch the test files. The target is unambiguous and machine-checkable.
Iterate and refactor. The agent runs the suite, reads failures, and adjusts until green. Once green, ask for a refactor pass — the tests are now your safety net.

A worked example: a discount rule with real edge cases

Start with a deliberately small first pass so you can see the loop, then immediately move to a production-shaped case — a service method with error paths and a mocked dependency, not a pure function in isolation.

Phase 1 — tests only. Pin the model to writing tests and nothing else:

Phase 2 — confirm red. Don’t skip this. A test file that imports a module which doesn’t exist yet should fail at resolution; a test that passes here is a test that asserts nothing.

Phase 3 — implement to green. Only now do you authorize implementation, and you fence off the tests:

That last sentence is load-bearing. Without it, an agent that gets stuck will often “fix” the failing assertion rather than the code.

Phase 4 — refactor under green. With the suite passing you have a contract that lets you safely restructure:

Driving the loop in each tool

The phases are identical everywhere; what differs is how each tool runs the suite and how much of the test-fix-retest loop it will do unattended.

Use Agent mode and let it run the tests itself. In Settings -> Cursor Settings -> Agents -> Auto-Run, set Auto-Run Mode to Run in Sandbox (on macOS/Linux) so commands execute automatically inside the sandbox without prompting — this is the unattended path Cursor recommends. Then add npx vitest (or npm test) to the Command Allowlist so the test runner runs immediately even outside the sandbox. Avoid the Run Everything mode for an unattended loop: Cursor’s own security guidance says never use it, because it skips all safety checks. Keep the phases as separate chat turns — Cursor’s checkpoints let you roll back to “red” if the green pass goes sideways. Watch the diff view: if a green-phase edit touches a *.test.ts file, reject it.

Drive it from the REPL, or script it headlessly. Interactively, paste each phase prompt in turn and let Claude run npm test. To make the loop self-correcting, add a PostToolUse hook (matcher Edit|Write) in .claude/settings.json that re-runs the suite after every edit so Claude sees failures immediately:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [{ "type": "command", "command": "npx vitest run --reporter=dot" }]
      }
    ]
  }
}

For a one-shot green phase in CI or a script, run it headless:

claude -p "Implement src/services/pricing.ts so the suite passes. Run 'npx vitest run' and iterate until green. Do not edit any *.test.ts file." --allowedTools "Read,Edit,Write,Bash"

In the TUI, configure the workspace sandbox and interactive approval policy explicitly. Routine edits and test commands allowed by the sandbox can iterate without a prompt:

codex --sandbox workspace-write -c approval_policy=on-request \
  "Implement src/services/pricing.ts to pass tests in src/services/pricing.test.ts. Run 'npx vitest run' and iterate until green. Don't touch the test files."

on-request prompts only when Codex asks to cross the sandbox boundary; a failed test is ordinary tool output, not an approval event. For parallel red/green experiments — say, trying two implementations against the same locked tests — create a separate git worktree for each run so the suites do not collide.

When This Breaks

The AI edits the tests to force green. The most common failure: stuck on a hard case, the agent quietly relaxes an assertion or deletes one so the suite passes. Always end your green-phase prompt with “do not modify any test file,” and diff the test files before you trust a green run. A passing suite that shrank is a red flag.

Tests pass before the code exists. If “confirm red” comes back green, your tests aren’t actually exercising the target — usually a mocked dependency returns a truthy default, or the import resolves to a stub. Never skip the confirm-red step.

Flaky async tests get “fixed” with sleeps. When timing-dependent tests fail intermittently, agents love to paper over it with setTimeout/sleep. Instead, prompt for deterministic control: “use fake timers (vi.useFakeTimers()) and advance them explicitly; do not add real delays.”

The suite is too slow to loop on. If the full suite takes minutes, the iterate phase crawls. Scope the runner to the file under test (npx vitest run src/services/pricing.test.ts) during the loop, then run the whole suite once at the end.

What’s Next

Behavior-Driven Development Push the same idea up to scenarios: drive the agent from Given/When/Then specs.

Error-Driven Development When the failing signal is a production error rather than a test, run the agent off that instead.

Unit Testing with AI Deeper patterns for mocking, fixtures, and coverage once the red-green loop is muscle memory.