Test-Driven Development with AI

You ask the agent for a discount-calculation helper. It hands back code that looks right, the demo passes, you merge. Two days later finance reports that orders with expired coupons are still getting the discount — the AI wrote the implementation and the tests together, so the tests only ever asserted the behavior it had already coded. Tests written after the code mostly prove the code does what the code does.

Test-Driven Development flips that. You write the tests first, confirm they fail for the right reason, then let the AI implement against a target it cannot fudge. The red-green-refactor loop is exactly the kind of tight, verifiable cycle agents are good at — and the “confirm red” step is what stops the AI from writing tests that pass against nothing.

  • A repeatable red-green-refactor loop you drive with the agent instead of typing every line
  • Copy-paste prompts that pin the AI to one phase at a time (tests, then code, then refactor)
  • The Cursor / Claude Code / Codex variants of the loop, including how to let the test suite run unattended
  • The failure modes that quietly break AI-driven TDD — and how to catch the AI editing tests to force green

The classic cycle is “Red, Green, Refactor.” With an agent, each phase becomes a separate, narrow instruction. The discipline that makes it work: never let the same turn write both the failing test and the code that satisfies it.

  1. Write the tests (Red). Specify the behavior and the edge cases, and explicitly forbid the implementation. You are describing a contract, not asking for a feature.

  2. Confirm failure (Red). Have the agent run the suite and show you the failures. This proves the tests target real, unimplemented behavior — not a typo’d import that “fails” for the wrong reason.

  3. Implement to pass (Green). Now give a single narrow instruction: make these tests pass, do not touch the test files. The target is unambiguous and machine-checkable.

  4. Iterate and refactor. The agent runs the suite, reads failures, and adjusts until green. Once green, ask for a refactor pass — the tests are now your safety net.

A worked example: a discount rule with real edge cases

Start with a deliberately small first pass so you can see the loop, then immediately move to a production-shaped case — a service method with error paths and a mocked dependency, not a pure function in isolation.

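Phase 1: tests only. Pin the model to writing tests and nothing else. Describe the behavior and the edge cases, and state explicitly that no implementation may be written in this turn.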
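As an illustration of what a tests-only turn might produce, here is a minimal sketch of a contract for a hypothetical `applyDiscount` function. The names (`Coupon`, `ApplyDiscount`, `contractCases`, `runContract`), the cents-based amounts, and the "expires at or after this instant" rule are assumptions made for this example, not taken from a real codebase. In a project using vitest each case would normally be an `it(...)` block; a plain table is used here so the sketch runs without dependencies.

```typescript
// Hypothetical contract for a discount helper. No implementation lives here:
// the implementation is passed in later, so this file can exist before the code does.

interface Coupon {
  percentOff: number; // e.g. 20 means 20% off
  expiresAt: Date;    // assumed rule: the coupon is invalid at or after this instant
}

// Signature the future implementation must satisfy (an assumption for this sketch).
type ApplyDiscount = (totalCents: number, coupon: Coupon | null, now: Date) => number;

interface ContractCase {
  name: string;
  totalCents: number;
  coupon: Coupon | null;
  now: Date;
  expectedCents: number;
}

const NOW = new Date("2024-06-01T00:00:00Z");
const TOMORROW = new Date("2024-06-02T00:00:00Z");
const YESTERDAY = new Date("2024-05-31T00:00:00Z");

const contractCases: ContractCase[] = [
  { name: "no coupon leaves the total unchanged",
    totalCents: 10000, coupon: null, now: NOW, expectedCents: 10000 },
  { name: "valid coupon takes the percentage off",
    totalCents: 10000, coupon: { percentOff: 20, expiresAt: TOMORROW }, now: NOW, expectedCents: 8000 },
  { name: "expired coupon gives no discount",
    totalCents: 10000, coupon: { percentOff: 20, expiresAt: YESTERDAY }, now: NOW, expectedCents: 10000 },
  { name: "coupon expiring exactly now counts as expired",
    totalCents: 10000, coupon: { percentOff: 20, expiresAt: NOW }, now: NOW, expectedCents: 10000 },
  { name: "100% coupon brings the total to zero",
    totalCents: 10000, coupon: { percentOff: 100, expiresAt: TOMORROW }, now: NOW, expectedCents: 0 },
];

// Runs every case against an implementation and returns the names of the failing cases.
// An implementation that throws counts as a failure for that case.
function runContract(impl: ApplyDiscount): string[] {
  const failures: string[] = [];
  for (const c of contractCases) {
    try {
      if (impl(c.totalCents, c.coupon, c.now) !== c.expectedCents) {
        failures.push(c.name);
      }
    } catch {
      failures.push(c.name);
    }
  }
  return failures;
}
```

Passing a stub that throws to `runContract` should report every case as failing, which is the red state you want before any implementation is authorized. The expired-coupon cases are the ones an implementation-first workflow tends to miss.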

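Phase 2: confirm red. Don’t skip this. A test file that imports a module that doesn’t exist yet should fail at module resolution, and that is the expected failure at this stage; a misspelled import path or a syntax error in the test itself is not. Any test that passes here asserts nothing about the behavior you asked for.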
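The check in this phase can be sketched as a small helper that runs each test and flags any that pass before the implementation exists. `NamedTest`, `RedReport`, `confirmRed`, and `notImplemented` are hypothetical names for this example; a real test runner reports the same information in its own output.

```typescript
// A test is a named function that throws when its expectation is not met.
interface NamedTest {
  name: string;
  run: () => void;
}

interface RedReport {
  failing: string[];             // expected during the red phase
  unexpectedlyPassing: string[]; // suspicious: these pass with no implementation
}

// Runs every test once and sorts it into "failing" or "unexpectedly passing".
function confirmRed(tests: NamedTest[]): RedReport {
  const report: RedReport = { failing: [], unexpectedlyPassing: [] };
  for (const t of tests) {
    try {
      t.run();
      report.unexpectedlyPassing.push(t.name);
    } catch {
      report.failing.push(t.name);
    }
  }
  return report;
}

// Stand-in for code that has not been written yet.
function notImplemented(): never {
  throw new Error("not implemented");
}

const report = confirmRed([
  { name: "expired coupon gives no discount", run: () => { notImplemented(); } },
  // This test never calls the code under test, so it passes against nothing.
  {
    name: "discount is a number",
    run: () => {
      const discount: number = 0;
      if (Number.isNaN(discount)) throw new Error("expected a number");
    },
  },
]);

console.log(report);
```

Here the first test lands in `failing` and the second in `unexpectedlyPassing`. The helper does not tell you why a test failed, so reading the failure message is still needed to rule out a typo’d import.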

Phase 3: implement to green. Only now do you authorize implementation, and you fence off the tests: instruct the agent to make the failing tests pass without modifying any test file.

The ban on touching test files is load-bearing. Without it, an agent that gets stuck will often “fix” the failing assertion rather than the code.
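One way to enforce the fence mechanically is to inspect the list of files changed during the green phase before accepting it. The sketch below assumes test files follow a `*.test.ts` / `*.test.tsx` naming convention and that you already have the changed paths (for example, from your version control tool); `findTestFileEdits` and the sample paths are hypothetical.

```typescript
// Matches paths such as src/discount.test.ts or ui/Cart.test.tsx (assumed convention).
const TEST_FILE_PATTERN = /\.test\.tsx?$/;

// Returns the changed paths that are test files; an empty result means the fence held.
function findTestFileEdits(changedPaths: string[]): string[] {
  return changedPaths.filter((path) => TEST_FILE_PATTERN.test(path));
}

// Hypothetical set of files the agent changed while trying to get to green.
const greenPhaseChanges = ["src/discount.ts", "src/discount.test.ts", "README.md"];
const violations = findTestFileEdits(greenPhaseChanges);

if (violations.length > 0) {
  console.log(`Reject the green-phase edit; test files were modified: ${violations.join(", ")}`);
}
```

A check like this only catches edits to files matching the pattern. Shared fixtures or test helpers stored under other names would need their own rule.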

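Phase 4: refactor under green. With the suite passing you have a contract that lets you safely restructure. Ask for the refactor as its own turn and have the agent re-run the suite afterward, so any change in behavior shows up as a red test.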

The phases are identical everywhere; what differs is how each tool runs the suite and how much of the test-fix-retest loop it will do unattended.

In Cursor, use Agent mode and let it run the tests itself. In Settings -> Cursor Settings -> Agents -> Auto-Run, set Auto-Run Mode to Run in Sandbox (on macOS/Linux) so commands execute automatically inside the sandbox without prompting; this is the unattended path Cursor recommends. Then add npx vitest (or npm test) to the Command Allowlist so the test runner starts immediately even when it runs outside the sandbox. Avoid Run Everything mode for an unattended loop: Cursor’s own security guidance says never to use it, because it skips all safety checks. Keep the phases as separate chat turns, so that Cursor’s checkpoints let you roll back to “red” if the green pass goes sideways. Watch the diff view: if a green-phase edit touches a *.test.ts file, reject it.