Test-Driven Development with AI
You ask the agent for a discount-calculation helper. It hands back code that looks right, the demo passes, you merge. Two days later finance reports that orders with expired coupons are still getting the discount — the AI wrote the implementation and the tests together, so the tests only ever asserted the behavior it had already coded. Tests written after the code mostly prove the code does what the code does.
Test-Driven Development flips that. You write the tests first, confirm they fail for the right reason, then let the AI implement against a target it cannot fudge. The red-green-refactor loop is exactly the kind of tight, verifiable cycle agents are good at — and the “confirm red” step is what stops the AI from writing tests that pass against nothing.
What You’ll Walk Away With
- A repeatable red-green-refactor loop you drive with the agent instead of typing every line
- Copy-paste prompts that pin the AI to one phase at a time (tests, then code, then refactor)
- The Cursor / Claude Code / Codex variant of the loop, including how to let the test suite run unattended
- The failure modes that quietly break AI-driven TDD — and how to catch the AI editing tests to force green
The Workflow
The classic cycle is “Red, Green, Refactor.” With an agent, each phase becomes a separate, narrow instruction. The discipline that makes it work: never let the same turn write both the failing test and the code that satisfies it.
- Write the tests (Red). Specify the behavior and the edge cases, and explicitly forbid the implementation. You are describing a contract, not asking for a feature.
- Confirm failure (Red). Have the agent run the suite and show you the failures. This proves the tests target real, unimplemented behavior rather than a typo’d import that “fails” for the wrong reason.
- Implement to pass (Green). Now give a single narrow instruction: make these tests pass, do not touch the test files. The target is unambiguous and machine-checkable.
- Iterate and refactor. The agent runs the suite, reads failures, and adjusts until green. Once green, ask for a refactor pass; the tests are now your safety net.
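The one-phase-per-turn rule can be checked mechanically against the list of files a turn changed. The following is a minimal sketch, not part of any tool's API: the `Phase` names, the `forbiddenEdits` helper, and the assumption that test files end in `.test.ts` are all illustrative.

```typescript
// Hypothetical phase names; they mirror the steps above but belong to no tool.
type Phase = "red" | "green" | "refactor";

// Assumption: test files are the ones ending in ".test.ts".
const isTestFile = (path: string): boolean => path.endsWith(".test.ts");

// Returns the changed paths that the given phase is not allowed to touch.
// red: only test files may change. green and refactor: test files are locked.
function forbiddenEdits(phase: Phase, changedPaths: string[]): string[] {
  return changedPaths.filter((path) =>
    phase === "red" ? !isTestFile(path) : isTestFile(path)
  );
}
```

Feeding it the changed paths from a green-phase turn gives a non-empty result exactly when that turn edited a test file, which is the signal to reject the change.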
A worked example: a discount rule with real edge cases
Start with a deliberately small first pass so you can see the loop, then immediately move to a production-shaped case: a service method with error paths and a mocked dependency, not a pure function in isolation.
Phase 1 — tests only. Pin the model to writing tests and nothing else:
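As an illustration of what a tests-only first pass might pin down, here is a dependency-free sketch of a discount contract. Everything in it is an assumption made for the example: the `Coupon` shape, the `ApplyDiscount` signature, the specific cases, and the choice that an expired coupon yields no discount. In a real project these cases would more likely live in a vitest file; plain functions are used here only to keep the sketch self-contained.

```typescript
// Hypothetical types: the real pricing API is not shown in this guide.
interface Coupon {
  percentOff: number; // 20 means 20% off
  expiresAt: Date;
}

type ApplyDiscount = (subtotalCents: number, coupon: Coupon, now: Date) => number;

interface ContractCase {
  name: string;
  check: (applyDiscount: ApplyDiscount) => boolean;
}

const NOW = new Date("2024-06-15T00:00:00Z");
const validCoupon: Coupon = { percentOff: 20, expiresAt: new Date("2024-12-31T00:00:00Z") };
const expiredCoupon: Coupon = { percentOff: 20, expiresAt: new Date("2024-01-01T00:00:00Z") };

// True only when fn throws a RangeError, so an unrelated crash does not count.
function throwsRangeError(fn: () => unknown): boolean {
  try {
    fn();
    return false;
  } catch (e) {
    return e instanceof RangeError;
  }
}

// The contract: behavior and edge cases, with no implementation anywhere.
const discountContract: ContractCase[] = [
  { name: "applies a valid coupon", check: (f) => f(10000, validCoupon, NOW) === 8000 },
  { name: "ignores an expired coupon", check: (f) => f(10000, expiredCoupon, NOW) === 10000 },
  { name: "rejects a negative subtotal", check: (f) => throwsRangeError(() => f(-1, validCoupon, NOW)) },
];

// Runs every case and returns the names of the ones that fail.
// A case that throws unexpectedly is reported as failing.
function failingCases(impl: ApplyDiscount): string[] {
  return discountContract
    .filter((c) => {
      try {
        return !c.check(impl);
      } catch {
        return true;
      }
    })
    .map((c) => c.name);
}
```

Run against a stub that only throws, all three cases fail, which is the red state you want before any implementation exists. An implementation that ignores `expiresAt` would still fail the expired-coupon case, the bug from the opening scenario.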
Phase 2 — confirm red. Don’t skip this. A test file that imports a module which doesn’t exist yet should fail at resolution; a test that passes here is a test that asserts nothing.
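Checking that red is red for the right reason can also be sketched as a small function over the runner's results. The `TestResult` shape and the `confirmRed` helper below are hypothetical, and the error-message patterns are assumptions about how a runner words module-resolution failures; adjust them to what your runner actually prints.

```typescript
// Hypothetical result shape; real runners report richer data.
interface TestResult {
  name: string;
  passed: boolean;
  errorMessage?: string;
}

type RedVerdict = { ok: true } | { ok: false; problems: string[] };

// expectedMissingModule is the import path the new tests are supposed to be
// missing. A resolution error naming any other path suggests a typo.
function confirmRed(results: TestResult[], expectedMissingModule: string): RedVerdict {
  const problems: string[] = [];
  if (results.length === 0) problems.push("no tests ran");

  for (const r of results) {
    if (r.passed) {
      problems.push(`"${r.name}" already passes, so it asserts nothing new`);
      continue;
    }
    const msg = r.errorMessage ?? "";
    // Assumed wording for module-resolution failures.
    const isResolutionError = /cannot find module|failed to resolve import/i.test(msg);
    if (isResolutionError && !msg.includes(expectedMissingModule)) {
      problems.push(`"${r.name}" fails on an unexpected import: ${msg}`);
    }
  }

  return problems.length === 0 ? { ok: true } : { ok: false, problems };
}
```

A failed assertion or a resolution error for the expected module counts as a valid red; a passing test or a failure on some other import is flagged for a human to look at.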
Phase 3 — implement to green. Only now do you authorize implementation, and you fence off the tests:
The instruction not to touch the test files is load-bearing. Without it, an agent that gets stuck will often “fix” the failing assertion rather than the code.
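For the production-shaped case, a green-phase result might look roughly like the sketch below: a service with an injected lookup dependency and explicit error paths. All names (`CouponRepo`, `PricingService`, `UnknownCouponError`, the coupon codes) are hypothetical, the lookup is synchronous only to keep the example short, and treating an expired coupon as "no discount" is an assumed business rule.

```typescript
// Hypothetical domain types; the real src/services/pricing.ts is not shown here.
interface Coupon {
  code: string;
  percentOff: number;
  expiresAt: Date;
}

// The dependency the tests would replace with a mock.
interface CouponRepo {
  findByCode(code: string): Coupon | undefined;
}

class UnknownCouponError extends Error {}

class PricingService {
  private readonly coupons: CouponRepo;
  private readonly now: () => Date;

  constructor(coupons: CouponRepo, now: () => Date) {
    this.coupons = coupons;
    this.now = now; // injected clock, so expiry is testable
  }

  totalCents(subtotalCents: number, couponCode?: string): number {
    if (!Number.isInteger(subtotalCents) || subtotalCents < 0) {
      throw new RangeError("subtotal must be a non-negative integer of cents");
    }
    if (couponCode === undefined) return subtotalCents;

    const coupon = this.coupons.findByCode(couponCode);
    if (!coupon) throw new UnknownCouponError(`unknown coupon: ${couponCode}`);

    // Assumed rule: an expired coupon gives no discount.
    if (coupon.expiresAt.getTime() <= this.now().getTime()) return subtotalCents;

    return Math.round((subtotalCents * (100 - coupon.percentOff)) / 100);
  }
}

// Hand-rolled stand-in for the mocked dependency.
const couponTable: Record<string, Coupon> = {
  SAVE20: { code: "SAVE20", percentOff: 20, expiresAt: new Date("2024-12-31T00:00:00Z") },
  OLD20: { code: "OLD20", percentOff: 20, expiresAt: new Date("2024-01-01T00:00:00Z") },
};
const fakeRepo: CouponRepo = { findByCode: (code) => couponTable[code] };
```

Injecting both the repository and the clock is what lets the locked tests exercise the expired and unknown-coupon paths without touching a database or the system time.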
Phase 4 — refactor under green. With the suite passing you have a contract that lets you safely restructure:
Driving the loop in each tool
The phases are identical everywhere; what differs is how each tool runs the suite and how much of the test-fix-retest loop it will do unattended.
In Cursor, use Agent mode and let it run the tests itself. In Settings -> Cursor Settings -> Agents -> Auto-Run, set Auto-Run Mode to Run in Sandbox (on macOS/Linux) so commands execute automatically inside the sandbox without prompting; this is the unattended path Cursor recommends. Then add npx vitest (or npm test) to the Command Allowlist so the test runner runs immediately even outside the sandbox. Avoid the Run Everything mode for an unattended loop: Cursor’s own security guidance says never use it, because it skips all safety checks. Keep the phases as separate chat turns, since Cursor’s checkpoints let you roll back to “red” if the green pass goes sideways. Watch the diff view: if a green-phase edit touches a *.test.ts file, reject it.
In Claude Code, drive it from the REPL, or script it headlessly. Interactively, paste each phase prompt in turn and let Claude run npm test. To make the loop self-correcting, add a PostToolUse hook (matcher Edit|Write) in .claude/settings.json that re-runs the suite after every edit so Claude sees failures immediately:

    {
      "hooks": {
        "PostToolUse": [
          {
            "matcher": "Edit|Write",
            "hooks": [{ "type": "command", "command": "npx vitest run --reporter=dot" }]
          }
        ]
      }
    }

For a one-shot green phase in CI or a script, run it headless:

    claude -p "Implement src/services/pricing.ts so the suite passes. Run 'npx vitest run' and iterate until green. Do not edit any *.test.ts file." --allowedTools "Read,Edit,Write,Bash"

In Codex, in the TUI, run the green phase with --full-auto (which sets workspace-write sandbox and on-request approvals) so Codex can run the test command and iterate without prompting on each step:

    codex --full-auto "Implement src/services/pricing.ts to pass tests in src/services/pricing.test.ts. Run 'npx vitest run' and iterate until green. Don't touch the test files."

If you want a prompt only when something actually fails, use codex --ask-for-approval on-failure instead. For parallel red/green experiments, such as trying two different implementations against the same locked tests, run each in its own git worktree so the suites don’t collide.