The checkout feature passes every unit test and still ships the wrong thing: the acceptance criteria lived in a Jira ticket nobody fed to the agent, so the AI optimized for “tests are green” instead of “the user can actually buy the thing.” Behavior-Driven Development closes that gap by making the executable spec — Gherkin scenarios — the source of truth the agent codes against.
AI assistants are unusually good BDD partners because the whole methodology is natural-language-first. You define the behavior in Given/When/Then, the agent generates the step definitions and implementation, and the BDD runner is the objective gate that tells both of you whether the behavior is real.
The BDD workflow with an AI assistant follows a four-step cycle that keeps the final product aligned with the intended user experience.
1. Define Behavior (Given/When/Then)
You start by writing a feature specification using the Gherkin syntax (Given, When, Then). This describes a user scenario in plain English. You can even collaborate with the AI to refine these user stories.
2. Generate Step Definitions
You provide the .feature file to the AI and ask it to generate the corresponding step definition files for your testing framework (e.g., Cucumber, Behat, SpecFlow). These will be initially empty or contain placeholder code.
3. Implement the Feature
With the behavioral contract in place, you instruct the AI to write the application code necessary to make the scenarios in the .feature file pass.
4. Run, Verify, and Refactor
Finally, the AI runs the BDD tests. It will analyze any failures and iterate on the application code until all scenarios pass. This creates a tight feedback loop that is driven directly by the user-facing behavior.
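To make the "analyze any failures" part of step 4 concrete, here is a minimal sketch of turning a test report into a short list of things to fix. It assumes the suite also emits a JSON report and that the report follows the shape of Cucumber's JSON formatter (features containing `elements`, each with `steps` that carry a `result.status`). Only the fields used here are modeled, and `ScenarioFailure`, `summarizeFailures`, and the sample data are hypothetical names and values made up for illustration.

```typescript
// Assumed subset of the report written by Cucumber's JSON formatter.
interface ReportStep {
  keyword: string; // e.g. "Then" (may carry a trailing space)
  name: string;
  result: { status: string; error_message?: string };
}
interface ReportScenario { name: string; steps: ReportStep[]; }
interface ReportFeature { name: string; elements: ReportScenario[]; }

// Hypothetical summary type for this sketch.
interface ScenarioFailure {
  feature: string;
  scenario: string;
  step: string;
  outcome: string;
  error: string;
}

// Return the first step in each scenario that is neither passed nor skipped.
// Steps after a failure are typically reported as "skipped", so that first
// step is the one worth handing back to the agent.
function summarizeFailures(report: ReportFeature[]): ScenarioFailure[] {
  const failures: ScenarioFailure[] = [];
  for (const feature of report) {
    for (const scenario of feature.elements) {
      for (const step of scenario.steps) {
        const outcome = step.result.status;
        if (outcome === "passed" || outcome === "skipped") continue;
        failures.push({
          feature: feature.name,
          scenario: scenario.name,
          step: step.keyword.trim() + " " + step.name,
          outcome,
          error: step.result.error_message ?? "",
        });
        break;
      }
    }
  }
  return failures;
}

// Made-up sample report, not real runner output.
const sampleReport: ReportFeature[] = [{
  name: "Stock-limited cart",
  elements: [
    {
      name: "Adding within the stock limit",
      steps: [
        { keyword: "Then ", name: 'my cart should contain 2 "Super Widget"', result: { status: "passed" } },
      ],
    },
    {
      name: "Blocked from exceeding the stock limit",
      steps: [
        { keyword: "Then ", name: 'I should see "Only 2 in stock"', result: { status: "failed", error_message: "illustrative error text" } },
        { keyword: "And ", name: 'my cart should contain 2 "Super Widget"', result: { status: "skipped" } },
      ],
    },
  ],
}];

const failures = summarizeFailures(sampleReport);
console.log(failures.map((f) => f.scenario + " -> " + f.step));
```

A summary like this keeps the feedback loop anchored to scenario and step names, which are the same words the spec uses.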
Let’s drive a real rule end to end: a stock-limited “Add to Cart” for a Next.js storefront, tested with Cucumber.js + Playwright against the running app. The interesting behavior isn’t “the button works” — it’s the boundary: you can’t add more units than are in stock. That edge case is exactly what gets lost when criteria live in a ticket instead of a spec.
Encode the rule, including the failure path, in features/cart.feature:
    Feature: Stock-limited cart
      As a shopper
      I want the cart to respect available stock
      So that I can't order items the warehouse doesn't have

      Background:
        Given the product "Super Widget" has 2 units in stock

      Scenario: Adding within the stock limit
        Given I am on the product page for "Super Widget"
        When I add 2 of "Super Widget" to my cart
        Then my cart should contain 2 "Super Widget"

      Scenario: Blocked from exceeding the stock limit
        Given I am on the product page for "Super Widget"
        When I add 2 of "Super Widget" to my cart
        And I try to add 1 more "Super Widget"
        Then I should see "Only 2 in stock"
        And my cart should contain 2 "Super Widget"

The second scenario is the whole point: it’s the behavior unit tests usually skip and the AI would otherwise never know to build.
Hand the agent the .feature file and ask for typed step definitions wired to Playwright. How you reference the file differs per tool — @-mentions are a Cursor idiom, not a Claude Code or Codex one.
Cursor: In agent mode, mention the spec with @features/cart.feature so it’s pinned in context, then paste the prompt. Review the generated cart.steps.ts in the diff view before accepting; Cursor’s inline diff makes it easy to catch a step that quietly stubs an assertion.
Claude Code: Reference the path in plain text (Read features/cart.feature); there’s no @-mention affordance in the terminal. Claude Code reads the file with its own tools and can scaffold the steps directory in one turn. Keep it in the same session so it remembers the World setup when you move to implementation.
Codex: Run it from the IDE or CLI surface on the worktree that has the feature file. Point it at the path in the prompt; Codex opens the file itself. On a clean worktree you can let it write the steps directory and the cucumber.js config in one task.
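Whichever tool generates them, step definitions all do the same job: bind a line of Gherkin to a function and pass the quoted strings and numbers in as typed arguments. The sketch below shows that binding with no dependencies so it runs on its own. It is a simplified stand-in, not the Cucumber.js API: `StepRegistry`, `define`, and `run` are hypothetical names, and only `{string}` and `{int}` placeholders are handled.

```typescript
// Simplified stand-in for a BDD runner's step matching.
// NOT the Cucumber.js API: StepRegistry, define, and run are hypothetical.

type ParamKind = "string" | "int";
type StepArg = string | number;
type StepHandler = (...args: StepArg[]) => void;

interface StepBinding {
  pattern: RegExp;
  kinds: ParamKind[];
  handler: StepHandler;
}

class StepRegistry {
  private bindings: StepBinding[] = [];

  // Bind an expression such as: my cart should contain {int} {string}
  define(expression: string, handler: StepHandler): void {
    const kinds: ParamKind[] = [];
    // Escape regex metacharacters, leaving the {placeholder} braces intact.
    const escaped = expression.replace(/[.*+?^$()|[\]\\]/g, "\\$&");
    const source = escaped.replace(/\{(string|int)\}/g, (_match: string, kind: string) => {
      kinds.push(kind as ParamKind);
      return kind === "int" ? "(-?\\d+)" : '"([^"]*)"';
    });
    this.bindings.push({ pattern: new RegExp("^" + source + "$"), kinds, handler });
  }

  // Strip the Gherkin keyword, find the first matching binding, and call it
  // with arguments converted to the placeholder's type.
  run(stepLine: string): void {
    const text = stepLine.trim().replace(/^(Given|When|Then|And|But)\s+/, "");
    for (const binding of this.bindings) {
      const match = binding.pattern.exec(text);
      if (match === null) continue;
      const args = match.slice(1).map((raw, i) => (binding.kinds[i] === "int" ? Number(raw) : raw));
      binding.handler(...args);
      return;
    }
    throw new Error("Undefined step: " + text);
  }
}

// Usage: the Background step from features/cart.feature.
const stockLevels: Record<string, number> = {};
const registry = new StepRegistry();
registry.define("the product {string} has {int} units in stock", (product, units) => {
  stockLevels[String(product)] = Number(units);
});
registry.run('Given the product "Super Widget" has 2 units in stock');
console.log(stockLevels);
```

The thrown "Undefined step" error is the useful part when reviewing generated steps: a scenario line with no binding should fail loudly rather than pass silently.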
With the executable contract in place, have the agent build the implementation — the cart store, the stock guard, and the inline error — until the scenarios can pass.
Naming the stack (Zustand store, typed result, inline message) keeps the agent from inventing a global event bus or a toast library you don’t use. The “do not modify the spec or weaken assertions” guardrail is the single most important line in BDD-with-AI — see “When this breaks” below.
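As a rough picture of what the "stock guard" and "typed result" could look like, here is a minimal sketch written as a pure function that a store action might delegate to. It is not the article's implementation: `Cart`, `AddResult`, and `addToCart` are hypothetical names, and the store wiring and UI are left out so the block stays self-contained.

```typescript
// Hypothetical names throughout; a sketch of the guard, not a store.
type Cart = Readonly<Record<string, number>>;
type StockLevels = Readonly<Record<string, number>>;

// Typed result: the caller must handle the rejection branch to read `message`.
type AddResult =
  | { ok: true; cart: Cart }
  | { ok: false; cart: Cart; message: string };

function addToCart(cart: Cart, stock: StockLevels, product: string, quantity: number): AddResult {
  if (!Number.isInteger(quantity) || quantity <= 0) {
    return { ok: false, cart, message: "Quantity must be a positive whole number" };
  }
  const available = stock[product] ?? 0;
  const inCart = cart[product] ?? 0;
  // The boundary from the spec: never hold more units than are in stock.
  if (inCart + quantity > available) {
    return { ok: false, cart, message: "Only " + available + " in stock" };
  }
  return { ok: true, cart: { ...cart, [product]: inCart + quantity } };
}

// Mirrors "Blocked from exceeding the stock limit".
const widgetStock: StockLevels = { "Super Widget": 2 };
const firstAdd = addToCart({}, widgetStock, "Super Widget", 2);
const secondAdd = addToCart(firstAdd.cart, widgetStock, "Super Widget", 1);
console.log(firstAdd.ok, secondAdd.ok, secondAdd.cart);
```

Returning the unchanged cart on rejection is a deliberate choice here: the scenario asserts both the message and that the cart still contains 2, so the failure path has to leave state alone.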
Now let the agent close the loop against real output. This is where the three tools genuinely diverge: who runs the command.
Cursor: Agent mode can run npm run test:bdd in the integrated terminal and read the output, but for a long red-to-green loop, the background agent is better: hand it the suite and let it iterate while you keep working, then review the final diff.
Claude Code: This is Claude Code’s strongest surface: it runs the suite with the Bash tool, parses the human-readable Cucumber report, and iterates in-session. Scope it with --allowedTools (e.g. Edit, Bash) so it can run tests and patch code but you stay in control of anything else.
Codex: Run the suite from the Codex CLI/IDE on the worktree, or kick it to Codex Cloud for a longer unattended red-to-green pass and review the resulting PR. Keep approvals at on-request (--ask-for-approval on-request) so it pauses before anything outside running the suite and editing app code.
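An unattended loop is also where the "do not modify the spec" guardrail is most likely to be tested, so it can help to check it mechanically instead of trusting the prompt. One possible approach, sketched below, is to snapshot the contents of the .feature files before the agent runs and compare afterward. Reading the files from disk is left to the caller; `SpecSnapshot`, `SpecDrift`, `diffSpecs`, and `specsUntouched` are hypothetical names for this sketch.

```typescript
// Hypothetical helper: path -> file contents, captured by the caller.
type SpecSnapshot = Readonly<Record<string, string>>;

interface SpecDrift {
  added: string[];
  removed: string[];
  modified: string[];
}

// Compare the spec files as they were before the agent ran with how they look now.
function diffSpecs(before: SpecSnapshot, after: SpecSnapshot): SpecDrift {
  const drift: SpecDrift = { added: [], removed: [], modified: [] };
  for (const path of Object.keys(before)) {
    if (!(path in after)) drift.removed.push(path);
    else if (after[path] !== before[path]) drift.modified.push(path);
  }
  for (const path of Object.keys(after)) {
    if (!(path in before)) drift.added.push(path);
  }
  return drift;
}

// Any change at all counts as drift; a green suite on a changed spec proves nothing.
function specsUntouched(drift: SpecDrift): boolean {
  return drift.added.length + drift.removed.length + drift.modified.length === 0;
}

// Usage with made-up contents: the agent "fixed" the test by editing the expectation.
const specsBefore: SpecSnapshot = {
  "features/cart.feature": 'Then I should see "Only 2 in stock"',
};
const specsAfter: SpecSnapshot = {
  "features/cart.feature": 'Then I should see "Only 3 in stock"',
};
const specDrift = diffSpecs(specsBefore, specsAfter);
console.log(specsUntouched(specDrift), specDrift.modified);
```

If the check reports drift, treat the run as failed and review the spec diff yourself, whatever the suite output says.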
BDD-with-AI fails in specific, recognizable ways. Watch for these.