Behavior-Driven Development Workflows

The checkout feature passes every unit test and still ships the wrong thing: the acceptance criteria lived in a Jira ticket nobody fed to the agent, so the AI optimized for “tests are green” instead of “the user can actually buy the thing.” Behavior-Driven Development closes that gap by making the executable spec — Gherkin scenarios — the source of truth the agent codes against.

AI assistants are unusually good BDD partners because the whole methodology is natural-language-first. You define the behavior in Given/When/Then, the agent generates the step definitions and implementation, and the BDD runner is the objective gate that tells both of you whether the behavior is real.

  • A repeatable loop that turns a .feature file into step definitions, implementation, and a passing suite.
  • Copy-paste prompts for generating Cucumber.js step definitions, implementing against the contract, and driving the suite to green.
  • The execution paths in three tools (Cursor, Claude Code, Codex) for handing the agent the .feature file and running the runner.
  • The failure modes that quietly defeat AI-driven BDD — the agent editing the spec to make tests pass, flaky async steps — and how to fence them off.

The BDD workflow with an AI assistant is a four-step cycle, and every step is checked against the user-facing behavior written in the spec.

1. Define Behavior (Given/When/Then)

You start by writing a feature specification using the Gherkin syntax (Given, When, Then). This describes a user scenario in plain English. You can even collaborate with the AI to refine these user stories.

2. Generate Step Definitions

You provide the .feature file to the AI and ask it to generate the corresponding step definition files for your testing framework (e.g., Cucumber, Behat, SpecFlow). At this stage the step bodies are empty or hold placeholder code.

3. Implement the Feature

With the behavioral contract in place, you instruct the AI to write the application code necessary to make the scenarios in the .feature file pass.

4. Run, Verify, and Refactor

Finally, the AI runs the BDD tests. It will analyze any failures and iterate on the application code until all scenarios pass. This creates a tight feedback loop that is driven directly by the user-facing behavior.


Let’s drive a real rule end to end: a stock-limited “Add to Cart” for a Next.js storefront, tested with Cucumber.js + Playwright against the running app. The interesting behavior isn’t “the button works” — it’s the boundary: you can’t add more units than are in stock. That edge case is exactly what gets lost when criteria live in a ticket instead of a spec.

Encode the rule, including the failure path, in features/cart.feature:

features/cart.feature
Feature: Stock-limited cart
  As a shopper
  I want the cart to respect available stock
  So that I can't order items the warehouse doesn't have

  Background:
    Given the product "Super Widget" has 2 units in stock

  Scenario: Adding within the stock limit
    Given I am on the product page for "Super Widget"
    When I add 2 of "Super Widget" to my cart
    Then my cart should contain 2 "Super Widget"

  Scenario: Blocked from exceeding the stock limit
    Given I am on the product page for "Super Widget"
    When I add 2 of "Super Widget" to my cart
    And I try to add 1 more "Super Widget"
    Then I should see "Only 2 in stock"
    And my cart should contain 2 "Super Widget"

The second scenario is the whole point: it’s the behavior unit tests usually skip and the AI would otherwise never know to build.

Hand the agent the .feature file and ask for typed step definitions wired to Playwright. How you reference the file differs per tool — @-mentions are a Cursor idiom, not a Claude Code or Codex one.

In Cursor’s agent mode, mention the spec with @features/cart.feature so it’s pinned in context, then paste the prompt. Review the generated cart.steps.ts in the diff view before accepting: Cursor’s inline diff makes it easy to catch a step that quietly stubs an assertion.
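When reviewing generated step definitions, it helps to know how step text becomes typed arguments. The sketch below is a dependency-free illustration, not Cucumber.js itself: in a real cart.steps.ts the runner does this binding through Cucumber Expressions such as {int} and {string}. The names `patterns`, `BoundStep`, and `bindStep` are hypothetical, and the hand-written regexes are assumed equivalents of the expressions beside them.

```typescript
// Hypothetical illustration of step binding; real Cucumber.js handles this
// internally when you register a step with a Cucumber Expression.

interface StepPattern {
  expression: string; // Cucumber-Expression-style text, shown for readability
  regex: RegExp;      // hand-written equivalent used only by this sketch
}

interface BoundStep {
  expression: string;
  count: number;   // value captured for {int}
  product: string; // value captured for {string}
}

// Patterns for the parameterised cart steps in features/cart.feature.
const patterns: StepPattern[] = [
  {
    expression: 'I add {int} of {string} to my cart',
    regex: /^I add (\d+) of "([^"]*)" to my cart$/,
  },
  {
    expression: 'I try to add {int} more {string}',
    regex: /^I try to add (\d+) more "([^"]*)"$/,
  },
  {
    expression: 'my cart should contain {int} {string}',
    regex: /^my cart should contain (\d+) "([^"]*)"$/,
  },
];

// Returns the first pattern that matches the step text, with typed arguments,
// or undefined when no pattern matches.
function bindStep(text: string): BoundStep | undefined {
  for (const { expression, regex } of patterns) {
    const match = regex.exec(text.trim());
    if (match) {
      const [, count = '0', product = ''] = match;
      return { expression, count: Number(count), product };
    }
  }
  return undefined;
}
```

A step that matches no pattern comes back as `undefined` here. The thing to look for in the agent's diff is the opposite problem: a step that matches but whose body never asserts anything.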

With the executable contract in place, have the agent build the implementation — the cart store, the stock guard, and the inline error — until the scenarios can pass.

Naming the stack (Zustand store, typed result, inline message) keeps the agent from inventing a global event bus or a toast library you don’t use. The “do not modify the spec or weaken assertions” guardrail is the single most important line in BDD-with-AI; the failure modes at the end of this page show why.
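To make "stock guard with a typed result" concrete, here is a minimal sketch of the rule the two scenarios describe, written as a plain function so it runs without Zustand or Next.js. The names `CartState`, `AddResult`, and `addToCart` are hypothetical, and in a real store this logic would presumably sit inside the store's update function rather than mutate an object directly.

```typescript
// Hypothetical shape of the cart state; not taken from any real codebase.
interface CartState {
  stock: Record<string, number>; // units available per product name
  items: Record<string, number>; // units currently in the cart
}

// Typed result: the UI can render `message` inline when `ok` is false.
type AddResult =
  | { ok: true; quantity: number }
  | { ok: false; quantity: number; message: string };

function addToCart(state: CartState, product: string, quantity: number): AddResult {
  const available = state.stock[product] ?? 0;
  const inCart = state.items[product] ?? 0;

  // Assumption: invalid quantities are rejected. The feature file does not
  // cover this case, so the message text here is illustrative only.
  if (!Number.isInteger(quantity) || quantity <= 0) {
    return { ok: false, quantity: inCart, message: 'Quantity must be a positive whole number' };
  }

  // The boundary from the spec: never hold more units than are in stock.
  if (inCart + quantity > available) {
    return { ok: false, quantity: inCart, message: `Only ${available} in stock` };
  }

  state.items[product] = inCart + quantity; // mutates for brevity
  return { ok: true, quantity: inCart + quantity };
}
```

Replaying the second scenario against this function: adding 2 succeeds, adding 1 more returns the "Only 2 in stock" message, and the cart quantity stays at 2.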

Now let the agent close the loop against real output. This is where the three tools genuinely diverge: who runs the command.

In Cursor, agent mode can run npm run test:bdd in the integrated terminal and read the output. For a long red-to-green loop the background agent is the better fit: hand it the suite, let it iterate while you keep working, then review the final diff.
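Whichever tool runs the command, the loop needs an objective "are we green yet?" check on the runner output. The sketch below assumes the runner ends with a summary line shaped like `2 scenarios (1 failed, 1 passed)`; confirm that against what your own `npm run test:bdd` prints before relying on it. `parseScenarioSummary` and `isGreen` are hypothetical helper names.

```typescript
interface SuiteSummary {
  total: number;
  passed: number;
  failed: number;
}

// Finds a line like "2 scenarios (1 failed, 1 passed)" in the runner output.
// Returns undefined when no such line exists, e.g. the run crashed early.
function parseScenarioSummary(output: string): SuiteSummary | undefined {
  const line = /^(\d+) scenarios? \(([^)]*)\)$/m.exec(output);
  if (!line) return undefined;

  const [, total = '0', breakdown = ''] = line;
  const counts: Record<string, number> = {};
  for (const part of breakdown.split(',')) {
    const m = /^(\d+) (\w+)$/.exec(part.trim());
    if (m) {
      const [, n = '0', status = ''] = m;
      counts[status] = Number(n);
    }
  }

  return {
    total: Number(total),
    passed: counts['passed'] ?? 0,
    failed: counts['failed'] ?? 0,
  };
}

// Green means every scenario passed. A missing summary is never green.
function isGreen(summary: SuiteSummary | undefined): boolean {
  return summary !== undefined && summary.total > 0 && summary.passed === summary.total;
}
```

Treating a missing summary as "not green" is deliberate: a run that crashed before reporting should not be mistaken for a pass.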

BDD-with-AI fails in specific, recognizable ways. Watch for these.
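One failure mode named earlier is the agent editing the spec so the tests pass. A cheap fence is to check the list of changed files before accepting a diff. The sketch below assumes the paths come from something like `git diff --name-only` and that specs live under `features/`; `specFilesTouched` is a hypothetical helper, not part of any tool.

```typescript
// Returns the .feature files that appear in a list of changed paths.
// An empty result means the agent left the spec alone.
function specFilesTouched(changedPaths: string[], specDir: string = 'features/'): string[] {
  return changedPaths
    .map((p) => p.trim().replace(/\\/g, '/')) // normalise Windows separators
    .filter((p) => p.startsWith(specDir) && p.endsWith('.feature'));
}
```

If the result is non-empty, review those changes by hand. A legitimate spec change should come from you, not from a loop whose goal is a passing suite.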

  • Test-Driven Development — the unit-level red/green loop that pairs with these behavior-level scenarios.
  • End-to-End Testing — go deeper on driving Playwright with an AI agent, including selectors and flake control.
  • PRD → Plan → Todo — turn a product spec into the acceptance criteria your .feature files encode.