Behavior-Driven Development Workflows

The checkout feature passes every unit test and still ships the wrong thing: the acceptance criteria lived in a Jira ticket nobody fed to the agent, so the AI optimized for “tests are green” instead of “the user can actually buy the thing.” Behavior-Driven Development closes that gap by making the executable spec — Gherkin scenarios — the source of truth the agent codes against.

AI assistants are unusually good BDD partners because the whole methodology is natural-language-first. You define the behavior in Given/When/Then, the agent generates the step definitions and implementation, and the BDD runner is the objective gate that tells both of you whether the behavior is real.

What you’ll walk away with

A repeatable loop that turns a .feature file into step definitions, implementation, and a passing suite.
Copy-paste prompts for generating Cucumber.js step definitions, implementing against the contract, and driving the suite to green.
The three-tool execution paths (Cursor, Claude Code, Codex) for handing the agent the .feature file and running the runner.
The failure modes that quietly defeat AI-driven BDD — the agent editing the spec to make tests pass, flaky async steps — and how to fence them off.

The AI-Powered BDD Cycle

The BDD workflow with an AI assistant follows a clear, collaborative cycle that ensures the final product matches the desired user experience.

1. Define Behavior (Given/When/Then)

You start by writing a feature specification using the Gherkin syntax (Given, When, Then). This describes a user scenario in plain English. You can even collaborate with the AI to refine these user stories.

2. Generate Step Definitions

You provide the .feature file to the AI and ask it to generate the corresponding step definition files for your testing framework (e.g., Cucumber, Behat, SpecFlow). These will be initially empty or contain placeholder code.

3. Implement the Feature

With the behavioral contract in place, you instruct the AI to write the application code necessary to make the scenarios in the .feature file pass.

4. Run, Verify, and Refactor

Finally, the AI runs the BDD tests. It will analyze any failures and iterate on the application code until all scenarios pass. This creates a tight feedback loop that is driven directly by the user-facing behavior.

A Practical BDD Workflow

Let’s drive a real rule end to end: a stock-limited “Add to Cart” for a Next.js storefront, tested with Cucumber.js + Playwright against the running app. The interesting behavior isn’t “the button works” — it’s the boundary: you can’t add more units than are in stock. That edge case is exactly what gets lost when criteria live in a ticket instead of a spec.

Step 1 — Write the Gherkin feature file

Encode the rule, including the failure path, in features/cart.feature:

Feature: Stock-limited cart
  As a shopper
  I want the cart to respect available stock
  So that I can't order items the warehouse doesn't have

  Background:
    Given the product "Super Widget" has 2 units in stock

  Scenario: Adding within the stock limit
    Given I am on the product page for "Super Widget"
    When I add 2 of "Super Widget" to my cart
    Then my cart should contain 2 "Super Widget"

  Scenario: Blocked from exceeding the stock limit
    Given I am on the product page for "Super Widget"
    When I add 2 of "Super Widget" to my cart
    And I try to add 1 more "Super Widget"
    Then I should see "Only 2 in stock"
    And my cart should contain 2 "Super Widget"

The second scenario is the whole point: it’s the behavior unit tests usually skip and the AI would otherwise never know to build.

Step 2 — Generate step definitions

Hand the agent the .feature file and ask for typed step definitions wired to Playwright. How you reference the file differs per tool — @-mentions are a Cursor idiom, not a Claude Code or Codex one.

Read features/cart.feature. Generate Cucumber.js step definitions in
features/steps/cart.steps.ts using @cucumber/cucumber and Playwright
(@playwright/test) to drive the running Next.js app at http://localhost:3000.

- One step function per unique Given/When/Then; share the Playwright `page`
  via the World, not module globals.
- Implement the "N units in stock" Background by seeding via the test API
  route POST /api/test/seed, not by mocking inside the step.
- Leave a `// TODO: assert` only where the selector is genuinely unknown;
  everywhere else, write the real Playwright assertion.
- Do NOT edit cart.feature.

In agent mode, mention the spec with @features/cart.feature so it’s pinned in context, then paste the prompt. Review the generated cart.steps.ts in the diff view before accepting — Cursor’s inline diff makes it easy to catch a step that quietly stubs an assertion.

Reference the path in plain text (Read features/cart.feature) — there’s no @-mention affordance in the terminal. Claude Code reads the file with its own tools and can scaffold the steps directory in one turn. Keep it in the same session so it remembers the World setup when you move to implementation.

Run in the IDE or CLI surface on the worktree that has the feature file. Point it at the path in the prompt; Codex opens the file itself. On a clean worktree you can let it write the steps directory and the cucumber.js config in one task.

Step 3 — Implement against the contract

With the executable contract in place, have the agent build the implementation — the cart store, the stock guard, and the inline error — until the scenarios can pass.

Using features/cart.feature as the contract, implement the stock-limited
cart in the Next.js app:
- a cart store (Zustand) with an addItem(id, qty) that rejects quantities
  beyond available stock and returns a typed result;
- the "Only N in stock" message rendered inline on the product page;
- the POST /api/test/seed route used by the Background.
Touch only app/, components/, lib/cart/, and the test seed route.
Do not modify cart.feature or weaken any assertion in cart.steps.ts.

Naming the stack (Zustand store, typed result, inline message) keeps the agent from inventing a global event bus or a toast library you don’t use. The “do not modify the spec or weaken assertions” guardrail is the single most important line in BDD-with-AI — see “When this breaks” below.

Step 4 — Run the suite and iterate

Now let the agent close the loop against real output. This is where the three tools genuinely diverge: who runs the command.

Run `npm run test:bdd` (cucumber-js). For each failing scenario, read the
Cucumber output, identify the failing step, and fix the application code or
selector — never the .feature file. Re-run until all scenarios pass. If a
step fails because the behavior is genuinely ambiguous, stop and ask me
instead of guessing.

Agent mode can run npm run test:bdd in the integrated terminal and read the output, but for a long red-to-green loop, the background agent is better: hand it the suite and let it iterate while you keep working, then review the final diff.

This is Claude Code’s strongest surface: it runs the suite with the Bash tool, parses the human-readable Cucumber report, and iterates in-session. Scope it with --allowedTools (e.g. Edit, Bash) so it can run tests and patch code but you stay in control of anything else.

Run the suite from Codex CLI/IDE in a git worktree you created, or kick it to Codex Cloud for a longer unattended red-to-green pass and review the resulting PR. For local work use --sandbox workspace-write -c approval_policy=on-request: routine suite runs and app-code edits are allowed by the sandbox, while Codex can ask before crossing its boundary.

When this breaks

BDD-with-AI fails in specific, recognizable ways. Watch for these.

The agent edits the spec to make tests pass. This is the cardinal sin — it turns the contract into a rubber stamp. Keep *.feature files out of the agent’s write scope (or in a protected path) and assert in review that the diff doesn’t touch them. The guardrail line in every prompt above exists for this.
It weakens assertions instead of fixing code. Subtler than editing the feature file: it changes should contain 2 to should contain at least 1, or comments out the “Only 2 in stock” check. Diff the step definitions, not just the app code.
Flaky async steps. Cucumber.js + Playwright steps that race the UI pass locally and fail in CI. Push the agent toward Playwright’s auto-waiting locators and web-first assertions (await expect(locator).toHaveText(...)) instead of fixed setTimeout/page.waitForTimeout waits.
Ambiguous Gherkin produces ambiguous code. “Then the cart should be correct” gives the agent nothing to bind to. If a step has no observable assertion, the spec is the bug — tighten the Gherkin before blaming the implementation.
Step-definition sprawl. The agent writes a near-duplicate step for every phrasing. Periodically ask it to consolidate steps and extract shared setup into the World or a Background.

What’s next

Test-Driven Development — the unit-level red/green loop that pairs with these behavior-level scenarios.
End-to-End Testing — go deeper on driving Playwright with an AI agent, including selectors and flake control.
PRD → Plan → Todo — turn a product spec into the acceptance criteria your .feature files encode.