E2E policy — required + agent runs browser tests before merge

Scorecard question: What’s the E2E requirement for UI changes? Max‑score answer (3 pts): E2E required + agent runs browser tests itself before merge.

Why this matters in 2026 (AI UI breakage defect class)

Unit tests verify code. Type checks verify code. Linting verifies code. None of them verify the feature. Only a real browser, driving a real page, with a real user flow, can tell you whether the thing the customer touches actually still works. In 2026 that gap is no longer a nuance — it is the dominant defect class on AI‑authored pull requests.

The pattern is well documented across teams logging post‑merge regressions through a year of agent‑heavy development: the diff looks plausible, the unit tests pass, the type checker is green, the LGTM lands — and then a user clicks the button and nothing happens, because the agent renamed a data-testid, dropped an onClick while refactoring a wrapper, swapped an <a> for a <button> with no href, or moved a modal behind a stacking context that now sits under the cookie banner. The code “looks right”. The UI is broken. This is the AI UI breakage defect class, and it is the single biggest reason teams that adopted Cursor, Claude Code, or Codex in 2024–2025 saw their post‑merge incident rate go up before it went down.

The reason is structural. An LLM that writes code can read code. It cannot, by default, see the rendered page. It does not know that the new flexbox container made the submit button overlap the privacy notice on mobile, that the new state machine left the dialog stuck open on the second click, or that the API response shape it expected matches the staging fixture but not the actual /me endpoint in production. The only check that knows any of that is one where a browser actually renders the page and another agent (or a test harness, or both) verifies the rendered result.

By 2026 the bar had moved twice past “we have E2E tests, but only QA runs them, nightly” (the 2024 1-point answer): first to “E2E runs on every PR in CI” (2 points), then to “the agent that wrote the PR runs the browser checks itself before asking for review” (3 points). Anything less and you pay back the speed of AI authorship in cleanup, incident response, and lost user trust. If you scored 0 or 1, the tell is in your incidents: front-end regressions disproportionately come from PRs that touched UI but had no E2E on the affected flow, and the team reaction is “we’ll add a test” rather than “the agent should have caught that before opening the PR”.

What “max score” actually looks like

Every UI‑touching PR runs E2E in CI as a required check. GitHub branch protection lists at least one Playwright (or Cypress) suite as required. PRs cannot merge if it’s red. No “we’ll re‑run after merge” override.
The agent runs browser checks itself before opening the PR. Inside the IDE/CLI session, the agent drives a real browser via Playwright MCP, chrome‑devtools MCP, or browser‑use, asserts the flows, captures screenshots, and only then commits. Failures stay in the agent’s loop, not your review queue.
Tests assert behaviour and accessibility, not selectors. Querying with Playwright’s getByRole/getByLabel (or the Cypress equivalent) means the test still passes when the agent renames a class, and fails when the agent removes the accessible name entirely. Tests live close to intent (“user can complete checkout”), not markup.
Visual regression covers surfaces where pixels matter. Chromatic, Percy, Argos, or Playwright’s toHaveScreenshot is wired into the same PR check, with reviewers approving visual diffs explicitly. The agent knows which surfaces are visual‑critical (pricing, checkout, dashboard chrome) and re‑shoots baselines on intentional changes.
Flake budget enforced and visible. Flaky tests have an owner, a deadline, and retries bounded to one or two attempts. A spec failing intermittently more than twice in a week is auto‑quarantined with a ticket — it does not silently rot in the suite.
The agent has a browser‑test skill. tests/e2e/CLAUDE.md (or equivalent for Cursor and Codex) tells the agent which suite to run, headed vs headless, how to authenticate the test user, which surfaces need visual diffs. The agent doesn’t reinvent wiring each session.
Pre‑merge logs are part of the PR. The agent attaches its browser run output — screenshots, traces, network logs — to the PR body, so reviewers don’t re‑run anything to know what happened.
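The flake budget above can be made mechanical. The following is a minimal TypeScript sketch rather than a prescribed implementation: it assumes a hypothetical FlakeEvent record (one per intermittent failure, however your CI reports them) and applies the "more than twice in a week" rule to decide which specs to quarantine. The names FlakeEvent and specsToQuarantine are illustrative, not part of any tool.

```typescript
// Hypothetical record of one intermittent failure of a spec.
interface FlakeEvent {
  spec: string;   // spec file path, e.g. "tests/e2e/signup.spec.ts"
  failedAt: Date; // when the intermittent failure was observed
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

// Returns the specs that failed intermittently more than `budget` times
// in the 7 days ending at `now`, sorted by path.
function specsToQuarantine(
  events: FlakeEvent[],
  now: Date,
  budget: number = 2,
): string[] {
  const counts: Record<string, number> = {};
  for (const e of events) {
    const ageMs = now.getTime() - e.failedAt.getTime();
    // Only count failures inside the rolling one-week window.
    if (ageMs >= 0 && ageMs < WEEK_MS) {
      counts[e.spec] = (counts[e.spec] ?? 0) + 1;
    }
  }
  return Object.keys(counts)
    .filter((spec) => counts[spec] > budget)
    .sort();
}
```

A scheduled job could call this with the week's failure log and open a ticket for each returned spec; how the ticket is filed and how the spec is marked as quarantined is left to your own tooling.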

Concretely: a developer asks the agent to change the signup form copy. The agent edits the component, runs npm test, runs npx playwright test signup.spec.ts against a local dev server, watches the suite go green, captures a screenshot for the PR body, and only then opens the PR. The required check re-runs the suite in CI against a preview deploy, and Chromatic flags the copy diff for one-click visual approval. Agent-side time: roughly 90 seconds. Human review time: roughly 60 seconds.
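For illustration, a signup.spec.ts along these lines would fit the flow above. It is a sketch under assumptions: the /signup route, the field labels, the button name, the welcome heading, and the screenshot path are all hypothetical and must match your own UI, and baseURL is assumed to point at the local dev server in playwright.config.ts. It needs the Playwright test runner and a running app, so it is not standalone.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical flow: route, labels, and heading text are placeholders.
test('user can sign up', async ({ page }) => {
  // Relative URL: assumes `baseURL` is configured for the local dev server.
  await page.goto('/signup');

  // Query by label and role, so a class or wrapper rename does not break
  // the test, while a missing accessible name does.
  await page.getByLabel('Email', { exact: true }).fill('new.user@example.com');
  await page.getByLabel('Password', { exact: true }).fill('a-long-test-passphrase');
  await page.getByRole('button', { name: 'Sign up' }).click();

  // Assert the user-visible outcome, not the markup.
  await expect(page.getByRole('heading', { name: 'Welcome' })).toBeVisible();

  // Evidence for the PR body (output path is an assumption).
  await page.screenshot({ path: 'test-results/signup-success.png', fullPage: true });
});
```

Because the assertions hang off accessible names, a copy change to the button or heading fails this spec on purpose: the agent has to update the expected text deliberately rather than have the change slip through.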

Current landscape (web‑search‑verified)

Playwright MCP + agent integration

Microsoft shipped Playwright MCP in early 2025 — an MCP server that hands a live Playwright browser session to any MCP‑capable agent (Claude Code, Cursor, Codex CLI, GitHub Copilot Workspace). Instead of running pre‑written scripts, the agent receives an intent — “complete the signup flow and verify the welcome modal” — inspects the page’s accessibility tree, and dispatches Playwright actions (click, fill, expect) against ARIA roles rather than CSS selectors. By early 2026 this became the default way teams bootstrap 70–80% of new E2E coverage. The same MCP is used in two modes: at authorship (agent writes a test for the feature it just built) and at verification (agent runs the existing suite before opening the PR). The killer property is that intent stays stable across UI rewrites — when the agent reshuffles the markup, the accessibility tree still anchors the test.

chrome-devtools MCP

Google’s chrome-devtools MCP exposes the Chrome DevTools Protocol (performance traces, Lighthouse audits, network inspection, console errors, layout shift detection) to the same agents. Where Playwright MCP is for driving the browser, chrome-devtools MCP is for observing it. Together they answer questions pure functional E2E cannot: Did this PR add a layout shift? Did it regress LCP on the pricing page? Does the console now spew CSP violations? In 2026 the canonical setup is both MCPs in parallel, with the agent invoking chrome-devtools after the Playwright run on any PR touching a critical page.

Cypress + GitHub Actions

Cypress remained dominant for teams that adopted E2E pre-2024 and built sizeable suites. In 2026 Cypress shipped cypress.io/ai and a GitHub Actions integration mirroring the Playwright pattern: agents can generate specs, run them on every PR, parallelize across containers, and post results back as a GitHub Check. The decision is straightforward: keep Cypress if you already have an investment in it; choose Playwright MCP for tighter agent integration if you are starting fresh.

Visual regression tools (Chromatic, Percy, Argos)

Chromatic (Storybook‑aligned), Percy (BrowserStack), and Argos (open‑source‑friendly) all integrate as GitHub Checks that flip red when a UI diff exceeds threshold. By 2026 Argos shipped first‑class Playwright support and an agent‑driven workflow where the agent itself can mark a diff as expected (with justification) versus regression. Common pattern: visual regression runs on every PR as “advisory” until a reviewer (or senior agent) approves or rejects. Critical surfaces (pricing, checkout, signup) promote to required.

Including E2E in the PR gate

Branch protection on your default branch lists the E2E job as a required status check. PRs cannot merge if it’s red; --admin overrides are reserved for emergencies and audited. The combination of “agent runs the suite locally before pushing” + “CI re‑runs the suite on the preview deploy” + “required check on merge” closes the loop — the agent doesn’t get to assume CI will catch its mistakes, and CI doesn’t get bypassed when the agent claims it tested locally.

Step‑by‑step: rolling out agent‑driven E2E

1. Pick one test runner and one user flow to start. Don’t try to convert the whole suite. Pick the single most important flow: signup → first-run for SaaS, browse → checkout for e-commerce. Pick one runner: Playwright if starting fresh, Cypress if you have an investment. The cost of inconsistency in 2026 is higher than the cost of picking the slightly wrong tool.
2. Write the first three tests with the agent, in pair-coding mode. Open Playwright MCP (or Cypress’s AI mode) and ask the agent to generate a spec for the chosen flow. Sit next to it for the first three: the agent will overfit to selectors the first time, miss flaky waits the second, and ignore accessibility queries the third. By the fourth, it has internalized your conventions.
3. Anchor on roles and accessible names, not CSS selectors. The single biggest determinant of suite survival under AI authorship is whether tests query getByRole('button', { name: 'Sign up' }) instead of [data-testid="signup-btn"] or .btn.btn-primary.mt-4. Add a lint rule that rejects raw CSS selectors in spec files. Bonus: this drives accessibility hygiene, because a button without an accessible name fails the test and fails screen readers.
4. Wire the suite into CI as a required check. Add a workflow that runs the suite on every PR against a preview deploy or Docker-composed local stack. Parallelize across 4–8 shards to keep total wall-clock under 5 minutes for a 200-spec suite. In GitHub branch protection, mark the job required before the suite is comprehensive: a small required suite that always runs beats a large suite that’s “advisory”.
5. Add a tests/e2e/CLAUDE.md (and equivalents) so the agent knows the wiring. Document: how to run the suite (npm run test:e2e), how to start the dev server, how to authenticate the test user, which env vars are needed, which surfaces require visual diffs, what the flake policy is. Codex reads AGENTS.md, Cursor reads .cursor/rules/, Claude Code reads CLAUDE.md. A standing instruction along these lines works:

    Before opening a PR that touches src/components/ or src/pages/, run npm run test:e2e against a local dev server. Anchor every assertion on getByRole/getByLabel, never on CSS selectors or data-testid. If a spec fails, fix the cause; do not skip, retry-loop, or rewrite the selector to make it pass. When the suite is green, attach the run output (screenshots + trace) to the PR body. Pricing, signup, and dashboard chrome are visual-critical: re-shoot baselines only on intentional changes.
6. Install Playwright MCP and chrome-devtools MCP for the agent. Both ship as local stdio servers you run via npx. For Claude Code: claude mcp add playwright -- npx @playwright/mcp@latest and claude mcp add chrome-devtools -- npx chrome-devtools-mcp@latest. (Cursor and Codex point at the same npx commands in their MCP config.) Verify the agent can list MCP tools and drive a sample page. Add an example session to onboarding showing the agent open the dev server, run a flow, capture a screenshot, and assert visible text.
7. Add visual regression on the top three pages. Wire Chromatic / Percy / Argos to pricing, signup, and dashboard chrome. Start with permissive thresholds and advisory checks. After two weeks, tighten and promote to required. Tell the agent which pages are visual-critical so it re-shoots baselines on intentional changes (and not on incidental ones).
8. Stand up a flake registry and an owner. Create tests/e2e/QUARANTINE.md listing every quarantined spec, its failure mode, the assigned engineer, and the deadline. The suite owner is a single named human, rotating quarterly. Anything in quarantine more than two weeks gets deleted or fixed. There is no third option.
9. Measure agent-side run rate and pre-merge catch rate. Two metrics: (1) the fraction of UI-touching PRs with evidence that the agent ran the suite before pushing, and (2) where each E2E failure is first caught: the agent’s local run, CI, or production. Goal: >80% agent-side evidence, >70% of failures caught pre-push, <5% slipping to production. Surface both on the AI metrics panel (Q22).
10. Iterate the agent skill quarterly. Review failure modes and update tests/e2e/CLAUDE.md with lessons. New flaky pattern? Add the wait helper. New visual-critical surface? Add to the list. The skill compounds: every lesson you bank means the next session pays a lower tax.
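The lint rule suggested above (reject raw CSS selectors in spec files) could take many forms. One possible shape, sketched as a standalone text check rather than a plugin for any particular linter, is below. The rule names, the Violation type, and findRawSelectors are hypothetical, and the patterns are deliberately simple heuristics for Playwright-style specs; a production rule would more likely work on the syntax tree.

```typescript
// Heuristic patterns for selector styles the suite should avoid.
const RAW_SELECTOR_RULES: { name: string; pattern: RegExp }[] = [
  // page.locator('.btn'), page.locator('#id'), page.locator('[data-testid=...]')
  { name: 'locator-css', pattern: /\.locator\(\s*['"`][.#\[]/ },
  // page.getByTestId('signup-btn')
  { name: 'get-by-test-id', pattern: /\.getByTestId\(/ },
];

interface Violation {
  line: number; // 1-based line number in the spec source
  rule: string; // name of the rule that matched
  text: string; // offending line, trimmed
}

// Scans spec source text and reports at most one violation per line.
function findRawSelectors(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split('\n').forEach((text, index) => {
    for (const { name, pattern } of RAW_SELECTOR_RULES) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, rule: name, text: text.trim() });
        break;
      }
    }
  });
  return violations;
}
```

Run over each file in tests/e2e/ in CI, a non-empty result would fail the check, which keeps the agent from "fixing" a red spec by swapping a role query for a selector.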

Common pitfalls

Flaky tests treated as the test’s fault. A spec that times out intermittently is almost never “the test is wrong”; it’s an unhandled race, a missing wait‑for, an animation that never settles, or a backend that returns 200 with an empty body. Quarantine is a holding pattern, not a fix. Investigate within two weeks or delete the spec — letting flake fester teaches the team to ignore red CI, which destroys the gate.
Skipping E2E for “small UI changes”. The most regression‑prone PRs in the AI era are the ones the agent rates “small”: a copy tweak that broke a label query, a CSS refactor that moved a button behind another element, a “tiny” prop rename the type checker missed because the consumer used any. The gate is unconditional on UI paths. The agent doesn’t vote on whether a UI change is small enough to skip.
No headless config / works‑on‑my‑machine. Agent passes locally because Chrome is open at 2560×1440 with the user’s cookies; CI fails because the headless instance is 1280×720 with no auth. Pin the viewport, run headed and headless locally, share fixtures between modes, default the agent to headless.
Selector‑based tests that the agent’s first refactor will break. If your suite queries .css-1abc2de or [data-testid="x"] and the agent rewrites a wrapper component, half the suite goes red overnight — and the agent will “fix” the tests by updating selectors instead of preserving the user‑visible contract. Convert to roles and accessible names before you turn the agent loose.
One enormous test that does everything, or no retry budget. Monolithic specs (login → onboard → create → invite → upload → billing) compound flake: a 1% flake rate at each of six steps gives the spec a failure rate of about 6% (1 − 0.99^6 ≈ 5.9%). Split into one spec per user goal. Pin retries to one in CI and zero locally: five retries hide real flakiness, while zero retries in CI punish legitimate transient failures.
The agent runs nothing locally and just opens the PR. Without explicit instruction, agents skip costly tooling. Fix it with the skill (step 5) plus a pre‑push hook that runs npm run test:e2e:fast (a smoke subset) on any change under src/components/ or src/pages/.
Visual regression set to “advisory forever”. A check that exists but never blocks teaches the team to ignore it. Once your three critical surfaces are stable, promote the check to required. Reviewers will whinge for a week and then settle.
CI parallelization that breaks test isolation. Running 8 shards against one shared database means tests stomp on each other’s state. Ship per-shard databases (Docker Compose, ephemeral SQLite, Neon or Supabase branching) or run sharded suites against an in-memory backend.
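Several of the pitfalls above (viewport and headless drift between local and CI, unbounded retries) come down to runner configuration. A possible playwright.config.ts is sketched below, assuming Playwright is the chosen runner. The tests/e2e directory comes from the text; the E2E_BASE_URL variable and the localhost:3000 fallback are hypothetical and should be replaced with your own values.

```typescript
import { defineConfig } from '@playwright/test';

export default defineConfig({
  testDir: './tests/e2e',

  // One retry in CI for transient failures, none locally so flake is visible.
  retries: process.env.CI ? 1 : 0,

  // Fail the CI run if a stray test.only was committed.
  forbidOnly: !!process.env.CI,

  use: {
    // Hypothetical env var and fallback URL for the app under test.
    baseURL: process.env.E2E_BASE_URL ?? 'http://localhost:3000',

    // Same mode and viewport everywhere, so local and CI runs match.
    headless: true,
    viewport: { width: 1280, height: 720 },

    // Keep evidence for failing tests so it can be attached to the PR.
    trace: 'retain-on-failure',
    screenshot: 'only-on-failure',
  },
});
```

Sharding is selected on the command line (for example npx playwright test --shard=1/8), so isolation between shards still has to come from per-shard data, not from this file.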

How to verify you’re there

A random sample of UI‑touching PRs from the last 30 days: >80% contain evidence (screenshot, trace, log) that the agent ran the suite before pushing.
Branch protection on your default branch lists at least one Playwright (or Cypress) job and one visual‑regression job as required status checks.
Last 90 days of production front‑end incidents: <5% trace to a defect an E2E spec covering the affected flow would have caught.
The agent, in a fresh session, can state its E2E responsibility in one sentence (the standing instruction below).
tests/e2e/QUARANTINE.md has zero entries or every entry has an owner and a deadline less than two weeks out.
The most recent agent‑authored UI PR’s body contains screenshots from the agent’s local run and a link to the CI Playwright report.
A new engineer can run the full E2E suite locally in under 10 minutes from clone to green, using only tests/e2e/CLAUDE.md and the project README.
The AI metrics panel (Q22) shows pre‑merge catch rate >70% trending up, and post‑merge UI incident rate trending down quarter‑over‑quarter.
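The thresholds in this checklist can be computed from two small data sets. The sketch below is illustrative only: it assumes you can export, per UI-touching PR, whether agent-run evidence was attached, and, per E2E failure, where it was first caught. The type and function names (UiPullRequest, CaughtAt, computeMetrics, meetsTargets) are hypothetical; the target values (>80%, >70%, <5%) are the ones stated in the rollout steps.

```typescript
// Where an E2E-detectable failure was first caught.
type CaughtAt = 'agent-local' | 'ci' | 'production';

// Hypothetical export of one UI-touching pull request.
interface UiPullRequest {
  number: number;
  hasAgentRunEvidence: boolean; // screenshot, trace, or log in the PR body
}

interface E2eMetrics {
  agentRunRate: number;                  // share of PRs with evidence
  caughtShare: Record<CaughtAt, number>; // share of failures per stage
}

// Safe division: an empty sample yields 0 rather than NaN.
function share(part: number, whole: number): number {
  return whole === 0 ? 0 : part / whole;
}

function computeMetrics(prs: UiPullRequest[], failures: CaughtAt[]): E2eMetrics {
  const withEvidence = prs.filter((pr) => pr.hasAgentRunEvidence).length;
  const count = (where: CaughtAt): number =>
    failures.filter((f) => f === where).length;
  return {
    agentRunRate: share(withEvidence, prs.length),
    caughtShare: {
      'agent-local': share(count('agent-local'), failures.length),
      ci: share(count('ci'), failures.length),
      production: share(count('production'), failures.length),
    },
  };
}

// Targets from the text: >80% evidence, >70% caught pre-push, <5% in production.
// With no data at all, the targets are reported as unmet.
function meetsTargets(m: E2eMetrics): boolean {
  return (
    m.agentRunRate > 0.8 &&
    m.caughtShare['agent-local'] > 0.7 &&
    m.caughtShare.production < 0.05
  );
}
```

Feeding the last 30 days of PRs and failures into computeMetrics gives the numbers for the metrics panel, and meetsTargets gives a single pass/fail signal to trend over time.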