Agent-driven E2E — the agent runs the browser before saying 'done'

Scorecard question (Q18): How do you E2E-cover “the feature works in a browser”?
Section: Parallelism & automation.
Max-score answer (3 pts): The agent itself runs E2E (Playwright MCP, browser-harness) and verifies the feature.

Why this matters in 2026

Type-checks and unit tests verify code. They do not verify features. When you finish a UI change and the only thing standing between “diff written” and “shipped” is npm run type-check && npm test, you’ve proven nothing about what a human visiting your page will actually see. The button might be there, wired to the right handler, with a passing unit test on the handler — and still produce a blank screen because the Astro hydration order changed, or a stray CSS rule moved it 600px to the right, or the dropdown that drives it is now opening behind a modal. Every developer who’s shipped a UI knows this category of “passed all my checks, broke in production” bug. Through 2024–2025, the answer to it was “well, write more Playwright tests” — which most teams didn’t, because hand-authoring E2E suites was tedious, the tests were flaky, and they slowed PRs down. In 2026 that calculation flipped. The agent that just wrote the diff can also drive a real browser, click the thing it just shipped, read the rendered page, and confirm the feature works — in seconds, on every change, without anyone authoring a test file.

The shift is structural, not cosmetic. Playwright 1.56 (October 2025) shipped Test Agents alongside a hardened Playwright MCP server, which means the same agent loop that edits your TypeScript can now navigate to localhost:4321, open the new modal, and verify the submit button is visible. browser-harness, agent-browser, and chrome-devtools MCP added similar surfaces for CDP-level control without writing a full test suite. The result: the cheapest possible E2E layer is no longer “write a Playwright spec and run it in CI”; it’s “ask the agent to verify the feature works before reporting done”. Teams that internalize this can drive the “passed all my checks, broke in production” category toward zero and reserve hand-authored E2E suites, which used to consume a quarter’s worth of QA cycles, for the few flows that need permanent coverage. Q18 is the question that measures whether you’ve made that jump. Scoring max means the agent itself drives a browser as part of finishing a UI task, not “we have a Playwright suite that runs in CI someday”, and definitely not “I’ll click around manually before merging”.

What “max score” actually looks like

A max-score Q18 setup has three layered properties. First, the agent doing your UI work has access to a real browser through one of the supported surfaces (Playwright MCP, browser-harness, agent-browser, or chrome-devtools MCP), wired in via .mcp.json for project scope, ~/.claude.json for user scope in Claude Code, or ~/.codex/config.toml for Codex. Second, the agent uses that surface as a normal step in finishing a UI task, not as a separate ritual. After it changes a component, it navigates to the dev server, takes a screenshot or accessibility snapshot, confirms the rendered output, and only then reports done. Third, and this is the step that separates 2 pts from 3, that verification is enforced by a Stop hook or a custom slash command, so it cannot be skipped on a tired Friday afternoon. The agent cannot say “done” without producing evidence (a screenshot, a passing assertion, a quoted DOM snippet) that the feature renders.

A typical maxed-out flow on a Claude Code project with Playwright MCP looks like this. The engineer says “add a confirmation dialog to the delete button on the settings page”. The agent reads the component, writes the change, runs npm run type-check && npm test. So far this is the same as 2 pts. Then — because a Stop hook checks “did you touch any .astro / .tsx / .jsx file under src/?” and the answer is yes — the hook injects a verification step. The agent calls browser_navigate("http://localhost:4321/settings"), takes an accessibility snapshot, calls browser_click({ ref: "delete-button" }), confirms the dialog text appears in the next snapshot, and only then writes a summary that includes the snapshot. The whole loop adds 10–20 seconds. It catches the entire class of bugs where the unit test passes but the rendered page is broken — hydration ordering, CSS regressions, missing imports, Astro island misconfigurations, accidental display:none. It also produces a screenshot or snapshot you can paste into the PR description, which doubles as documentation for the reviewer.
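
The “confirms the dialog text appears in the next snapshot” step is where vague verification tends to creep in. As a minimal sketch, assuming the snapshot is available as plain text (the function name, the sample snapshot, and the required strings are all hypothetical, not Playwright MCP output), the check can be reduced to explicit substring assertions:

```python
def verdict_from_snapshot(snapshot_text: str, required: list[str]) -> tuple[str, list[str]]:
    """Return ("SHIPPABLE" or "BROKEN", missing strings) for a text snapshot."""
    if not required:
        # No assertions means nothing was verified; refuse to pass silently.
        raise ValueError("no assertions given: nothing would be verified")
    # Case-insensitive match so "Are you sure?" also matches "ARE YOU SURE?".
    haystack = snapshot_text.lower()
    missing = [text for text in required if text.lower() not in haystack]
    return ("BROKEN" if missing else "SHIPPABLE", missing)


# Hypothetical snapshot excerpt; real accessibility snapshots are longer.
example_snapshot = """
- dialog "Delete account":
  - text: Are you sure?
  - button "Confirm"
  - button "Cancel"
"""

print(verdict_from_snapshot(example_snapshot, ["Are you sure?", "Confirm"]))
```

Raising on an empty assertion list is a deliberate choice here: it keeps a blank page from being reported as a pass.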

Compare that to lower tiers. 0 pts: no E2E at all — you ship if the unit tests pass and merge if it “looks fine on the staging URL”. 1 pt: a hand-authored Playwright/Cypress suite that runs in CI, but the agent doesn’t touch it and verification is “wait for CI to go green”. 2 pts: the agent can run Playwright when explicitly asked, but verification isn’t part of the default loop — you have to remember to say “now also verify in the browser”, and 4 times out of 10 you forget. 3 pts: the agent verifies by default, enforced by hooks; the diff is never reported done without browser evidence.
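
The tiers above amount to a small decision ladder. One way to write it down, as a sketch with hypothetical field names rather than an official scoring tool:

```python
from dataclasses import dataclass


@dataclass
class E2ESetup:
    # All three field names are illustrative assumptions.
    has_e2e_suite_in_ci: bool            # hand-authored Playwright/Cypress in CI
    agent_can_drive_browser: bool        # agent runs the browser when asked
    verification_enforced_by_hook: bool  # Stop hook makes it the default


def q18_points(setup: E2ESetup) -> int:
    """Map a setup to the 0-3 point tiers described in the text."""
    if setup.agent_can_drive_browser and setup.verification_enforced_by_hook:
        return 3
    if setup.agent_can_drive_browser:
        return 2
    if setup.has_e2e_suite_in_ci:
        return 1
    return 0


print(q18_points(E2ESetup(True, True, False)))
```

Note that a CI suite alone never scores above 1 in this reading; the jump to 2 and 3 depends entirely on what the agent does.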

Current landscape (web-search-verified)

The “agent drives the browser” surface in 2026 is a layered stack. At the bottom is Chrome DevTools Protocol (CDP), the same protocol Puppeteer and Playwright use internally. On top of CDP sit four agent-facing options that matter for Q18, each with a different sweet spot. You do not need all four. You need one wired into your primary agent — but understanding the trade-offs prevents picking the wrong one for your repo.

Playwright MCP (Microsoft official MCP server)

Playwright MCP is the official Model Context Protocol server from Microsoft, shipped as part of the Playwright project and verified at playwright.dev/docs/getting-started-mcp. It exposes Playwright’s browser automation as MCP tools that any MCP-capable agent (Claude Code, Codex CLI, Cursor) can call. The crucial design decision: by default it does not drive the agent with screenshots. Instead it returns the browser’s accessibility tree as a structured, text-based snapshot, typically 2–5 KB per page versus 100+ KB for a screenshot, which is roughly a 20–50x difference in what the model has to read. The agent reads the snapshot, picks a deterministic element reference (for example ref: "delete-button-7"), and the MCP tool dispatches the click against that exact node. The interaction is selectorless from the agent’s perspective but deterministic on the wire: there’s no fuzzy LLM-vision step in the middle, which keeps it fast, cheap, and reproducible.
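
The 20–50x figure follows directly from the sizes quoted above. A quick worked check, treating payload size as a rough proxy for token cost (an assumption; actual tokenization of text and images differs):

```python
def size_ratio_range(snapshot_kb: tuple[float, float], screenshot_kb: float) -> tuple[float, float]:
    """Return (smallest, largest) screenshot-to-snapshot size ratio."""
    smallest_snapshot, largest_snapshot = min(snapshot_kb), max(snapshot_kb)
    # The largest snapshot gives the smallest advantage, and vice versa.
    return screenshot_kb / largest_snapshot, screenshot_kb / smallest_snapshot


low, high = size_ratio_range((2, 5), 100)
print(f"{low:.0f}x to {high:.0f}x")
```

Because the screenshot figure is “100+ KB”, the upper end of the range is a floor rather than a ceiling.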

Install (Claude Code):

claude mcp add --transport http playwright http://localhost:8931/mcp
# or over stdio, launched on demand via npx:
claude mcp add playwright -- npx -y @playwright/mcp@latest

These are two different transports, not two steps. The first attaches Claude Code to an already-running Playwright MCP HTTP server (start it separately with npx @playwright/mcp@latest --port 8931); the second runs the server over stdio for ad-hoc dev. Pick one — you don’t need both.

Or via .mcp.json at the repo root, so the team picks it up on clone (this is the stdio form, launched on demand):

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["-y", "@playwright/mcp@latest"]
    }
  }
}

Once installed, the agent can call browser_navigate, browser_snapshot, browser_click, browser_type, browser_evaluate, etc. The 1.56 release added the three “Test Agents” — Planner, Generator, Healer — but for Q18 the primary use is in-place verification, not test authoring. Ask the agent to verify the page after a UI change; it navigates, snapshots, asserts, reports. If your stack is anywhere near “Microsoft, Playwright, accessibility-first”, this is the default pick.

chrome-devtools MCP

The chrome-devtools MCP server (mcp__plugin_chrome-devtools-mcp_chrome-devtools__* in many Claude Code setups) is the option when you specifically need DevTools-only surfaces: Lighthouse audits, performance traces, memory snapshots, network request inspection at the protocol level. It launches its own headless Chrome instance per agent session and exposes tools like lighthouse_audit, performance_start_trace, list_network_requests, take_memory_snapshot. For a vanilla “did the button work?” verification it’s overkill (Playwright MCP is faster and cheaper), but for “did the new feature regress LCP?” or “is the new analytics script firing on this page?” it’s the right tool. The trade-off of a fresh browser per session is that bot-protected sites (those behind Cloudflare or PerimeterX, and sites such as Skyscanner) captcha-wall it instantly. Use chrome-devtools MCP for your own localhost; use a session-attached browser for anything that needs the user’s logged-in state.

A common gotcha: chrome-devtools MCP can leave stale Chrome instances. If new_page errors with “browser is already running for …/chrome-devtools-mcp/chrome-profile”, run pkill -f "chrome-devtools-mcp/chrome-profile" and retry. Lock this into your Stop hook if you use the tool heavily.

browser-harness (CDP-based custom harness)

browser-harness is a CDP-based harness designed specifically for agent-driven workflows. It attaches to the user’s running Chrome (rather than launching its own), exposes a small Python API (new_tab, wait_for_load, click, screenshot, js, page_info, http_get, cdp), and runs as a long-lived daemon over a unix socket so the agent doesn’t pay startup cost per call. The shape of the tool call is a single bash heredoc with inline Python:

browser-harness <<'PY'
new_tab("http://localhost:4321/settings")
wait_for_load()
click(640, 412)         # coords from a prior screenshot()
screenshot()            # verify the dialog appeared
print(page_info())
PY

The defining strength is that it shares the user’s Chrome session. That’s the difference between getting through Cloudflare’s bot check and getting captcha-walled on the first request — the harness sees the same cookies, the same logged-in state, the same extensions as the human user. For verifying a logged-in flow on staging, or interacting with a third-party that bot-blocks fresh Chromium instances, this is the option. The other strength is that it’s deliberately small — coordinate clicks via Input.dispatchMouseEvent pass through iframes, shadow DOM, and cross-origin boundaries at the compositor level, which is more robust than DOM-based clicks on complex apps. The trade-off: clicking by coordinates is less stable than clicking by accessibility ref when the layout shifts, so Playwright MCP wins on pure-localhost component verification while browser-harness wins on cross-origin and authenticated flows.

agent-browser CLI

agent-browser is a general-purpose browser automation CLI built for AI agents, exposed as a skill (agent-browser skill in many Claude Code setups). It overlaps with browser-harness on the “attached to a real browser” axis but is more CLI-shaped and less Python-API-shaped. Use it when you want to script higher-level operations (“open this page, fill this form, take a screenshot, return JSON”) without writing the imperative steps yourself. The skill’s core workflow handles the discovery of available subcommands; run agent-browser skills get core to read the canonical workflow before driving it. For Q18 specifically, agent-browser is a good secondary surface to combine with Playwright MCP — Playwright for “is the new component on the page”, agent-browser for “did the whole signup flow still work end to end”.

When to ask the agent to verify vs trust the diff

Not every diff needs browser verification, and turning the screw too tight makes the loop annoying. A reasonable policy:

Always verify in browser: changes to .astro / .tsx / .jsx / .vue / .svelte under src/, changes to .css / tailwind.config.* that affect public pages, changes to API routes the frontend calls, changes to the paywall script.
Skip verification: pure backend changes (a DB migration with no UI surface), config-only changes, documentation MDX, internal scripts. The Stop hook should match by path so this is automatic.
Verify on staging instead of localhost: changes that depend on Cloudflare bindings, third-party OAuth, or edge runtime features that don’t run on npm run dev. Use browser-harness against the staging URL with the user’s logged-in session.

The hook decides, not the human. If the path glob matches, the agent verifies; if not, it doesn’t. This is the difference between a discipline and a hassle.
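
The policy above can be sketched as a pure function over changed paths. The suffix set comes from the text; the API prefix and the staging prefixes are hypothetical stand-ins for whatever marks edge-dependent code in your repo:

```python
from pathlib import PurePosixPath

UI_SUFFIXES = {".astro", ".tsx", ".jsx", ".vue", ".svelte", ".css"}
API_PREFIX = "src/pages/api/"                            # assumed API route location
STAGING_PREFIXES = ("src/pages/api/auth/", "wrangler.")  # assumed edge/OAuth markers


def verification_target(changed_paths: list[str]) -> str:
    """Return "staging", "localhost", or "skip" for a set of changed files."""
    # Anything that depends on bindings or OAuth cannot be trusted on localhost.
    if any(p.startswith(STAGING_PREFIXES) for p in changed_paths):
        return "staging"
    for p in changed_paths:
        path = PurePosixPath(p)
        if p.startswith("src/") and path.suffix in UI_SUFFIXES:
            return "localhost"
        if path.name.startswith("tailwind.config.") or p.startswith(API_PREFIX):
            return "localhost"
    # Migrations, docs, internal scripts: no UI surface to check.
    return "skip"


print(verification_target(["src/components/DeleteButton.tsx"]))
```

Keeping the decision in one function makes it easy to call from a hook and to adjust the globs without touching the hook itself.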

How hard should the check gate?

Browser verification is one check; the deeper question is how strictly the agent is forced to run it before it can stop. Anthropic’s own best-practices guidance frames this as an escalation — pick the weakest rung that fits the task:

In one prompt. Ask the agent to run the check and iterate in the same message. Works on any task today.
As a completion condition. In Claude Code, set the check as a /goal — a separate evaluator re-checks it after every turn and the agent keeps working until it holds.
As a deterministic gate. A Stop hook runs your check as a script and blocks the turn from ending until it passes (the step-by-step below builds exactly this).
By a second opinion. A verification subagent reviews the diff in a fresh context, so the model that wrote the code isn’t the one grading it.

The fourth rung deserves its own habit. A reviewer that sees only the diff and your acceptance criteria, not the reasoning that produced the change, catches what the author rationalizes past.

A reviewer told to find gaps will usually find some even when the work is sound; chasing every one leads to over-engineering. Scope it to correctness and requirements, and treat the rest as optional.
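
One way to keep the second opinion scoped is to build the reviewer’s prompt from the diff and the acceptance criteria only. This is a sketch; the wording of the prompt and the function name are assumptions, not a prescribed template:

```python
def build_review_prompt(diff: str, acceptance_criteria: list[str]) -> str:
    """Assemble a reviewer prompt that contains no author reasoning."""
    criteria = "\n".join(f"- {item}" for item in acceptance_criteria)
    return (
        "You are reviewing a change you did not write.\n"
        "Judge only correctness and whether each acceptance criterion is met.\n"
        "Put style or refactoring ideas under a separate 'Optional' heading.\n\n"
        f"Acceptance criteria:\n{criteria}\n\n"
        f"Diff:\n{diff}\n"
    )


prompt = build_review_prompt(
    diff="+ <ConfirmDialog onConfirm={handleDelete} />",
    acceptance_criteria=["Clicking Delete opens a confirmation dialog"],
)
print(prompt)
```

Separating the “Optional” bucket in the prompt gives you a place to park findings you do not intend to chase.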

What counts as “verified”

A green test suite is necessary, not sufficient. The bar that catches what tests miss is mergeability: would you actually merge this code? Code that runs but is unmaintainable (duplicated, unnamed, untestable) should fail the check, not pass because the assertions happened to be green. Make that explicit when the agent self-assesses. And close real coverage gaps by handing the agent exactly what is untested, rather than asking vaguely for “more tests”.
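
“Exactly what is untested” can come straight from a coverage report. The sketch below assumes an Istanbul-style json-summary shape (file paths as keys, plus a "total" entry, each with a lines.pct field); the sample data and the 80% threshold are illustrative:

```python
def untested_files(summary: dict, min_line_pct: float = 80.0) -> list[tuple[str, float]]:
    """List (path, line coverage %) below the threshold, worst first."""
    gaps = [
        (path, metrics["lines"]["pct"])
        for path, metrics in summary.items()
        if path != "total" and metrics["lines"]["pct"] < min_line_pct
    ]
    return sorted(gaps, key=lambda item: item[1])


# Hypothetical report contents.
sample_summary = {
    "total": {"lines": {"pct": 71.4}},
    "src/components/DeleteDialog.tsx": {"lines": {"pct": 35.0}},
    "src/lib/format.ts": {"lines": {"pct": 98.2}},
    "src/stores/session.ts": {"lines": {"pct": 0.0}},
}

for path, pct in untested_files(sample_summary):
    print(f"{path}: {pct}% of lines covered")
```

The resulting list is what goes into the agent’s prompt, in place of a general request for more tests.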

Step-by-step: making E2E verification a Stop-hook step

Pick a surface and confirm the agent can drive it. From your repo root, ask your primary agent: “list the browser tools you have available”. If Playwright MCP, chrome-devtools MCP, or browser-harness shows up, you’re set. If nothing browser-related is listed, install Playwright MCP first (claude mcp add playwright ... for Claude Code; equivalent step in ~/.codex/config.toml for Codex; MCP panel in Cursor Agents).
Wire the dev server into a known-stable URL. Confirm npm run dev brings up localhost:4321 (Astro default) reliably. If your project needs Cloudflare bindings, prefer npm run dev:cf (Wrangler + Workers) and adjust the URL accordingly. The agent will hit this URL during verification, so it has to come up reliably from a cold start.
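
“Comes up reliably from a cold start” is easy to test before the agent navigates anywhere. A minimal sketch, assuming the dev server listens on a TCP port and that an open port is an acceptable readiness signal (it does not prove the page renders):

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

A call such as wait_for_port("localhost", 4321) before the first navigation turns a confusing “page not found” snapshot into a clear “dev server never started” failure.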

Create a verification slash command. Add .claude/commands/verify-ui.md (or the equivalent in .cursor/skills/):

---
description: Verify the UI change works in a real browser
argument-hint: "[URL path, default /]"
---

Make sure the dev server is running on http://localhost:4321 (start it with
`npm run dev` in a background bash if it isn't).

Navigate to http://localhost:4321$ARGUMENTS using your browser MCP tool
(Playwright MCP preferred, browser-harness as fallback).

Take an accessibility snapshot of the rendered page. Confirm:
- The component I just changed is present and visible.
- No console errors are reported.
- No layout-shift warnings.
- If the change adds an interaction (button, modal, form), trigger it once
  and verify the resulting state.

Output the snapshot summary plus a one-line verdict: SHIPPABLE or BROKEN.

This makes verification a single keystroke even before the hook automates it.

Add the Stop hook. In ~/.claude/settings.json (or .claude/settings.json if it should be repo-scoped):

{
  "hooks": {
    "Stop": [
      {
        "matcher": "",
        "hooks": [
          {
            "type": "command",
            "command": "bash -c 'git diff --name-only HEAD | grep -qE \"\\.(astro|tsx?|jsx?|vue|svelte|css)$\" && echo \"UI files changed: run /verify-ui before reporting done\" || true'"
          }
        ]
      }
    ]
  }
}

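The hook fires on every Stop. If UI files changed, it prints a reminder to run /verify-ui before the turn ends. One caveat: per the Claude Code hooks documentation, stdout from a hook that exits 0 goes to the transcript rather than back to the model, so this version nudges whoever is watching the session; the agent itself only receives the message once the hook exits with code 2 and writes to stderr. The first version is intentionally a reminder, not a hard gate, so the agent (or you) can decide which path to verify. Tighten it once the rhythm is in place.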
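
Once the one-liner gets unwieldy, the same logic can live in a script. The sketch below keeps the decision in a pure function so it can be tested; the block flag, the function names, and the message text are assumptions, while the stop_hook_active field and the exit-code-2 convention follow the Claude Code hooks documentation:

```python
import json
import re
import subprocess
import sys

UI_FILE = re.compile(r"\.(astro|tsx?|jsx?|vue|svelte|css)$")


def decide(hook_input: dict, changed_files: list[str], block: bool = False) -> tuple[int, str]:
    """Return (exit_code, message) for one Stop hook run."""
    # If the agent is already continuing because of a Stop hook, let it stop;
    # otherwise a blocking hook would loop forever.
    if hook_input.get("stop_hook_active"):
        return 0, ""
    ui_files = [f for f in changed_files if UI_FILE.search(f)]
    if not ui_files:
        return 0, ""
    message = "UI files changed: run /verify-ui before reporting done (" + ", ".join(ui_files) + ")"
    return (2 if block else 0), message


def changed_since_head() -> list[str]:
    """Tracked files that differ from HEAD (untracked files are not listed)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"], capture_output=True, text=True, check=False
    )
    return [line for line in result.stdout.splitlines() if line]


def main() -> int:
    raw = sys.stdin.read()
    hook_input = json.loads(raw) if raw.strip() else {}
    code, message = decide(hook_input, changed_since_head(), block=False)
    if message:
        # Exit code 2 with stderr is what gets fed back to the agent.
        print(message, file=sys.stderr if code == 2 else sys.stdout)
    return code


print(decide({}, ["src/pages/settings.astro"], block=True))
```

To use this as the actual hook, a script file would end with sys.exit(main()) and the settings entry would point at it. Note the trade-off in the loop guard: after one blocked stop the agent is allowed to finish, whether or not it ran the verification.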

Run the loop on a real ticket. Pick a small but real UI ticket (a button label change, a new form field, a dialog tweak). Let the agent edit, type-check, and test. Watch it hit the Stop hook, run /verify-ui /the-affected-page, and produce a snapshot. Read the snapshot. Does it actually confirm the change? If yes, you’ve maxed Q18 on this repo. If no, the slash command needs sharpening — usually a better assertion on “the new element is present with the new text”.
Capture the screenshot/snapshot in the PR. Pipe the snapshot output into the PR body. Reviewers stop asking “did you actually test this?” because the answer is right there. This is the second-order productivity win — verification becomes documentation.
Add the second tier: cross-origin / authenticated flows. Once localhost verification is automatic, add a /verify-staging command that hits your staging URL via browser-harness with the user’s logged-in Chrome session. Use it for changes that touch auth, payments, or third-party integrations.
Tighten the hook over time. When the team is comfortable, escalate the Stop hook from “reminder” to “blocker”: exit code 2 with a STOP: verify in browser before finishing message on stderr, which Claude Code feeds back to the agent. The agent treats this as a hard signal to call /verify-ui before the turn ends. Guard the blocker with the stop_hook_active field from the hook input so it fires once per turn instead of looping. Don’t escalate before the muscle is built; you’ll just annoy people into disabling the hook.
Wire CI to the same Playwright commands. Once the agent is producing reliable verification locally, the flows it verifies can be promoted to Playwright specs behind an npm run test:e2e step in CI. The CI version doesn’t replace the agent verification; it backs it up for the cases where someone bypasses the hook or commits from a different machine.
Quarterly retro. Every quarter, look at the bugs that did slip into production. How many were “UI broke despite passing tests”? In a maxed-Q18 setup, this category should be near zero. If it isn’t, the verification slash command is missing an assertion — usually around interactivity (the element renders but the click handler is wrong) or responsiveness (it works at desktop width but breaks on mobile).

Common pitfalls

Mocking the whole frontend in unit tests and calling that “covered”. A 90%-coverage Vitest suite that mocks the router, the API client, and the global store proves the unit logic works in isolation but proves nothing about whether the assembled page renders. Browser verification is not a substitute for unit tests; it’s the layer above them. Both are needed.
Hand-authoring Playwright tests instead of letting the agent verify. Through 2024–2025 the standard advice was “write more Playwright tests”. In 2026 with Test Agents and MCP, hand-authoring is the slow path. Author tests for the flows that warrant permanent CI coverage (signup, checkout, login). Let the agent verify everything else ad-hoc on every change.
Driving a fresh Chromium against bot-protected sites. Playwright MCP and chrome-devtools MCP launch fresh browsers, which trip Cloudflare/PerimeterX/Skyscanner-style bot walls instantly. If your verification needs a real user session, use browser-harness or attached Playwright — they share the user’s existing Chrome session.
Coordinate clicks from a stale or downscaled screenshot. browser-harness clicks in viewport coordinates, not screenshot pixels. When screenshot() returns a downscaled image, scale the coordinates back up to viewport size before click(x, y), or skip coordinates entirely and click via js("document.querySelector('button').click()") with a selector specific enough to hit the intended element. Retake the screenshot after any layout change. This is the most common “verification said success but the click didn’t happen” failure mode.
No wait_for_load() after navigation. Async hydration means the DOM is mounted but interactivity isn’t wired up yet. Always call wait_for_load() (browser-harness) or browser_wait_for (Playwright MCP) before asserting on interactive state, otherwise you’ll get false negatives that look exactly like real bugs.
Letting the agent click random things to “verify”. The verification slash command needs concrete assertions: “the button labeled ‘Confirm’ is visible”, “the modal contains the text ‘Are you sure?’”, not “the page looks fine”. Without explicit assertions, the agent will report success on a blank page.
Skipping verification on “trivial” changes. “It’s just a color change” is the most common precursor to a production CSS regression. The Stop hook should fire on any UI file change, full stop. Twenty seconds of verification beats two hours of post-mortem.
Treating Q18 as “we have Cypress in CI”. A CI E2E suite is good but doesn’t max Q18. The question is specifically about whether the agent verifies during the work — the feedback loop that catches the bug before it hits CI, before it hits a reviewer, before it hits production.
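
For the coordinate pitfall above, the conversion from screenshot pixels to viewport coordinates is a plain proportional scale. A minimal sketch, assuming you know both the screenshot size and the viewport size (the function name and the example sizes are illustrative):

```python
def to_viewport_coords(
    pixel_x: int,
    pixel_y: int,
    screenshot_size: tuple[int, int],
    viewport_size: tuple[int, int],
) -> tuple[int, int]:
    """Map a point picked on a screenshot to viewport coordinates."""
    shot_w, shot_h = screenshot_size
    view_w, view_h = viewport_size
    # Scale each axis independently in case the aspect ratios differ slightly.
    return round(pixel_x * view_w / shot_w), round(pixel_y * view_h / shot_h)


# A point at (320, 206) on a 640x400 screenshot of a 1280x800 viewport.
print(to_viewport_coords(320, 206, (640, 400), (1280, 800)))
```

The same function covers the opposite case, where a high-DPI screenshot is larger than the viewport and the coordinates need to shrink.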

How to verify you’re there

Your primary agent has at least one browser-driving surface installed and listed in its tools (Playwright MCP, browser-harness, agent-browser, or chrome-devtools MCP).
A Stop hook (or equivalent in Codex / Cursor) fires on UI file changes and reminds the agent to verify.
A /verify-ui slash command exists and runs reliably on a cold-start dev server.
The last five UI PRs in your repo contain a screenshot or accessibility snapshot in the PR body.
You can name a regression in the last quarter that browser verification caught before merge — and zero regressions that “passed all checks then broke in production”.
New hires get the same verification flow on git clone — the slash command and the hook are both in the repo, not just in your personal ~/.claude/.
The agent says “I verified this in the browser at /settings and the dialog appears as expected” without being asked.
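
The repo-level items on this checklist can be checked mechanically. The sketch below assumes the file locations used earlier in this guide (.mcp.json, .claude/commands/verify-ui.md, .claude/settings.json); it only confirms the files exist and have the expected top-level keys, not that they work:

```python
import json
from pathlib import Path


def _load_json(path: Path) -> dict:
    """Read a JSON object from disk, returning {} if missing or invalid."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def repo_checklist(root: Path) -> dict[str, bool]:
    """Report which in-repo pieces of the verification flow are present."""
    mcp = _load_json(root / ".mcp.json")
    settings = _load_json(root / ".claude" / "settings.json")
    hooks = settings.get("hooks")
    return {
        "an MCP server is declared in .mcp.json": bool(mcp.get("mcpServers")),
        "/verify-ui command is in the repo": (root / ".claude" / "commands" / "verify-ui.md").is_file(),
        "a Stop hook is in repo settings": isinstance(hooks, dict) and bool(hooks.get("Stop")),
    }


for item, ok in repo_checklist(Path(".")).items():
    print(("ok  " if ok else "MISSING  ") + item)
```

The remaining items (PR evidence, caught regressions, unprompted verification) are about behavior over time and still need a human look.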