Load Testing and Performance Analysis

Your checkout endpoint holds up fine in staging, then falls over on the first marketing push. p95 latency climbs from 120ms to 4s, Postgres connections max out, and nobody can point to the change that did it. You need a load test that reproduces the spike, a read on which layer saturates first, and a regression gate in CI so the next deploy doesn’t surprise you at 9pm.

This guide shows how to drive that loop with AI: generate a realistic k6 script, run it, have the model read the output and the slow-query log, and wire a threshold gate into the pipeline. The same prompts work in Cursor, Claude Code, and Codex — the only thing that differs is how you invoke each tool.

What You’ll Walk Away With

A copy-paste prompt that generates a runnable k6 script with a ramping VU profile and thresholds for a specific endpoint
The official k6 MCP server (grafana/mcp-k6) wired into all three tools, plus k6 x agent init to bootstrap it in one command
A prompt that turns a k6 summary plus a Postgres slow-query log into a ranked, fix-first bottleneck list
A Playwright + Lighthouse prompt that measures Core Web Vitals (LCP, INP, CLS) under concurrent load
A CI regression gate that fails the build when p95 regresses, using k6 run exit codes
The failure modes that make load-test numbers lie — and how to catch them before you trust the dashboard

Wire Up the Tooling

The piece worth installing is the official k6 MCP server from Grafana (grafana/mcp-k6). It lets the agent validate scripts, run tests, and read k6 docs without you copy-pasting CLI output back and forth. It ships as a Go binary or Docker image — not an npm package, so ignore any npx-based “k6 MCP” you find; those are unofficial wrappers.

The fastest path is k6 x agent init (requires k6 v2.0+ on your PATH), which drops the right skill files into your editor and registers the MCP server for whichever tool you’re using:

# Bootstrap k6 skills + MCP config for one editor...
k6 x agent init claude-code   # or cursor, codex

# ...or wire up every supported editor at once
k6 x agent init --all

If you’d rather wire the MCP server up by hand, the install is identical across tools — only the registration command differs.

In Cursor, install the binary, then add the server in Settings → MCP → Add with mcp-k6 as the command (stdio). Or drop this into .cursor/mcp.json:

{
  "mcpServers": {
    "k6": { "command": "mcp-k6" }
  }
}

Install the binary first: brew tap grafana/grafana && brew install mcp-k6 (or docker pull grafana/mcp-k6:latest).

In Claude Code, install the binary and register the server from the shell:

brew tap grafana/grafana && brew install mcp-k6
claude mcp add --scope user --transport stdio k6 -- mcp-k6

# No local binary? Run it through Docker instead:
claude mcp add --scope user --transport stdio k6 -- docker run --rm -i grafana/mcp-k6

In Codex, add it to ~/.codex/config.toml:

[mcp_servers.k6]
command = "mcp-k6"
# or run via Docker:
# command = "docker"
# args = ["run", "--rm", "-i", "grafana/mcp-k6"]

Install the binary with brew install mcp-k6 after brew tap grafana/grafana.

For browser-side performance you’ll also want the Playwright MCP (@playwright/mcp), which follows the same pattern in every tool. In Claude Code:

claude mcp add playwright -- npx -y @playwright/mcp@latest

In Cursor add it via Settings → MCP; in Codex add a [mcp_servers.playwright] block pointing at the same npx command.
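
Once the browser side is reporting Core Web Vitals, the raw numbers are easier to act on when each one is rated. The sketch below rates a measurement against the published "good" and "poor" boundaries for LCP, INP, and CLS. The function name, the sample object, and its values are hypothetical and only illustrate the shape; how you collect the values under load is up to your Playwright or Lighthouse setup.

```javascript
// Published Core Web Vitals boundaries: [good upper bound, poor lower bound].
const VITAL_LIMITS = {
  LCP: [2500, 4000], // milliseconds
  INP: [200, 500],   // milliseconds
  CLS: [0.1, 0.25],  // unitless layout-shift score
};

// Rate one measurement as 'good', 'needs-improvement', or 'poor'.
function rateVital(name, value) {
  const limits = VITAL_LIMITS[name];
  if (!limits) throw new Error(`unknown vital: ${name}`);
  const [good, poor] = limits;
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}

// Hypothetical sample: values measured while the load test held at peak.
const underLoad = { LCP: 3100, INP: 180, CLS: 0.31 };
for (const [name, value] of Object.entries(underLoad)) {
  console.log(`${name}=${value} -> ${rateVital(name, value)}`);
}
```

Comparing the ratings from an idle run against a run at peak VUs shows whether the frontend degrades with the backend or independently of it.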

The Workflow

The loop is the same in every tool: generate a script, run it, feed the results back for analysis, then gate it in CI. What changes is the invocation.

Generate the script from a real endpoint, a target VU count, a ramp profile, and explicit thresholds — not “write a load test.”
Run it against staging (k6 run script.js), with the MCP server letting the agent execute and read output directly.
Analyze the results by handing the k6 summary plus the slow-query log to the model and asking for a ranked bottleneck list.
Gate it in CI so a p95 regression fails the build before it ships.
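
For the analysis step, it can help to pre-digest the slow-query log so the model sees grouped totals instead of thousands of raw lines. The following is a minimal sketch, assuming the log was produced with log_min_duration_statement enabled and uses the default "duration: … ms  statement: …" line shape. The function names and the sample log lines are made up for illustration.

```javascript
// Matches lines written when log_min_duration_statement is set, e.g.
// "... LOG:  duration: 1523.412 ms  statement: SELECT ..."
const SLOW_LINE = /duration: ([\d.]+) ms\s+(?:statement|execute [^:]*): (.*)$/;

// Collapse literals so the same query with different parameters groups together.
function normalize(sql) {
  return sql
    .replace(/'[^']*'/g, '?') // string literals
    .replace(/\b\d+\b/g, '?') // numeric literals
    .replace(/\s+/g, ' ')
    .trim();
}

// Group slow statements by normalized text and rank by total time spent.
function rankSlowQueries(logText) {
  const groups = new Map();
  for (const line of logText.split('\n')) {
    const match = SLOW_LINE.exec(line);
    if (!match) continue; // not a slow-statement line
    const ms = parseFloat(match[1]);
    const key = normalize(match[2]);
    const group = groups.get(key) || { query: key, count: 0, totalMs: 0, maxMs: 0 };
    group.count += 1;
    group.totalMs += ms;
    group.maxMs = Math.max(group.maxMs, ms);
    groups.set(key, group);
  }
  return [...groups.values()].sort((a, b) => b.totalMs - a.totalMs);
}

// Hypothetical log excerpt captured during the load-test window.
const sampleLog = [
  "2025-01-01 14:02:11 UTC [4242] LOG:  duration: 1500.5 ms  statement: SELECT * FROM orders WHERE cart_id = 'c_1'",
  "2025-01-01 14:02:12 UTC [4243] LOG:  duration: 900.25 ms  statement: SELECT * FROM orders WHERE cart_id = 'c_2'",
  "2025-01-01 14:02:13 UTC [4244] LOG:  duration: 2000 ms  statement: UPDATE inventory SET qty = qty - 1 WHERE sku = 42",
].join('\n');

console.log(rankSlowQueries(sampleLog));
```

Ranking by total time rather than by the single slowest statement surfaces queries that are individually tolerable but run often enough to saturate the connection pool.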

Here is how you kick off steps 1 and 2 in each tool. The prompt is identical: paste the load-test generation prompt below into any of them.

In Cursor, open the agent panel (Cmd+I), paste the prompt, and let it write tests/load/checkout.js. With the k6 MCP server connected, the agent can run k6 run itself and iterate on threshold failures inline. Use a checkpoint before the run so you can roll the script back if the generated VU profile is unrealistic.

claude "Generate a k6 load test at tests/load/checkout.js for POST /api/checkout:
ramp 0->200 VUs over 2m, hold 5m, ramp down 1m. Add thresholds:
http_req_duration p95<300 p99<800, http_req_failed rate<0.01. Then run it
against $STAGING_URL and summarize threshold pass/fail."

With the k6 MCP server registered, Claude Code runs the test through the server and reads the summary back without you copy-pasting.

In Codex, run it as a one-shot from the CLI, or kick it off in Codex Cloud so the long-running test executes off your machine:

codex --sandbox workspace-write -c approval_policy=on-request \
"Generate a k6 load test at tests/load/checkout.js for POST /api/checkout
(ramp 0->200 VUs over 2m, hold 5m), add p95<300/p99<800 thresholds, run it
against $STAGING_URL, and report which thresholds failed."

For a trusted unattended CI run, use codex exec --sandbox workspace-write -c approval_policy=never and grant staging network access deliberately in the isolated CI configuration. Setting approval_policy=never suppresses approval prompts but does not expand the sandbox; use least-privilege staging credentials and never point this job at production.

A generated script should come out looking roughly like this — concrete VU stages and thresholds, not a happy-path skeleton:

import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 200 }, // ramp to peak
    { duration: '5m', target: 200 }, // hold
    { duration: '1m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<300', 'p(99)<800'],
    http_req_failed: ['rate<0.01'],
  },
};

export default function () {
  const res = http.post(`${__ENV.STAGING_URL}/api/checkout`, JSON.stringify({
    cartId: 'c_load_test', paymentMethod: 'pm_test',
  }), { headers: { 'Content-Type': 'application/json' } });

  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time so 200 VUs != 200 RPS
}

k6 run exits non-zero when a threshold fails, which is exactly what makes the CI gate in the last step trivial.
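
The think-time comment in the script deserves a worked number. In a closed model like this one, each VU sends one request per iteration, so throughput is roughly the VU count divided by the iteration time (response time plus think time). The sketch below is a back-of-the-envelope estimate under that assumption, not something k6 computes for you; the function name is illustrative.

```javascript
// Rough closed-model throughput: each VU completes one request per iteration,
// and an iteration lasts (response time + think time) seconds.
function estimateRps(vus, responseSeconds, thinkSeconds) {
  return vus / (responseSeconds + thinkSeconds);
}

// Healthy service: 120 ms responses, 1 s think time.
console.log(estimateRps(200, 0.12, 1).toFixed(1)); // 178.6

// Degraded service: 4 s responses, same think time.
console.log(estimateRps(200, 4, 1).toFixed(1)); // 40.0
```

The second figure matters when reading results: as the server slows down, the same 200 VUs offer far less load, so a flat or falling request rate during the hold phase can be a symptom of saturation rather than proof of headroom.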

Copy-Paste Prompts

Wire the Regression Gate Into CI

Because k6 run returns a non-zero exit code when any threshold is breached, the CI step is just running the script — no custom comparison logic needed. Use the official grafana/setup-k6-action, not a hand-rolled curl install:

- uses: actions/checkout@v5
- uses: grafana/setup-k6-action@v1
- run: k6 run tests/load/checkout.js
  env:
    STAGING_URL: ${{ secrets.STAGING_URL }}
    TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}
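
The exit code tells CI that a threshold failed, but not which one. One possible way to surface that is a small helper that walks the end-of-test summary object k6 passes to handleSummary and lists the failed thresholds with their observed values. This is a sketch, assuming the summary shape where each metric carries values and a thresholds map with an ok flag; the helper name and the sample data are hypothetical. A percentile such as p(99) only appears in values if it is included in summaryTrendStats, so observed may be undefined.

```javascript
// Collect every threshold that did not pass, with the observed value.
function failedThresholds(summary) {
  const failures = [];
  for (const [metric, info] of Object.entries(summary.metrics)) {
    for (const [expr, result] of Object.entries(info.thresholds || {})) {
      if (result.ok) continue;
      // "p(95)<300" -> "p(95)"; "rate<0.01" -> "rate"
      const stat = expr.split(/[<>=!]/)[0].trim();
      failures.push({ metric, expr, observed: info.values[stat] });
    }
  }
  return failures;
}

// Hypothetical summary fragment in the shape handleSummary receives.
const sample = {
  metrics: {
    http_req_duration: {
      values: { 'p(95)': 412.7, 'p(99)': 780.2 },
      thresholds: { 'p(95)<300': { ok: false }, 'p(99)<800': { ok: true } },
    },
    http_req_failed: {
      values: { rate: 0.004 },
      thresholds: { 'rate<0.01': { ok: true } },
    },
    iterations: { values: { count: 1200 } }, // no thresholds defined
  },
};

for (const f of failedThresholds(sample)) {
  console.log(`FAILED ${f.metric} ${f.expr} (observed ${f.observed})`);
}
```

Inside a k6 script, the same function could be called from an exported handleSummary(data) and its output written to stdout or a file, which gives the agent a short failure list to start from instead of the full run log.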

To go further, connect the Sentry MCP so the agent can correlate threshold failures with the errors and slow transactions Sentry recorded during the run. The remote server is the simplest setup and is identical across tools:

# Official Sentry remote MCP (recommended)
claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

# Local stdio alternative, if you need it
claude mcp add sentry -- npx -y @sentry/mcp-server

Then ask: “For the load-test window 14:00–14:08 UTC, pull the slowest transactions and any new errors from Sentry and line them up against the k6 threshold failures.” That turns a red CI run into a specific list of transactions to fix.
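
The eight-minute window in that prompt is just the sum of the stage durations (2m + 5m + 1m). Rather than typing it by hand, the window can be derived from the same stages array the script uses. A minimal sketch follows; the helper names and the start timestamp are illustrative, and the real run may end slightly later than the computed time if k6 waits for in-flight iterations to finish.

```javascript
// Convert a k6-style duration string such as '2m', '90s', or '1m30s' to seconds.
function durationToSeconds(text) {
  const units = { h: 3600, m: 60, s: 1, ms: 0.001 };
  let total = 0;
  // 'ms' is listed first so '500ms' is not misread as 500 minutes.
  for (const [, amount, unit] of text.matchAll(/(\d+)(ms|h|m|s)/g)) {
    total += Number(amount) * units[unit];
  }
  return total;
}

// Compute the planned start/end of a run from its start time and stages.
function testWindow(startIso, stages) {
  const seconds = stages.reduce((sum, s) => sum + durationToSeconds(s.duration), 0);
  const start = new Date(startIso);
  const end = new Date(start.getTime() + seconds * 1000);
  return { start: start.toISOString(), end: end.toISOString(), seconds };
}

// Same stages as the checkout script; the start time is a made-up example.
const stages = [
  { duration: '2m', target: 200 },
  { duration: '5m', target: 200 },
  { duration: '1m', target: 0 },
];

console.log(testWindow('2025-01-01T14:00:00Z', stages));
```

Passing the computed start and end to the Sentry query keeps the correlation window aligned with the test even when someone later changes the ramp profile.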

When This Breaks

Load-test numbers lie in predictable ways. These are the failure modes that send teams chasing the wrong fix.

When a result looks impossible, hand the full k6 summary and the generator’s resource metrics to your AI tool and ask it to distinguish a real server bottleneck from a test artifact before you escalate.

What’s Next

Microservices Architecture — tracing a bottleneck across service boundaries
CI/CD Pipelines — wiring the regression gate into your full deployment flow
Testing Quality Framework — where load testing fits in a broader testing strategy
Large Codebase Management — performance considerations at enterprise scale
Infrastructure as Code — provisioning representative load-testing environments