Skip to content

Load Testing and Performance Analysis

Your checkout endpoint holds up fine in staging, then falls over on the first marketing push. p95 latency climbs from 120ms to 4s, Postgres connections max out, and nobody can point to the change that did it. You need a load test that reproduces the spike, a read on which layer saturates first, and a regression gate in CI so the next deploy doesn’t surprise you at 9pm.

This guide shows how to drive that loop with AI: generate a realistic k6 script, run it, have the model read the output and the slow-query log, and wire a threshold gate into the pipeline. The same prompts work in Cursor, Claude Code, and Codex — the only thing that differs is how you invoke each tool.

  • A copy-paste prompt that generates a runnable k6 script with a ramping VU profile and thresholds for a specific endpoint
  • The official k6 MCP server (grafana/mcp-k6) wired into all three tools, plus k6 x agent init to bootstrap it in one command
  • A prompt that turns a k6 summary plus a Postgres slow-query log into a ranked, fix-first bottleneck list
  • A Playwright + Lighthouse prompt that measures Core Web Vitals (LCP, INP, CLS) under concurrent load
  • A CI regression gate that fails the build when p95 regresses, using k6 run exit codes
  • The failure modes that make load-test numbers lie — and how to catch them before you trust the dashboard

The piece worth installing is the official k6 MCP server from Grafana (grafana/mcp-k6). It lets the agent validate scripts, run tests, and read k6 docs without you copy-pasting CLI output back and forth. It ships as a Go binary or Docker image — not an npm package, so ignore any npx-based “k6 MCP” you find; those are unofficial wrappers.

The fastest path is k6 x agent init (requires k6 v2.0+ on your PATH), which drops the right skill files into your editor and registers the MCP server for whichever tool you’re using:

Terminal window
# Bootstrap k6 skills + MCP config for one editor...
k6 x agent init claude-code # or cursor, codex
# ...or wire up every supported editor at once
k6 x agent init --all

If you’d rather wire the MCP server up by hand, the install is identical across tools — only the registration command differs.

Install the binary, then add it in Settings → MCP → Add with mcp-k6 as the command (stdio). Or drop this into .cursor/mcp.json:

{
"mcpServers": {
"k6": { "command": "mcp-k6" }
}
}

Install the binary first: brew tap grafana/grafana && brew install mcp-k6 (or docker pull grafana/mcp-k6:latest).

For browser-side performance you’ll also want the Playwright MCP (@playwright/mcp), which is the same setup in every tool:

Terminal window
claude mcp add playwright -- npx -y @playwright/mcp@latest

In Cursor add it via Settings → MCP; in Codex add a [mcp_servers.playwright] block pointing at the same npx command.

The loop is the same in every tool: generate a script, run it, feed the results back for analysis, then gate it in CI. What changes is the invocation.

  1. Generate the script from a real endpoint, a target VU count, a ramp profile, and explicit thresholds — not “write a load test.”
  2. Run it against staging (k6 run script.js), with the MCP server letting the agent execute and read output directly.
  3. Analyze the results by handing the k6 summary plus the slow-query log to the model and asking for a ranked bottleneck list.
  4. Gate it in CI so a p95 regression fails the build before it ships.

Here is how you kick off step 1 and 2 in each tool. The prompt is identical — paste the load-test generation prompt below into any of them.

Open the agent panel (Cmd+I), paste the prompt, and let it write tests/load/checkout.js. With the k6 MCP server connected, the agent can run k6 run itself and iterate on threshold failures inline. Use a checkpoint before the run so you can roll the script back if the generated VU profile is unrealistic.

A generated script should come out looking roughly like this — concrete VU stages and thresholds, not a happy-path skeleton:

import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 200 }, // ramp to peak
{ duration: '5m', target: 200 }, // hold
{ duration: '1m', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p(95)<300', 'p(99)<800'],
http_req_failed: ['rate<0.01'],
},
};
export default function () {
const res = http.post(`${__ENV.STAGING_URL}/api/checkout`, JSON.stringify({
cartId: 'c_load_test', paymentMethod: 'pm_test',
}), { headers: { 'Content-Type': 'application/json' } });
check(res, { 'status is 200': (r) => r.status === 200 });
sleep(1); // think time so 200 VUs != 200 RPS
}

k6 run exits non-zero when a threshold fails, which is exactly what makes the CI gate in the last step trivial.

Because k6 run returns a non-zero exit code when any threshold is breached, the CI step is just running the script — no custom comparison logic needed. Use the official grafana/setup-k6-action, not a hand-rolled curl install:

.github/workflows/load-test.yml
- uses: actions/checkout@v5
- uses: grafana/setup-k6-action@v1
- run: k6 run tests/load/checkout.js
env:
STAGING_URL: ${{ secrets.STAGING_URL }}
TOKEN: ${{ secrets.LOAD_TEST_TOKEN }}

To go further, connect the Sentry MCP so the agent can correlate threshold failures with the errors and slow transactions Sentry recorded during the run. The remote server is the simplest setup and is identical across tools:

Terminal window
# Official Sentry remote MCP (recommended)
claude mcp add --transport http sentry https://mcp.sentry.dev/mcp
# Local stdio alternative, if you need it
claude mcp add sentry -- npx -y @sentry/mcp-server

Then ask: “For the load-test window 14:00–14:08 UTC, pull the slowest transactions and any new errors from Sentry and line them up against the k6 threshold failures.” That turns a red CI run into a specific list of transactions to fix.

Load-test numbers lie in predictable ways. These are the failure modes that send teams chasing the wrong fix.

When a result looks impossible, hand the full k6 summary and the generator’s resource metrics to your AI tool and ask it to distinguish a real server bottleneck from a test artifact before you escalate.