Production Performance Optimization

Your checkout service was answering in 200ms last week. Tonight, during peak hours, p99 is sitting at 3 seconds, conversion is dropping, and the on-call channel is full of red alerts. You have three dashboards open and no idea which signal is the cause and which is the symptom. This is the moment an AI assistant wired to your cluster and observability backend earns its keep: it can cross-reference pod metrics, traces, and slow queries in one pass instead of you tabbing between tools.

This guide shows DevOps engineers and SREs how to drive that investigation with Cursor, Claude Code, and Codex — each connected to the same Kubernetes and observability MCP servers — and how to turn the findings into a change you can safely ship.

  • A working MCP setup connecting Cursor, Claude Code, and Codex to Kubernetes, your observability backend, and Postgres
  • A copy-paste prompt for diagnosing a p99 latency regression across services
  • A copy-paste prompt for fixing connection-pool exhaustion under peak load
  • A copy-paste prompt for tuning Horizontal Pod Autoscaler (HPA) behaviour so it stops flapping
  • A “when this breaks” checklist for the failure modes that make AI-assisted tuning go wrong (stale metrics, hallucinated numbers, optimizations that regress under real traffic)

Essential MCP Servers for Performance Work

The servers are the same across all three tools, so set them up once. Cursor and Claude Code read the JSON mcpServers shape shown below; Codex takes the same command, args, and env values in TOML, under mcp_servers in ~/.codex/config.toml. The handles below (@kubernetes, @dynatrace, @last9, @postgres) are referenced throughout the prompts in this guide.

Kubernetes — cluster metrics and pod state

Install once, then point the server at your kubeconfig:

npm install -g kubectl-mcp-server

{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubectl-mcp-server"],
      "env": {
        "KUBECONFIG": "/path/to/kubeconfig"
      }
    }
  }
}

Postgres — slow-query analysis and EXPLAIN plans

Use the maintained postgres-mcp server (crystaldba), which runs via uvx and reads the connection string from DATABASE_URI. Start in restricted (read-only) mode for production debugging:

{
  "mcpServers": {
    "postgres": {
      "command": "uvx",
      "args": ["postgres-mcp", "--access-mode=restricted"],
      "env": {
        "DATABASE_URI": "postgresql://user:password@localhost:5432/dbname"
      }
    }
  }
}

For the observability backend (the @dynatrace and @last9 handles), pick the one you already run. Both are real, maintained packages. Authentication is the part people most often get wrong, so the exact Dynatrace setup is shown below; it defaults to a credentials-light path.

{
  "mcpServers": {
    "dynatrace": {
      "command": "npx",
      "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"],
      "env": {
        "DT_ENVIRONMENT": "https://<env-id>.apps.dynatrace.com"
      }
    }
  }
}

With only DT_ENVIRONMENT set, auth is an in-browser OAuth Authorization Code flow on first use — that is the documented happy path, so this one variable is usually all you need. A non-interactive DT_PLATFORM_TOKEN is supported as an optional alternative for headless or CI setups.
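The two auth paths differ only in the env block, which a short Python sketch can make concrete. The helper name dynatrace_server is hypothetical; the package name and the two variable names come from the config above. It assumes a headless token, if you use one, arrives through the process environment rather than being hardcoded.

```python
import json
import os
from typing import Optional


def dynatrace_server(environment_url: str, platform_token: Optional[str] = None) -> dict:
    """Build the "dynatrace" entry of an mcpServers block.

    Without a token, the server falls back to the in-browser OAuth flow.
    """
    env = {"DT_ENVIRONMENT": environment_url}
    if platform_token:
        # Optional, for headless or CI setups only.
        env["DT_PLATFORM_TOKEN"] = platform_token
    return {
        "command": "npx",
        "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"],
        "env": env,
    }


# Interactive workstation: one variable, browser login on first use.
interactive = dynatrace_server("https://<env-id>.apps.dynatrace.com")

# CI runner: pass the token from the environment if it is set.
headless = dynatrace_server(
    "https://<env-id>.apps.dynatrace.com",
    os.environ.get("DT_PLATFORM_TOKEN"),
)

print(json.dumps({"mcpServers": {"dynatrace": interactive}}, indent=2))
```

Generating the entry this way keeps the token out of any file you might commit.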

This is the bread-and-butter workflow: something got slow, and you need to find the cause before the next traffic spike. The prompt is the same idea in each tool, but the invocation differs — Cursor uses @-handles in the agent panel, Claude Code and Codex take the request on the command line.

In the agent panel, reference the MCP servers by handle so Cursor pulls live data instead of guessing:

@kubernetes @last9 Our checkout-service p99 jumped from 200ms to 3s
during peak hours (18:00–21:00 UTC). Correlate the regression:
1. Pull checkout-service pod CPU/memory for the last 24h and flag
the time the p99 climbed.
2. From traces, identify which downstream span grew the most.
3. Check the Redis cache hit rate over the same window.
4. Tell me which signal is the cause and which are symptoms.
Do not propose fixes yet — I want the diagnosis and the evidence first.

The point of the “diagnosis first, no fixes” framing is to stop the model from jumping to a plausible-sounding remedy before it has read the data. You want it to cite the pod, the span, and the metric window — numbers you can confirm in the dashboard yourself.
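One way to check the model's "which span grew the most" answer is to rank the spans yourself from numbers read off the tracing UI. The sketch below is a minimal illustration in Python; the span names and latency values are hypothetical, and SpanStat and rank_by_growth are illustrative helpers, not part of any MCP server.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanStat:
    """p99 latency of one downstream span in two time windows (milliseconds)."""
    name: str
    baseline_p99_ms: float
    peak_p99_ms: float

    @property
    def growth_ms(self) -> float:
        """Latency added between the baseline window and the peak window."""
        return self.peak_p99_ms - self.baseline_p99_ms


def rank_by_growth(spans: list[SpanStat]) -> list[SpanStat]:
    """Order spans so the one that added the most latency comes first."""
    return sorted(spans, key=lambda s: s.growth_ms, reverse=True)


# Hypothetical values, copied by hand from the tracing dashboard.
spans = [
    SpanStat("redis.get", 2.0, 4.0),
    SpanStat("postgres.query", 50.0, 2600.0),
    SpanStat("payment-api.call", 120.0, 310.0),
]

for span in rank_by_growth(spans):
    print(f"{span.name}: +{span.growth_ms:.0f} ms")
```

If the span at the top of your own ranking is not the one the assistant named as the cause, treat the diagnosis as unconfirmed.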

A classic peak-hour failure: services throw “too many clients” errors, query latency spikes from 50ms to seconds, and the database becomes the bottleneck, not because it is slow but because connections are starved. AI is good here because the math (pool size × replicas per service, summed across services, compared against max_connections) is mechanical, and it can read your current settings from the cluster.

  1. Pull the current pool config and the Postgres limit.

    @kubernetes @postgres Eight services share one Postgres instance.
    Each runs pool min=5/max=25. During peak we get connection timeouts
    and "too many clients" errors. Read the current replica counts from
    the cluster and the database max_connections, then tell me how many
    connections we actually demand at peak vs what the DB allows.
  2. Ask for a concrete config, not a lecture. Once the model has the numbers, have it propose specific pool sizes and a PgBouncer setup — with the reasoning tied to your replica count.

  3. Apply and watch. Roll the change to one service first, watch the connection-count and error-rate metrics through the next peak, then widen.
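The arithmetic behind step 1 is simple enough to run yourself before trusting the model's numbers. The sketch below assumes every replica can open its full pool maximum at peak; the replica counts, the max_connections value, and the reserved headroom are hypothetical placeholders for the values you read from the cluster and the database.

```python
def peak_connection_demand(replicas_per_service: dict[str, int], pool_max: int) -> int:
    """Worst case: every replica of every service fills its pool."""
    return sum(replicas_per_service.values()) * pool_max


def max_safe_pool_size(total_replicas: int, max_connections: int, reserved: int = 10) -> int:
    """Largest per-replica pool that fits under the database limit.

    `reserved` keeps headroom for admin sessions, migrations, and monitoring.
    """
    usable = max_connections - reserved
    return max(usable // total_replicas, 0)


# Hypothetical: eight services with three replicas each, pool max=25.
replicas = {f"service-{i}": 3 for i in range(1, 9)}
total_replicas = sum(replicas.values())

demand = peak_connection_demand(replicas, pool_max=25)
safe_pool = max_safe_pool_size(total_replicas, max_connections=200)

print(f"peak demand: {demand} connections across {total_replicas} replicas")
print(f"largest pool that fits without a pooler: {safe_pool} per replica")
```

With these placeholder numbers the demand is 600 connections against a limit of 200, which is the gap a pooler such as PgBouncer is meant to close.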

When an HPA scales up aggressively but scales down slowly — or thrashes every few minutes — you burn money and still miss spikes. The fix is usually in the stabilization windows and the metric the HPA scales on, and AI can read your current HorizontalPodAutoscaler objects and reason about the behaviour you describe.

@kubernetes Our HPAs flap: payment-service scales up at CPU 70%
(min=3/max=10) but CPU spikes to 95% before scaling triggers, and
scale-down happens every 2-3 minutes. Read the current HPA specs,
explain why it's reactive, and propose new behavior.scaleUp /
behavior.scaleDown stabilizationWindowSeconds plus a target metric
that anticipates load better than raw CPU.

A good answer here gives you a real behavior block — for example a short scaleUp window (around 60s) and a longer scaleDown window (300s+) so the cluster stops yo-yoing — plus a suggestion to scale on request rate or a custom metric rather than CPU alone, because CPU lags the actual load.
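As a rough illustration of what such an answer might contain, the Python sketch below builds an autoscaling/v2 HorizontalPodAutoscaler manifest and prints it as JSON, which kubectl apply -f accepts. The window values follow the figures above; the scaling policies, the metric name http_requests_per_second, and the target of 100 requests per second per pod are assumptions. A Pods metric like this only works if a custom metrics adapter already exposes it in your cluster.

```python
import json


def hpa_manifest(name: str, min_replicas: int, max_replicas: int,
                 target_rps_per_pod: int) -> dict:
    """Build an autoscaling/v2 HPA that scales on request rate, not CPU."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": name},
        "spec": {
            "scaleTargetRef": {"apiVersion": "apps/v1", "kind": "Deployment", "name": name},
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [{
                "type": "Pods",
                "pods": {
                    # Hypothetical metric; must be served by a metrics adapter.
                    "metric": {"name": "http_requests_per_second"},
                    "target": {"type": "AverageValue",
                               "averageValue": str(target_rps_per_pod)},
                },
            }],
            "behavior": {
                # Short window: react to a spike within about a minute.
                "scaleUp": {
                    "stabilizationWindowSeconds": 60,
                    "policies": [{"type": "Percent", "value": 100, "periodSeconds": 60}],
                },
                # Long window: hold capacity for five minutes before shrinking.
                "scaleDown": {
                    "stabilizationWindowSeconds": 300,
                    "policies": [{"type": "Pods", "value": 1, "periodSeconds": 60}],
                },
            },
        },
    }


manifest = hpa_manifest("payment-service", min_replicas=3, max_replicas=10,
                        target_rps_per_pod=100)
print(json.dumps(manifest, indent=2))
```

Compare any block the assistant proposes against the spec it read from the cluster, and apply it to one service before rolling it wider.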

AI-assisted performance work fails in specific, recognizable ways. Know these before you trust a recommendation in production.

  • The MCP server can’t reach the cluster or backend. If the assistant returns vague, round-number metrics (“around 95% CPU”) instead of timestamped values, it’s almost certainly answering from training data, not your data. Confirm the connection: claude mcp list shows configured servers in Claude Code, and Cursor’s MCP settings panel shows a green/red status per server. Re-run the prompt only once data is actually flowing.
  • Hallucinated metrics. Even with a live connection, a model can invent a number to fill a gap in a query result. Treat every figure as a claim to verify in the source dashboard before you act on it. Add an explicit “cite exact values and timestamps” instruction to your prompts so that every figure the model reports is one you can look up.
  • The optimization regressed under real load. A change that looks great in staging can fall over at production concurrency — a smaller connection pool that’s fine at 1k req/s starves at 5k. Roll changes to one service or one replica first, watch through a real peak, and keep the previous config one kubectl rollout undo away.
  • The model optimized the wrong thing. If you skip the “diagnosis first” step, the assistant will happily tune a symptom. When a fix doesn’t move the top-line metric, go back to the diagnosis prompt and make it re-rank cause vs symptom with fresh data.
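The "verify every figure" habit from the checklist can be made mechanical. The sketch below is one possible approach in Python: it compares the values the assistant reported against values you read from the dashboard yourself and lists anything missing or outside a tolerance. The metric names, the numbers, and the 5% tolerance are all hypothetical.

```python
import math


def unverified_claims(claimed: dict[str, float], observed: dict[str, float],
                      rel_tol: float = 0.05) -> list[str]:
    """Return metrics whose claimed value is missing from, or disagrees with,
    what you observed in the source dashboard."""
    suspect = []
    for metric, value in claimed.items():
        actual = observed.get(metric)
        if actual is None or not math.isclose(value, actual, rel_tol=rel_tol):
            suspect.append(metric)
    return suspect


# Hypothetical figures: what the assistant said vs. what the dashboard shows.
claimed = {"p99_ms": 3000.0, "cpu_pct": 95.0, "cache_hit_rate": 0.92}
observed = {"p99_ms": 2950.0, "cpu_pct": 71.0}

suspect = unverified_claims(claimed, observed)
print("verify before acting:", suspect)
```

Anything this returns is a number to re-check by hand, or to send back to the assistant with a request for the exact query and timestamp behind it.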