Production Performance Optimization
Your checkout service was answering in 200ms last week. Tonight, during peak hours, p99 is sitting at 3 seconds, conversion is dropping, and the on-call channel is full of red alerts. You have three dashboards open and no idea which signal is the cause and which is the symptom. This is the moment an AI assistant wired to your cluster and observability backend earns its keep: it can cross-reference pod metrics, traces, and slow queries in one pass instead of you tabbing between tools.
This guide shows DevOps engineers and SREs how to drive that investigation with Cursor, Claude Code, and Codex — each connected to the same Kubernetes and observability MCP servers — and how to turn the findings into a change you can safely ship.
What You’ll Walk Away With
- A working MCP setup connecting Cursor, Claude Code, and Codex to Kubernetes, your observability backend, and Postgres
- A copy-paste prompt for diagnosing a p99 latency regression across services
- A copy-paste prompt for fixing connection-pool exhaustion under peak load
- A copy-paste prompt for tuning Horizontal Pod Autoscaler (HPA) behaviour so it stops flapping
- A “when this breaks” checklist for the failure modes that make AI-assisted tuning go wrong (stale metrics, hallucinated numbers, optimizations that regress under real traffic)
Essential MCP Servers for Performance Work
These servers are the same across all three tools: the server definitions (command, arguments, environment variables) carry over whether you run Cursor, Claude Code, or Codex, although each tool keeps them in its own config file (Codex, for example, uses TOML rather than the JSON shown here). Set them up once. The handles below (@kubernetes, @dynatrace, @last9, @postgres) are referenced throughout the prompts in this guide.
Kubernetes — cluster metrics and pod state
Install once, then point the server at your kubeconfig:
npm install -g kubectl-mcp-server

{
  "mcpServers": {
    "kubernetes": {
      "command": "npx",
      "args": ["-y", "kubectl-mcp-server"],
      "env": { "KUBECONFIG": "/path/to/kubeconfig" }
    }
  }
}

Postgres — slow-query analysis and EXPLAIN plans
Use the maintained postgres-mcp server (crystaldba), which runs via uvx and reads the connection string from DATABASE_URI. Start in restricted (read-only) mode for production debugging:
{
  "mcpServers": {
    "postgres": {
      "command": "uvx",
      "args": ["postgres-mcp", "--access-mode=restricted"],
      "env": { "DATABASE_URI": "postgresql://user:password@localhost:5432/dbname" }
    }
  }
}

Observability backend — traces and APM
Pick the one you already run: Dynatrace or Last9. Both are real, maintained packages. Authentication is the part people most often get wrong, so the exact setup is below, and both default to a credentials-light path.
Dynatrace:

{
  "mcpServers": {
    "dynatrace": {
      "command": "npx",
      "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"],
      "env": { "DT_ENVIRONMENT": "https://<env-id>.apps.dynatrace.com" }
    }
  }
}

With only DT_ENVIRONMENT set, auth is an in-browser OAuth Authorization Code flow on first use. That is the documented happy path, so this one variable is usually all you need. A non-interactive DT_PLATFORM_TOKEN is supported as an optional alternative for headless or CI setups.
For Last9, the maintainer’s recommended path is the hosted HTTP endpoint, which uses in-browser OAuth and needs no env vars at all:
claude mcp add --transport http last9 \
  https://app.last9.io/api/v4/organizations/<org_slug>/mcp

If you prefer the local stdio binary, it needs exactly one variable, a refresh token:
{
  "mcpServers": {
    "last9": {
      "command": "npx",
      "args": ["-y", "@last9/mcp-server@latest"],
      "env": { "LAST9_REFRESH_TOKEN": "<write_refresh_token>" }
    }
  }
}

Generate LAST9_REFRESH_TOKEN (with Write permission, admin-only) under Settings → API Access at app.last9.io/settings/api-access, not the OTLP integration page. The only other variable is the optional LAST9_API_HOST (defaults to app.last9.io), for self-managed hosts.
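Each snippet in this section carries its own mcpServers root, so a single JSON config file needs them combined under one key. The following is a minimal Python sketch of that merge, not a tool-specific installer: the helper name merge_mcp_fragments is hypothetical, the values are the placeholders used in this guide, and where you save the result depends on the tool you run.

```python
import json

# Fragments copied from this section; all values are the guide's placeholders.
FRAGMENTS = [
    {"mcpServers": {"kubernetes": {
        "command": "npx",
        "args": ["-y", "kubectl-mcp-server"],
        "env": {"KUBECONFIG": "/path/to/kubeconfig"},
    }}},
    {"mcpServers": {"postgres": {
        "command": "uvx",
        "args": ["postgres-mcp", "--access-mode=restricted"],
        "env": {"DATABASE_URI": "postgresql://user:password@localhost:5432/dbname"},
    }}},
    {"mcpServers": {"last9": {
        "command": "npx",
        "args": ["-y", "@last9/mcp-server@latest"],
        "env": {"LAST9_REFRESH_TOKEN": "<write_refresh_token>"},
    }}},
]


def merge_mcp_fragments(fragments):
    """Combine per-server snippets into one {"mcpServers": {...}} document.

    Hypothetical helper: refuses duplicate handles so one server cannot
    silently overwrite another.
    """
    merged = {}
    for fragment in fragments:
        for name, server in fragment["mcpServers"].items():
            if name in merged:
                raise ValueError(f"duplicate MCP server handle: {name}")
            merged[name] = server
    return {"mcpServers": merged}


config = merge_mcp_fragments(FRAGMENTS)
print(json.dumps(config, indent=2))
```

Keeping real tokens and connection strings out of the merged file (for example by substituting them from the environment at write time) is a sensible extension, but it is outside what this sketch does.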
Diagnosing a p99 Latency Regression
This is the bread-and-butter workflow: something got slow, and you need to find the cause before the next traffic spike. The prompt is the same idea in each tool, but the invocation differs: Cursor uses @-handles in the agent panel, while Claude Code and Codex take the request on the command line.
In the agent panel, reference the MCP servers by handle so Cursor pulls live data instead of guessing:
@kubernetes @last9 Our checkout-service p99 jumped from 200ms to 3s
during peak hours (18:00–21:00 UTC). Correlate the regression:

1. Pull checkout-service pod CPU/memory for the last 24h and flag the time the p99 climbed.
2. From traces, identify which downstream span grew the most.
3. Check the Redis cache hit rate over the same window.
4. Tell me which signal is the cause and which are symptoms.

Do not propose fixes yet. I want the diagnosis and the evidence first.

Run it from the terminal; Claude Code calls the same MCP servers:
claude "Using the kubernetes and last9 MCP servers, diagnose why \
checkout-service p99 jumped from 200ms to 3s during 18:00-21:00 UTC. \
Correlate pod CPU/memory, the slowest downstream trace span, and the \
Redis cache hit rate. Identify cause vs symptom with evidence. \
Do not propose fixes yet."

Codex takes the prompt positionally and reaches the MCP servers the same way. Keep it read-only for a diagnosis run:
codex --ask-for-approval on-request "Using the kubernetes and last9 \
MCP servers, diagnose the checkout-service p99 regression (200ms -> 3s, \
18:00-21:00 UTC). Correlate pod CPU/memory, the slowest downstream span, \
and the Redis cache hit rate. Report cause vs symptom with evidence only."

The point of the “diagnosis first, no fixes” framing is to stop the model from jumping to a plausible-sounding remedy before it has read the data. You want it to cite the pod, the span, and the metric window: numbers you can confirm in the dashboard yourself.
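To check the assistant's "which downstream span grew the most" claim yourself, you can compare per-span p99 between a quiet window and the peak window. The sketch below is one way to do that in Python, assuming you have exported span durations from your tracing backend; the span names and all sample values are hypothetical, and the nearest-rank percentile is a simplification of what an APM tool computes.

```python
def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of durations in milliseconds."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.99 * n gives the 1-based nearest rank.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def rank_span_growth(baseline, peak):
    """Return (span, baseline_p99, peak_p99, ratio) rows, largest growth first."""
    rows = []
    for span in baseline:
        before, after = p99(baseline[span]), p99(peak[span])
        rows.append((span, before, after, after / before))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical span durations (ms) for a quiet window and the peak window.
baseline = {
    "postgres.query": [40, 45, 50, 55, 60],
    "redis.get": [1, 2, 2, 3, 3],
    "payment-api.call": [90, 100, 110, 120, 130],
}
peak = {
    "postgres.query": [400, 900, 1500, 2200, 2600],
    "redis.get": [2, 2, 3, 4, 6],
    "payment-api.call": [95, 105, 115, 125, 140],
}

for span, before, after, ratio in rank_span_growth(baseline, peak):
    print(f"{span:18s} p99 {before:>5} ms -> {after:>5} ms  ({ratio:.1f}x)")
```

The span at the top of the ranking is a candidate cause, not a verdict: a span can grow because something upstream of it is starved, which is why the prompt also asks for pod metrics and the cache hit rate.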
Fixing Connection-Pool Exhaustion
A classic peak-hour failure: services throw “too many clients” errors, query latency spikes from 50ms to seconds, and the database is the bottleneck, not because it is slow but because connections are starved. AI is good here because the math (pool size × replica count vs max_connections) is mechanical, and it can read your current settings from the cluster.
1. Pull the current pool config and the Postgres limit.

   Cursor:

   @kubernetes @postgres Eight services share one Postgres instance.
   Each runs pool min=5/max=25. During peak we get connection timeouts
   and "too many clients" errors. Read the current replica counts from
   the cluster and the database max_connections, then tell me how many
   connections we actually demand at peak vs what the DB allows.

   Claude Code:

   claude "Using the kubernetes and postgres MCP servers, read the \
   replica count for each service and Postgres max_connections. Compute \
   peak connection demand (pool max x replicas, summed) vs the limit. \
   Show the arithmetic."

   Codex:

   codex --ask-for-approval on-request "Using the kubernetes and \
   postgres MCP servers, compute peak Postgres connection demand \
   (pool max x replicas per service, summed) against max_connections. \
   Show the arithmetic and where it overflows."

2. Ask for a concrete config, not a lecture. Once the model has the numbers, have it propose specific pool sizes and a PgBouncer setup, with the reasoning tied to your replica count.

3. Apply and watch. Roll the change to one service first, watch the connection-count and error-rate metrics through the next peak, then widen.
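The arithmetic the prompts ask for is simple enough to reproduce by hand, which makes it a good check on the model's answer. A minimal Python sketch follows; the pool maximum of 25 comes from the prompt above, while the service names, replica counts, and the max_connections value of 100 (the stock Postgres default) are assumptions to replace with what your cluster and database report.

```python
# Pool maximum per replica, taken from the prompt above (max=25).
POOL_MAX = 25

# Hypothetical replica counts for the eight services; read the real ones
# from the cluster.
replicas = {
    "checkout": 6, "payment": 4, "cart": 4, "catalog": 3,
    "search": 3, "auth": 2, "orders": 4, "notifications": 2,
}

# Assumed limit: 100 is the stock Postgres default. Confirm with
# SHOW max_connections on your instance.
MAX_CONNECTIONS = 100


def peak_demand(replica_counts, pool_max):
    """Worst case: every replica of every service fills its pool."""
    return sum(count * pool_max for count in replica_counts.values())


demand = peak_demand(replicas, POOL_MAX)
overflow = max(0, demand - MAX_CONNECTIONS)
print(f"peak demand: {demand} connections, limit: {MAX_CONNECTIONS}")
print(f"overflow: {overflow}")
```

With these made-up replica counts the worst case is 28 replicas × 25 connections = 700, seven times the assumed limit, which is the kind of gap that motivates a pooler such as PgBouncer rather than simply raising max_connections.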
Tuning Autoscaling That Flaps
When an HPA scales up aggressively but scales down slowly, or thrashes every few minutes, you burn money and still miss spikes. The fix is usually in the stabilization windows and the metric the HPA scales on, and AI can read your current HorizontalPodAutoscaler objects and reason about the behaviour you describe.
Cursor:

@kubernetes Our HPAs flap: payment-service scales up at CPU 70%
(min=3/max=10) but CPU spikes to 95% before scaling triggers, and
scale-down happens every 2-3 minutes. Read the current HPA specs,
explain why it's reactive, and propose new behavior.scaleUp /
behavior.scaleDown stabilizationWindowSeconds plus a target metric
that anticipates load better than raw CPU.

Claude Code:

claude "Using the kubernetes MCP server, read the HorizontalPodAutoscaler \
specs. payment-service scales at CPU 70% but spikes to 95% before \
triggering and scales down every 2-3 min. Propose behavior.scaleUp / \
scaleDown stabilizationWindowSeconds and a better target metric. \
Output the patched HPA YAML."

Codex:

codex --ask-for-approval on-request "Using the kubernetes MCP server, \
read the HPA specs and fix the flapping payment-service HPA (CPU 70% \
target, spikes to 95%, scales down every 2-3 min). Propose scaleUp/ \
scaleDown stabilizationWindowSeconds and a request-rate target metric. \
Output patched YAML; do not apply it."

A good answer here gives you a real behavior block, for example a short scaleUp window (around 60s) and a longer scaleDown window (300s+) so the cluster stops yo-yoing, plus a suggestion to scale on request rate or a custom metric rather than CPU alone, because CPU lags the actual load.
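To see why a longer scaleDown window damps flapping, it helps to model what the window does: for scale-down, the autoscaler acts on the highest replica recommendation seen within the window rather than on the latest one. The Python sketch below is a deliberately simplified model of that rule, not the controller's actual algorithm; the recommendation series and the one-sample-per-minute cadence are hypothetical.

```python
def apply_stabilization(recommendations, window):
    """Hold the highest recommendation seen in the last `window` samples.

    Simplified model of scale-down stabilization: the real controller also
    accounts for current replicas, scaling policies, and min/max bounds.
    """
    applied = []
    for i in range(len(recommendations)):
        start = max(0, i - window + 1)
        applied.append(max(recommendations[start:i + 1]))
    return applied


def replica_changes(series):
    """Count how many times the replica count changes between samples."""
    return sum(1 for a, b in zip(series, series[1:]) if a != b)


# Hypothetical desired-replica recommendations, one per minute, driven by a
# spiky CPU signal that settles after eight minutes.
recommendations = [8, 4, 8, 4, 8, 4, 8, 4, 3, 3, 3, 3, 3, 3]

no_window = apply_stabilization(recommendations, window=1)     # act on every sample
five_minutes = apply_stabilization(recommendations, window=5)  # roughly 300s here

print(replica_changes(no_window), no_window)
print(replica_changes(five_minutes), five_minutes)
```

In this toy series the unstabilized HPA changes replica count eight times, while the five-sample window changes it twice and still reaches the lower count once the load has genuinely settled. The trade-off is that capacity stays high for a few extra minutes after each spike.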
When This Breaks
AI-assisted performance work fails in specific, recognizable ways. Know these before you trust a recommendation in production.
- The MCP server can’t reach the cluster or backend. If the assistant returns vague, round-number metrics (“around 95% CPU”) instead of timestamped values, it’s almost certainly answering from training data, not your data. Confirm the connection: claude mcp list shows configured servers in Claude Code, and Cursor’s MCP settings panel shows a green/red status per server. Re-run the prompt only once data is actually flowing.
- Hallucinated metrics. Even with a live connection, a model can invent a number to fill a gap in a query result. Treat every figure as a claim to verify in the source dashboard before you act on it. The insistence on evidence in the prompts above exists precisely so you can check; if an answer omits exact values and timestamps, ask for them.
- The optimization regressed under real load. A change that looks great in staging can fall over at production concurrency: a smaller connection pool that’s fine at 1k req/s starves at 5k. Roll changes to one service or one replica first, watch through a real peak, and keep the previous config one kubectl rollout undo away.
- The model optimized the wrong thing. If you skip the “diagnosis first” step, the assistant will happily tune a symptom. When a fix doesn’t move the top-line metric, go back to the diagnosis prompt and make it re-rank cause vs symptom with fresh data.
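The "fine at 1k req/s, starves at 5k" failure can be estimated before you ship with Little's law: the number of connections in use is roughly the request rate multiplied by how long each request holds a connection. A small Python sketch, assuming every request makes one database call and holds the connection for a hypothetical 10 ms on average:

```python
def connections_needed(requests_per_second, query_ms):
    """Little's law: in-flight queries = arrival rate x time each holds a connection."""
    # Integer ceiling of rps * query_ms / 1000, avoiding float rounding.
    return -(-requests_per_second * query_ms // 1000)


QUERY_MS = 10  # hypothetical average time a request holds a connection

for rps in (1_000, 5_000):
    print(f"{rps} req/s needs about {connections_needed(rps, QUERY_MS)} connections")
```

Under these assumptions a pool of 20 has headroom at 1k req/s (about 10 in use) and is short by more than half at 5k req/s (about 50 needed). Real hold times stretch under contention, so treat the result as a lower bound, not a sizing guarantee.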
What’s Next
- Database Development with AI: deeper Postgres MCP workflows for schema and query work
- MCP Best Practices: scoping credentials and managing server configs across tools
- Incident Response with AI: turning a diagnosis into a postmortem and a guardrail