Monitoring & Logging Patterns
Your API is throwing intermittent 500s. The Grafana dashboards are all green, the error logs don’t line up with the spans, and your PM wants an ETA. The painful truth: observability config is fiddly, easy to copy wrong, and the one trace you need got sampled away an hour ago.
This recipe shows how to use Cursor, Claude Code, and Codex to generate the instrumentation and config that catches these failures — and how to connect the AI directly to Grafana and Sentry so it can read your live telemetry while it debugs.
What You’ll Walk Away With
- A copy-paste prompt that auto-instruments an Express or FastAPI service with OpenTelemetry and exports OTLP to your collector
- Prompts that generate Prometheus recording rules, SLO burn-rate alerts, and an AlertManager routing tree you can review line by line
- A Logstash pipeline prompt that parses JSON logs, redacts PII, and correlates trace_id so logs link back to traces
- The Grafana and Sentry MCP setup that lets the agent query your real dashboards and issues instead of guessing
- The failure modes that bite in production — cardinality explosions, dropped spans, sampling that hides the trace you need
Wire Up the Observability MCP Servers First
Before generating config, give the agent eyes on your live telemetry. With the Grafana and Sentry MCP servers connected, the model can read dashboards, query datasources, and pull stack traces, so its suggestions are grounded in your actual data, not generic boilerplate.
For Cursor, add both servers to .cursor/mcp.json. Sentry’s official server is remote (hosted); Grafana ships an official Go binary you run locally and point at your instance:

{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp"
    },
    "grafana": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_API_KEY": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}

For Claude Code, register the same two servers from the CLI. Sentry is the official remote server (authenticate with /mcp after adding). Grafana’s official grafana/mcp-grafana binary runs over stdio:

# Official remote Sentry MCP (authenticate with /mcp)
claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

# Official Grafana Labs MCP (the mcp-grafana binary / mcp/grafana Docker image)
claude mcp add grafana \
  --env GRAFANA_URL=https://grafana.example.com \
  --env GRAFANA_API_KEY="$GRAFANA_SERVICE_ACCOUNT_TOKEN" \
  -- mcp-grafana

Codex stores MCP entries in ~/.codex/config.toml. Use --url for the hosted Sentry server and a stdio command for Grafana:

# Hosted Sentry MCP (OAuth: codex mcp login sentry)
codex mcp add sentry --url https://mcp.sentry.dev/mcp

# Grafana Labs stdio server
codex mcp add grafana \
  --env GRAFANA_URL=https://grafana.example.com \
  --env GRAFANA_API_KEY="$GRAFANA_SERVICE_ACCOUNT_TOKEN" \
  -- mcp-grafana

The payoff: instead of pasting a stack trace into chat, you ask the agent to fetch it.
Instrument the Service (Where Most Bugs Actually Start)
Dashboards are downstream of instrumentation. If a service emits the wrong attributes or no trace context, no amount of Grafana panels will save you. This is the highest-leverage thing to get right, and it’s identical regardless of which agent you use; the prompt is what matters.
What the agent generates is the standard SDK bootstrap — the part worth reviewing is the exporter and shutdown:

// tracing.js: load with `node --require ./tracing.js server.js`
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api-service',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    'deployment.environment': process.env.ENVIRONMENT ?? 'production',
  }),
  // Jaeger ingests OTLP natively since v1.35; no Jaeger exporter needed.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT ?? 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs instrumentation is noisy; disable it.
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));

Generate Prometheus Rules and SLO Alerts
Prometheus rule files are where AI assistance shines: the syntax is finicky and the PromQL is easy to get subtly wrong. Keep the prompt opinionated so you get RED-method recording rules plus burn-rate alerts, not a wall of every metric imaginable.
The generated recording rules look like this — short, reviewable, and the foundation every alert builds on:

groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

For AlertManager routing, hand the agent your escalation policy in plain English and let it produce the tree:
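
As an illustration of what such a tree can look like, here is a minimal sketch. The escalation policy it encodes (critical alerts page the on-call, everything else goes to Slack), the receiver names, the Slack channel, and the secret file paths are all assumptions made for the example. It also assumes your alert rules attach a severity label.

```yaml
# alertmanager.yml (sketch; receiver names, channel, and file paths are hypothetical)
route:
  receiver: slack-default            # fallback when no child route matches
  group_by: ['alertname', 'job']     # one notification per alert name and job
  group_wait: 30s                    # wait briefly so related alerts batch together
  group_interval: 5m                 # minimum gap between updates for a group
  repeat_interval: 4h                # re-notify while the alert keeps firing
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
      repeat_interval: 1h            # page again sooner for critical alerts
    - matchers:
        - severity="warning"
      receiver: slack-default

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
        api_url_file: /etc/alertmanager/secrets/slack_webhook
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key_file: /etc/alertmanager/secrets/pagerduty_key
```

The values to read most carefully in review are group_wait, group_interval, and repeat_interval, since they decide how quickly and how often a page goes out.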
Parse and Correlate Logs
The single highest-value logging move is making logs link to traces. If every log line carries trace_id, you click from a Grafana span straight to the surrounding logs. The prompt below produces a Logstash pipeline that does exactly that.
The trace-correlation block is the part to verify — make sure the field names match what your OpenTelemetry SDK emits:

filter {
  # Only attempt JSON parsing on lines that look like JSON objects.
  if [message] =~ /^\{/ {
    json {
      source => "message"
      target => "parsed"
    }
    # Promote the correlation fields to the top level.
    mutate {
      rename => {
        "[parsed][trace_id]" => "trace_id"
        "[parsed][span_id]" => "span_id"
        "[parsed][level]" => "log_level"
      }
    }
  }
  # Drop debug noise in production.
  if [environment] == "production" and [log_level] == "DEBUG" {
    drop {}
  }
}

Tail-Sampling in the Collector
The default head sampling drops traces blindly, which is how the one trace you needed disappears. Tail sampling decides after a trace finishes, so you can keep 100% of errors and slow requests while sampling the boring stuff at 5%.
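
A minimal sketch of that policy as a Collector config follows. It assumes the Collector contrib distribution (which ships the tail_sampling processor) and a Jaeger backend reachable at jaeger:4317; the 1-second latency threshold, the decision_wait, and the num_traces values are illustrative guesses to tune against your own traffic.

```yaml
# otel-collector.yaml (sketch; backend endpoint and thresholds are assumptions)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s        # how long to buffer a trace before deciding
    num_traces: 50000         # traces held in memory while waiting
    policies:
      # A trace is kept if any one policy decides to sample it.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 1000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp:
    endpoint: jaeger:4317     # hypothetical backend address
    tls:
      insecure: true          # plaintext inside the cluster; not for untrusted networks

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling]
      exporters: [otlp]
```

Tail sampling only works if every span of a trace reaches the same collector instance, and the buffer costs memory in proportion to num_traces, so size it before rolling this out.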
When This Breaks
- Spans never show up. Check for an exporter endpoint or protocol mismatch first: OTLP/HTTP is 4318, OTLP/gRPC is 4317, and the proto vs http/json exporter must match the receiver. Run the collector’s debug exporter (verbosity: detailed) to confirm spans arrive before blaming the backend.
- Prometheus is OOMing or scrapes are slow. You have a cardinality explosion. Query topk(20, count by (__name__)({__name__=~".+"})) and prometheus_tsdb_head_series to find the offender, then drop the culprit label with a metric_relabel_configs rule. This is almost always a label the AI added that you didn’t catch in review.
- Alerts fire late or in storms. Re-check for:, group_wait, and repeat_interval. A burn-rate alert with too long a window won’t page until you’ve already blown the budget; group_wait: 0s on everything turns a cluster outage into hundreds of pages.
- Logs don’t correlate to traces. The trace_id field name from your SDK doesn’t match what the pipeline expects, or it’s nested. Log one raw event and confirm the exact path before trusting the rename/mutate block.
- The trace you needed got sampled away. Head sampling dropped it. Move error/latency decisions to tail sampling in the collector (above), and keep errors at 100%.
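
For the cardinality case, a metric_relabel_configs rule might look like the sketch below. The job name, the scrape target, and the offending label (user_id) are hypothetical; substitute whatever the topk query actually surfaces.

```yaml
# prometheus.yml fragment (sketch; job, target, and label name are assumptions)
scrape_configs:
  - job_name: api-service
    static_configs:
      - targets: ['api-service:9464']
    metric_relabel_configs:
      # Remove the high-cardinality label from every scraped series before ingestion.
      - action: labeldrop
        regex: user_id
```

Treat this as a stopgap. If two series differ only by the dropped label they collapse into duplicates, so the durable fix is to remove the label in the instrumentation itself.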