Monitoring & Logging Patterns

Your API is throwing intermittent 500s. The Grafana dashboards are all green, the error logs don’t line up with the spans, and your PM wants an ETA. The painful truth: observability config is fiddly, easy to copy wrong, and the one trace you need got sampled away an hour ago.

This recipe shows how to use Cursor, Claude Code, and Codex to generate the instrumentation and config that catches these failures — and how to connect the AI directly to Grafana and Sentry so it can read your live telemetry while it debugs.

What You’ll Walk Away With

A copy-paste prompt that auto-instruments an Express or FastAPI service with OpenTelemetry and exports OTLP to your collector
Prompts that generate Prometheus recording rules, SLO burn-rate alerts, and an Alertmanager routing tree you can review line by line
A Logstash pipeline prompt that parses JSON logs, redacts PII, and correlates trace_id so logs link back to traces
The Grafana and Sentry MCP setup that lets the agent query your real dashboards and issues instead of guessing
The failure modes that bite in production — cardinality explosions, dropped spans, sampling that hides the trace you need

Wire Up the Observability MCP Servers First

Before generating config, give the agent eyes on your live telemetry. With the Grafana and Sentry MCP servers connected, the model can read dashboards, query datasources, and pull stack traces — so its suggestions are grounded in your actual data, not generic boilerplate.

In Cursor, add both servers to .cursor/mcp.json. Sentry’s official server is remote (hosted); Grafana ships an official Go binary you run locally and point at your instance:

{
  "mcpServers": {
    "sentry": { "url": "https://mcp.sentry.dev/mcp" },
    "grafana": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_API_KEY": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}

In Claude Code, register the same two servers with claude mcp add. Sentry is the official remote server (authenticate with /mcp after adding). Grafana’s official grafana/mcp-grafana binary runs over stdio:

# Official remote Sentry MCP — auth via /mcp
claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

# Official Grafana Labs MCP (the mcp-grafana binary / mcp/grafana Docker image)
claude mcp add grafana \
  --env GRAFANA_URL=https://grafana.example.com \
  --env GRAFANA_API_KEY=$GRAFANA_SERVICE_ACCOUNT_TOKEN \
  -- mcp-grafana

Codex stores MCP entries in ~/.codex/config.toml. Use --url for the hosted Sentry server and a stdio command for Grafana:

# Hosted Sentry MCP (OAuth: codex mcp login sentry)
codex mcp add sentry --url https://mcp.sentry.dev/mcp

# Grafana Labs stdio server
codex mcp add grafana \
  --env GRAFANA_URL=https://grafana.example.com \
  --env GRAFANA_API_KEY=$GRAFANA_SERVICE_ACCOUNT_TOKEN \
  -- mcp-grafana

The payoff: instead of pasting a stack trace into chat, you ask the agent to fetch it.

Instrument the Service (Where Most Bugs Actually Start)

Dashboards are downstream of instrumentation. If a service emits the wrong attributes or no trace context, no amount of Grafana panels will save you. This is the highest-leverage thing to get right, and it’s identical regardless of which agent you use — the prompt is what matters.

The agent generates the standard SDK bootstrap; the parts worth reviewing are the exporter and the shutdown handler:

// tracing.js — load with: node --require ./tracing.js server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api-service',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    'deployment.environment': process.env.ENVIRONMENT ?? 'production',
  }),
  // Jaeger ingests OTLP natively since v1.35 — no Jaeger exporter needed.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT ?? 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
  })],
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));

Generate Prometheus Rules and SLO Alerts

Prometheus rule files are where AI assistance shines — the syntax is finicky and the PromQL is easy to get subtly wrong. Keep the prompt opinionated so you get RED-method recording rules plus burn-rate alerts, not a wall of every metric imaginable.

The generated recording rules look like this — short, reviewable, and the foundation every alert builds on:

groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
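
To make the burn-rate idea concrete, here is a minimal sketch of the arithmetic behind such an alert. The 99.9% SLO, the 30-day window, and the 1.44% error ratio are assumptions chosen for illustration, and both function names are hypothetical. The input plays the role of the value that job:http_errors:ratio5m would produce.

```javascript
// Burn rate = observed error ratio divided by the error budget (1 - SLO).
// A burn rate of 1 spends the budget exactly over the SLO window.
function burnRate(errorRatio, sloTarget) {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError('sloTarget must be between 0 and 1 (exclusive)');
  }
  return errorRatio / (1 - sloTarget);
}

// Hours until the whole budget is gone if the current burn rate holds.
function hoursToExhaustion(rate, windowDays) {
  return rate > 0 ? (windowDays * 24) / rate : Infinity;
}

// Assumed values: 99.9% SLO, 30-day window, 1.44% of requests failing.
const rate = burnRate(0.0144, 0.999);
const hoursLeft = hoursToExhaustion(rate, 30);
console.log(rate.toFixed(1), hoursLeft.toFixed(0)); // prints: 14.4 50
```

Under these assumptions a sustained 1.44% error ratio would spend a month of budget in about two days, which is the kind of condition a fast-burn alert is meant to page on.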

For Alertmanager routing, hand the agent your escalation policy in plain English and let it produce the tree.
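
When reviewing a generated routing tree, it helps to trace by hand where a given alert would land. The sketch below is a simplified model of that resolution, not Alertmanager code: it assumes equality-only matchers, and the receiver names, labels, and function names are all hypothetical. It follows the documented behavior that an alert stops at the first matching sibling unless that route sets continue: true, and that a child without a receiver inherits its parent’s.

```javascript
// Simplified model of route resolution (equality matchers only).
function matches(route, labels) {
  return Object.entries(route.match ?? {}).every(([k, v]) => labels[k] === v);
}

// Returns the receivers an alert with these labels would be sent to.
function resolveReceivers(route, labels, inherited) {
  if (!matches(route, labels)) return [];
  const receiver = route.receiver ?? inherited; // children inherit the parent's receiver
  const found = [];
  for (const child of route.routes ?? []) {
    const hit = resolveReceivers(child, labels, receiver);
    found.push(...hit);
    // The first matching sibling wins unless it sets continue: true.
    if (hit.length > 0 && !child.continue) break;
  }
  // No child matched, so the current node handles the alert.
  return found.length > 0 ? found : [receiver];
}

// Hypothetical tree: page on critical, then keep routing by team.
const tree = {
  receiver: 'slack-default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pagerduty', continue: true },
    { match: { team: 'payments' }, receiver: 'slack-payments' },
    { match: { severity: 'warning' }, receiver: 'slack-warnings' },
  ],
};

const criticalPayments = resolveReceivers(tree, { severity: 'critical', team: 'payments' });
const warningOnly = resolveReceivers(tree, { severity: 'warning', team: 'search' });
const unlabeled = resolveReceivers(tree, { severity: 'info' });
console.log(criticalPayments, warningOnly, unlabeled);
```

Walking two or three representative alerts through the tree this way is a quick check that route order and continue flags do what the escalation policy says.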

Parse and Correlate Logs

The single highest-value logging move is making logs link to traces. If every log line carries trace_id, you can click from a Grafana span straight to the surrounding logs. Prompt the agent for a Logstash pipeline that does exactly that.

The trace-correlation block is the part to verify — make sure the field names match what your OpenTelemetry SDK emits:

filter {
  if [message] =~ /^\{/ {
    json { source => "message" target => "parsed" }
    mutate {
      rename => {
        "[parsed][trace_id]" => "trace_id"
        "[parsed][span_id]"  => "span_id"
        "[parsed][level]"    => "log_level"
      }
    }
  }
  if [environment] == "production" and [log_level] == "DEBUG" { drop {} }
}
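
The pipeline can only rename fields the service actually writes. As a minimal sketch of the producing side, the hypothetical helper below emits one JSON object per line with top-level trace_id, span_id, and level keys, which are the paths the rename block above expects. The IDs are hard-coded sample values; in a real service they would come from the active span’s context.

```javascript
// Hypothetical helper: builds one JSON log line whose top-level keys match
// the [parsed][trace_id], [parsed][span_id], and [parsed][level] paths.
function formatLogLine({ level, message, traceId, spanId, now = new Date() }) {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    trace_id: traceId,
    span_id: spanId,
  });
}

// Sample IDs in the 32-hex / 16-hex format used by W3C trace context.
const line = formatLogLine({
  level: 'ERROR',
  message: 'payment failed',
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
});
console.log(line);
```

Because the line starts with an opening brace, it passes the pipeline’s JSON check; a logger that nests the IDs under another key would need a matching change in the rename paths.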

Tail-Sampling in the Collector

The default head sampling drops traces blindly — which is how the one trace you needed disappears. Tail sampling decides after a trace finishes, so you can keep 100% of errors and slow requests while sampling the boring stuff at 5%.
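
The decision itself is simple enough to sketch. The function below is a conceptual model, not the collector’s implementation: the span field names and the 1000 ms slow threshold are assumptions, while the 5% baseline follows the figure above. It keeps every trace with an error or a slow span, then keeps a deterministic fraction of the rest based on the trace ID.

```javascript
// Conceptual tail-sampling decision, made once all spans of a trace are in.
function keepTrace(spans, { slowMs = 1000, baselineRatio = 0.05 } = {}) {
  if (spans.length === 0) return false;
  if (spans.some((s) => s.status === 'ERROR')) return true;   // keep every error
  if (spans.some((s) => s.durationMs >= slowMs)) return true; // keep every slow trace
  // Otherwise map the last 8 hex digits of the trace ID onto [0, 1) so the
  // same trace always gets the same answer.
  const bucket = parseInt(spans[0].traceId.slice(-8), 16) / 0x100000000;
  return bucket < baselineRatio;
}

// Sample traces; the IDs are made up so the baseline outcome is predictable.
const highId = '4bf92f3577b34da6a3ce929dffffffff'; // bucket close to 1
const lowId = '4bf92f3577b34da6a3ce929d00000001';  // bucket close to 0

const errorKept = keepTrace([{ traceId: highId, status: 'ERROR', durationMs: 20 }]);
const slowKept = keepTrace([{ traceId: highId, status: 'OK', durationMs: 2500 }]);
const boringDropped = keepTrace([{ traceId: highId, status: 'OK', durationMs: 20 }]);
const boringKept = keepTrace([{ traceId: lowId, status: 'OK', durationMs: 20 }]);
console.log(errorKept, slowKept, boringDropped, boringKept); // prints: true true false true
```

In the collector, the same three rules are typically expressed as policies on the tail_sampling processor (status code, latency, and probabilistic) rather than as application code.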

When This Breaks

Spans never show up. Check for an exporter endpoint or protocol mismatch first: OTLP/HTTP defaults to port 4318 and OTLP/gRPC to 4317, and the exporter’s protocol (gRPC, http/protobuf, or http/json) must be one the receiver accepts on that port. Run the collector’s debug exporter (verbosity: detailed) to confirm spans arrive before blaming the backend.
Prometheus is OOMing or scrapes are slow. You likely have a cardinality explosion. Query topk(20, count by (__name__)({__name__=~".+"})) and prometheus_tsdb_head_series to find the metric with the most series, identify the high-cardinality label on it, then drop that label with a metric_relabel_configs rule. The culprit is often a label the AI added that you didn’t catch in review.
Alerts fire late or in storms. Re-check the rule’s for: duration and Alertmanager’s group_wait and repeat_interval. A burn-rate alert with too long a window won’t page until you’ve already blown the budget; group_wait: 0s on everything turns a cluster outage into hundreds of pages.
Logs don’t correlate to traces. The trace_id field name from your SDK doesn’t match what the pipeline expects, or it’s nested. Log one raw event and confirm the exact path before trusting the rename/mutate block.
The trace you needed got sampled away. Head sampling dropped it. Move error/latency decisions to tail sampling in the collector (above), and keep errors at 100%.
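
For the cardinality item above, a quick estimate shows why a single label can take Prometheus down. The upper bound on series for one metric is the product of the distinct values each label can take. The label names and counts below are made up for illustration, and the helper name is hypothetical.

```javascript
// Worst-case series count for one metric: multiply the number of distinct
// values of every label. Real counts are lower when combinations never occur.
function worstCaseSeries(labelCardinalities) {
  return Object.values(labelCardinalities).reduce((total, n) => total * n, 1);
}

const before = worstCaseSeries({ method: 5, status: 10, route: 50 });
const after = worstCaseSeries({ method: 5, status: 10, route: 50, user_id: 10000 });
console.log(before, after); // prints: 2500 25000000
```

Running this kind of estimate on every label a generated config introduces is a cheap review step, especially for unbounded values such as user IDs or raw URL paths.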

What’s Next

SQL Patterns Recipes: Connect a database MCP server and let the agent profile slow queries against real data.

CI/CD Pipeline Recipes: Wire these alerts and SLOs into deploy gates and automated rollbacks.

Debugging Patterns: The systematic prompt patterns for turning telemetry into a root cause.

MCP Best Practices: How MCP works across Cursor, Claude Code, and Codex, and how to vet a server.