Monitoring & Logging Patterns

Your API is throwing intermittent 500s. The Grafana dashboards are all green, the error logs don’t line up with the spans, and your PM wants an ETA. The painful truth: observability config is fiddly, easy to copy wrong, and the one trace you need got sampled away an hour ago.

This recipe shows how to use Cursor, Claude Code, and Codex to generate the instrumentation and config that catches these failures — and how to connect the AI directly to Grafana and Sentry so it can read your live telemetry while it debugs.

  • A copy-paste prompt that auto-instruments an Express or FastAPI service with OpenTelemetry and exports OTLP to your collector
  • Prompts that generate Prometheus recording rules, SLO burn-rate alerts, and an Alertmanager routing tree you can review line by line
  • A Logstash pipeline prompt that parses JSON logs, redacts PII, and correlates trace_id so logs link back to traces
  • The Grafana and Sentry MCP setup that lets the agent query your real dashboards and issues instead of guessing
  • The failure modes that bite in production — cardinality explosions, dropped spans, sampling that hides the trace you need

Wire Up the Observability MCP Servers First

Before generating config, give the agent eyes on your live telemetry. With the Grafana and Sentry MCP servers connected, the model can read dashboards, query datasources, and pull stack traces — so its suggestions are grounded in your actual data, not generic boilerplate.

Add to .cursor/mcp.json. Sentry’s official server is remote (hosted); Grafana ships an official Go binary you run locally and point at your instance:

{
  "mcpServers": {
    "sentry": { "url": "https://mcp.sentry.dev/mcp" },
    "grafana": {
      "command": "mcp-grafana",
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_API_KEY": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}

The payoff: instead of pasting a stack trace into chat, you ask the agent to fetch it.

Instrument the Service (Where Most Bugs Actually Start)

Dashboards are downstream of instrumentation. If a service emits the wrong attributes or no trace context, no amount of Grafana panels will save you. This is the highest-leverage thing to get right, and it’s identical regardless of which agent you use — the prompt is what matters.

What the agent generates is the standard SDK bootstrap — the part worth reviewing is the exporter and shutdown:

// tracing.js — load with: node --require ./tracing.js server.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-proto');
const { resourceFromAttributes } = require('@opentelemetry/resources');
const {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} = require('@opentelemetry/semantic-conventions');

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'api-service',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    'deployment.environment': process.env.ENVIRONMENT ?? 'production',
  }),
  // Jaeger ingests OTLP natively since v1.35 — no Jaeger exporter needed.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT ?? 'http://otel-collector:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-fs': { enabled: false },
  })],
});

sdk.start();

// Flush buffered spans before the process exits.
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));

Prometheus rule files are where AI assistance shines — the syntax is finicky and the PromQL is easy to get subtly wrong. Keep the prompt opinionated so you get RED-method recording rules plus burn-rate alerts, not a wall of every metric imaginable.

The generated recording rules look like this — short, reviewable, and the foundation every alert builds on:

rules/api.yml
groups:
  - name: api_recording
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))
      - record: job:http_latency:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
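
The burn-rate alerts mentioned earlier compare an error ratio against a multiple of the SLO's error budget. Here is a minimal sketch of that arithmetic. It assumes a 30-day SLO period and the commonly used pairing of "2% of the budget in 1 hour" and "5% in 6 hours". The function names and the 99.9% target are illustrative and do not come from the recipe.

```javascript
// Burn rate = how many times faster than "exactly on budget" errors are spent.
// Function names, the 99.9% target, and the 30-day period are assumptions.

// Burn rate at which `budgetFraction` of the error budget is consumed
// within `windowHours`, for an SLO measured over `periodHours`.
function burnRateThreshold(budgetFraction, windowHours, periodHours = 30 * 24) {
  return (budgetFraction * periodHours) / windowHours;
}

// Error ratio that corresponds to a given burn rate for an SLO target.
function errorRatioThreshold(sloTarget, burnRate) {
  return (1 - sloTarget) * burnRate;
}

const fast = burnRateThreshold(0.02, 1); // 2% of the budget in 1 hour
const slow = burnRateThreshold(0.05, 6); // 5% of the budget in 6 hours

console.log(fast.toFixed(1), slow.toFixed(1));             // 14.4 6.0
console.log(errorRatioThreshold(0.999, fast).toFixed(4));  // 0.0144
```

Under these assumptions, a fast-burn alert would fire when the 1-hour error ratio exceeds roughly 1.44%. The recording rule above uses a 5-minute window, so a longer-window counterpart would be needed for each alert window.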

For Alertmanager routing, hand the agent your escalation policy in plain English and let it produce the tree.
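
When reviewing a generated tree, it helps to have a mental model of how an alert walks it. The sketch below is a simplified model, not Alertmanager's implementation: it uses exact-match label maps instead of the real matcher syntax, and the receiver names and labels are hypothetical.

```javascript
// Simplified model of routing-tree evaluation (exact-match labels only).
// Receiver names and label values are hypothetical.
function matches(route, labels) {
  return Object.entries(route.match ?? {}).every(([k, v]) => labels[k] === v);
}

// Returns the receivers an alert reaches. A matching child ends the search
// among its siblings unless it sets `continue: true`. If no child matches,
// the current node's receiver handles the alert.
function resolveReceivers(route, labels, inherited) {
  const receiver = route.receiver ?? inherited;
  const hits = [];
  for (const child of route.routes ?? []) {
    if (!matches(child, labels)) continue;
    hits.push(...resolveReceivers(child, labels, receiver));
    if (!child.continue) break;
  }
  return hits.length > 0 ? hits : [receiver];
}

const tree = {
  receiver: 'slack-default',
  routes: [
    { match: { severity: 'critical' }, receiver: 'pagerduty-oncall', continue: true },
    { match: { team: 'payments' }, receiver: 'slack-payments' },
  ],
};

console.log(resolveReceivers(tree, { severity: 'critical', team: 'payments' }));
// [ 'pagerduty-oncall', 'slack-payments' ]
console.log(resolveReceivers(tree, { severity: 'warning', team: 'search' }));
// [ 'slack-default' ]
```

The thing to check in review is sibling order and `continue`: without `continue: true` on the first route, the critical payments alert would page on-call but never reach the team channel.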

The single highest-value logging move is making logs link to traces. If every log line carries trace_id, you click from a Grafana span straight to the surrounding logs. The prompt below produces a Logstash pipeline that does exactly that.

The trace-correlation block is the part to verify — make sure the field names match what your OpenTelemetry SDK emits:

filter {
  if [message] =~ /^\{/ {
    json { source => "message" target => "parsed" }
    mutate {
      rename => {
        "[parsed][trace_id]" => "trace_id"
        "[parsed][span_id]" => "span_id"
        "[parsed][level]" => "log_level"
      }
    }
  }
  if [environment] == "production" and [log_level] == "DEBUG" { drop {} }
}
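
For the rename block to find anything, the service has to emit JSON lines with top-level trace_id, span_id, and level keys. The sketch below shows one way to produce that shape, assuming the IDs are taken from an incoming W3C traceparent header. parseTraceparent and logLine are hypothetical helper names. In a service instrumented with the OpenTelemetry SDK you would more likely read the IDs from the active span context, and span_id here is the caller's span rather than the current one.

```javascript
// Emits one JSON object per line with top-level trace_id / span_id / level,
// the shape the pipeline's rename block looks for under [parsed].
// parseTraceparent and logLine are hypothetical helpers for illustration.

// W3C Trace Context header: version-traceid-parentid-flags (lowercase hex).
const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const m = TRACEPARENT.exec(header ?? '');
  if (!m) return null;
  const [, , traceId, spanId] = m;
  // All-zero IDs are invalid in the Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { trace_id: traceId, span_id: spanId };
}

function logLine(level, message, traceparent) {
  const ctx = parseTraceparent(traceparent) ?? {};
  return JSON.stringify({ level, message, ...ctx });
}

console.log(
  logLine('ERROR', 'upstream timeout',
    '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')
);
```

If the header is missing or malformed, the line is still emitted, just without the trace fields, so the pipeline's rename is a no-op for that event.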

Head sampling decides at the start of a trace, before it knows whether the request will fail or run slow, which is how the one trace you needed disappears. Tail sampling decides after a trace finishes, so you can keep 100% of errors and slow requests while sampling the boring stuff at 5%.
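
The decision itself is simple once all spans of a trace are available. The sketch below illustrates the logic only. In practice it belongs in the collector (for example, the tail_sampling processor in the contrib distribution) rather than in application code. The span field names, the 1-second latency threshold, and the helper names are assumptions.

```javascript
// Sketch of a tail-sampling decision, made once a trace's spans are all in.
// Span field names, SLOW_MS, and the helper names are illustrative assumptions.
const SLOW_MS = 1000;
const BASELINE_RATE = 0.05;

// Derive a number in [0, 1) from the trace ID so that every collector
// instance makes the same keep/drop decision for the same trace.
function traceIdFraction(traceId) {
  return parseInt(traceId.slice(-8), 16) / 0x100000000;
}

function keepTrace(spans) {
  if (spans.length === 0) return false;
  if (spans.some((s) => s.status === 'ERROR')) return true;     // keep all errors
  if (spans.some((s) => s.durationMs >= SLOW_MS)) return true;  // keep slow traces
  return traceIdFraction(spans[0].traceId) < BASELINE_RATE;     // sample the rest
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
console.log(keepTrace([{ traceId, status: 'OK', durationMs: 42 }]));    // false
console.log(keepTrace([{ traceId, status: 'ERROR', durationMs: 42 }])); // true
```

Keying the baseline sample on the trace ID rather than on a random number keeps the decision reproducible, which matters when several collector replicas see spans from the same trace.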

  1. Spans never show up. Check for an exporter endpoint or protocol mismatch first: OTLP/HTTP listens on 4318, OTLP/gRPC on 4317, and the exporter's transport and encoding (gRPC, http/protobuf, or http/json) must be one the receiver accepts. Run the collector’s debug exporter (verbosity: detailed) to confirm spans arrive before blaming the backend.

  2. Prometheus is OOMing or scrapes are slow. You have a cardinality explosion. Query topk(20, count by (__name__)({__name__=~".+"})) and prometheus_tsdb_head_series to find the offender, then drop the culprit label with a metric_relabel_configs rule. This is almost always a label the AI added that you didn’t catch in review.

  3. Alerts fire late or in storms. Re-check the rule's for: duration and Alertmanager's group_wait and repeat_interval. A burn-rate alert with too long a window won’t page until you’ve already blown the budget; group_wait: 0s on everything turns a cluster outage into hundreds of pages.

  4. Logs don’t correlate to traces. The trace_id field name from your SDK doesn’t match what the pipeline expects, or it’s nested. Log one raw event and confirm the exact path before trusting the rename/mutate block.

  5. The trace you needed got sampled away. Head sampling dropped it. Move error/latency decisions to tail sampling in the collector (above), and keep errors at 100%.
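
For the cardinality problem in item 2, one frequent culprit (an assumption here, not something the list states) is a raw URL path used as a metric label, so every user or order ID creates a new series. A minimal sketch of collapsing those segments before the label is set follows. The patterns and the normalizeRoute name are illustrative and would need adjusting to your own ID formats.

```javascript
// Collapses high-cardinality path segments before they become a metric label.
// The patterns and the function name are illustrative assumptions.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;
const LONG_HEX = /^[0-9a-f]{16,}$/i;

function normalizeRoute(path) {
  const [pathname] = path.split('?'); // keep query strings out of labels
  return pathname
    .split('/')
    .map((seg) =>
      UUID.test(seg) || NUMERIC.test(seg) || LONG_HEX.test(seg) ? ':id' : seg
    )
    .join('/');
}

console.log(
  normalizeRoute('/users/12345/orders/9f1c2b7e-3d4a-4c5b-8e6f-0a1b2c3d4e5f?expand=items')
);
// /users/:id/orders/:id
```

Where the framework exposes the matched route template, using that directly is usually more reliable than pattern matching. This sketch is a fallback for cases where it does not.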