Monitoring and Observability

Your checkout service is throwing intermittent 504s. The traces stop at the gateway span, the dashboards show “everything green,” and your PM wants an ETA. You suspect the payment provider, but you can’t prove it because half your services aren’t instrumented, and the ones that are instrumented export to a Jaeger instance nobody looks at. This is the gap AI coding agents close fastest: turning a half-wired observability stack into traces, metrics, dashboards, and alerts that actually answer “what broke and why.”

What you’ll walk away with

A working OpenTelemetry setup for Node.js using the current @opentelemetry/resources 2.x API (resourceFromAttributes() in place of the Resource class constructor, which 2.x removed).
Three tool-specific workflows (Cursor, Claude Code, Codex) for instrumenting a service and standing up a Grafana stack.
Copy-paste prompts for RED-metric dashboards, signal correlation, and bottleneck hunting that work with minimal editing.
The correct, verified MCP configs for Sentry and Grafana — the two that actually exist and ship from the vendors.
A failure-mode playbook for the three observability problems that waste the most time: high cardinality, missing traces, and alert storms.

The workflow

Step 1: Instrument the service with OpenTelemetry

Start by asking your agent to wire up OTel. The instrumentation workflow is largely identical across all three tools — the difference is how you invoke them and how each handles the multi-file edit. Point the agent at the current 2.x API explicitly; left to its own devices it will reach for the removed Resource class and the deprecated SemanticResourceAttributes.

In Cursor, open Agent mode (Cmd/Ctrl+I) and reference the codebase so it picks up your existing entry point and dependencies:

@codebase Set up OpenTelemetry for our Node.js Express service.

Requirements:
- Use @opentelemetry/sdk-node with getNodeAutoInstrumentations
- Build the Resource with resourceFromAttributes() from @opentelemetry/resources
  (the Resource class is removed in resources 2.x) and ATTR_SERVICE_NAME /
  ATTR_SERVICE_VERSION from @opentelemetry/semantic-conventions
- OTLP/gRPC exporters for traces and metrics
- Disable the fs instrumentation (too noisy)
- Load it via node --require before app code, not inline

Review the diff in the multi-file editor before accepting. Cursor will usually add a new instrumentation.ts and edit package.json (new dependencies plus the start script).

In Claude Code, run the prompt from the repo root so Claude can install packages and create the bootstrap file in one pass:

claude "Set up OpenTelemetry for this Express service. Use @opentelemetry/sdk-node,
resourceFromAttributes() from @opentelemetry/resources (the Resource class is gone in 2.x),
ATTR_SERVICE_NAME from @opentelemetry/semantic-conventions, and OTLP/gRPC exporters.
Disable fs instrumentation. Wire it via --require in the start script."

Claude installs the packages, writes instrumentation.ts, and patches your package.json start script. To capture the result for a CI check, run the same prompt non-interactively with claude -p "<prompt>" --output-format json; the --output-format flag applies to print mode.

Use the Codex CLI (running on GPT-5.6 Sol) from the repo. Sandbox and approvals keep package installs reviewable:

codex --sandbox workspace-write -c approval_policy=on-request \
  "Set up OpenTelemetry for this Express service: @opentelemetry/sdk-node with
  getNodeAutoInstrumentations, resourceFromAttributes() from @opentelemetry/resources
  (Resource class removed in 2.x), ATTR_SERVICE_NAME from @opentelemetry/semantic-conventions,
  OTLP/gRPC exporters, fs instrumentation disabled, loaded via --require."

For a trusted unattended run in a disposable checkout, use codex exec --sandbox workspace-write -c approval_policy=never. This suppresses prompts without expanding the sandbox, so boundary-crossing actions fail instead of asking for approval. That typically includes network access, so install dependencies before the run, and use a least-privilege CI identity. In the Codex IDE extension or Codex Cloud, paste the same instruction as a task; Cloud is handy when you want the change to land as a PR you review later.

The agent should produce something close to this. Note the 2.x API — resourceFromAttributes instead of new Resource(...), and ATTR_-prefixed constants instead of SemanticResourceAttributes:

// instrumentation.ts: compile to CommonJS first, then preload the output, e.g.
//   node --require ./dist/instrumentation.js dist/app.js   (dist/ is an assumed output directory)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { resourceFromAttributes } from '@opentelemetry/resources';
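import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
// deployment.environment.name is not yet a stable attribute, so its constant
// is exported from the incubating entry point rather than the package root.
import { ATTR_DEPLOYMENT_ENVIRONMENT_NAME } from '@opentelemetry/semantic-conventions/incubating';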
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'checkout',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
    exportIntervalMillis: 30_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

For business-level spans, import the API explicitly: trace and metrics come from @opentelemetry/api, which auto-instrumentation does not import for you. The span this helper returns stays open until the caller calls end() on it:

import { trace } from '@opentelemetry/api';

export function startBusinessSpan(operation: string, attrs: Record<string, string>) {
  const tracer = trace.getTracer('business-operations');
  return tracer.startSpan(operation, { attributes: { 'business.operation': operation, ...attrs } });
}
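Because startBusinessSpan hands back an open span, a small wrapper that ends it on every code path helps avoid leaked spans. The following is a minimal sketch, not part of the OpenTelemetry API: withSpan and SpanLike are hypothetical names, SpanLike declares only the three methods the helper calls (the Span type from @opentelemetry/api is assumed to satisfy it structurally), and the status code 2 is assumed to match SpanStatusCode.ERROR.

```typescript
// Minimal structural view of a span: only the methods this helper calls.
// Assumption: the Span type from @opentelemetry/api satisfies this shape.
interface SpanLike {
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): unknown;
  end(): void;
}

// Assumed to equal SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs `work`, records a failure on the span if it throws,
// and always ends the span, whether `work` succeeds or not.
async function withSpan<T>(span: SpanLike, work: () => Promise<T>): Promise<T> {
  try {
    return await work();
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw err; // rethrow so callers still see the original failure
  } finally {
    span.end();
  }
}

// Hypothetical usage (authorizePayment is not defined here):
//   await withSpan(startBusinessSpan('checkout.authorize', { 'order.type': 'standard' }),
//                  () => authorizePayment(order));
```

Keeping the helper dependent on a narrow interface rather than the SDK makes it easy to unit-test with a fake span.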

Step 2: Stand up the stack and dashboards

Once spans and metrics are flowing, get the storage and visualization layer up. A typical stack is OTel Collector to fan out, Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana on top. Ask your agent for the Compose file, then ask it for the dashboard.

Copy-paste prompt for a RED-metrics service health dashboard:

Generate a Grafana dashboard JSON model for an Express service using Prometheus.
Panels:
- Request rate by route (sum by (route) of rate(http_server_duration_count[5m]))
- Error rate as a percentage (5xx / total) per route
- P50/P95/P99 latency from http_server_duration histogram buckets
- An error-budget burn-rate panel against a 99.9% SLO
Use the histogram_quantile pattern for latencies, set a 30s refresh, and template
the service name as a $service variable. Output valid dashboard JSON I can import.
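To sanity-check what the agent returns, it can help to know roughly what the three RED queries should look like. The sketch below builds them as strings. Everything named here is an assumption to adapt: the redQueries helper is hypothetical, and the metric base name, the service and route labels, and the status-code label all depend on how your exporter maps OpenTelemetry metrics into Prometheus.

```typescript
// Label and metric names are assumptions; match them to what your exporter emits.
interface RedQueryOptions {
  metric: string;      // histogram base name, e.g. "http_server_duration"
  statusLabel: string; // label that carries the HTTP status code
  window: string;      // PromQL range, e.g. "5m"
}

function redQueries(opts: RedQueryOptions) {
  const { metric, statusLabel, window } = opts;
  // Grafana substitutes the $service dashboard variable before running the query.
  const sel = 'service="$service"';
  const rateOf = (matchers: string, suffix: string) =>
    `rate(${metric}_${suffix}{${matchers}}[${window}])`;
  return {
    // Rate: requests per second, per route.
    requestRate: `sum by (route) (${rateOf(sel, 'count')})`,
    // Errors: share of 5xx responses per route, as a percentage.
    errorPercent:
      `100 * sum by (route) (${rateOf(`${sel},${statusLabel}=~"5.."`, 'count')})` +
      ` / sum by (route) (${rateOf(sel, 'count')})`,
    // Duration: quantile q from histogram buckets; `le` must survive the aggregation.
    latency: (q: number) =>
      `histogram_quantile(${q}, sum by (le, route) (${rateOf(sel, 'bucket')}))`,
  };
}

const queries = redQueries({
  metric: 'http_server_duration',
  statusLabel: 'http_status_code',
  window: '5m',
});
console.log(queries.requestRate);
console.log(queries.errorPercent);
console.log(queries.latency(0.95));
```

If a generated panel aggregates away the le label before calling histogram_quantile, the latency panel will come back empty, which is a quick thing to check in the dashboard JSON.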

Copy-paste prompt for finding the slow path in a request:

@tempo @grafana Write TraceQL/PromQL queries that surface our worst offenders:
- Endpoints ranked by P95 latency over the last hour
- The single slowest span type inside the checkout trace
- Services with the highest 5xx rate
- Traces with the most spans (a proxy for fan-out complexity)
Explain which query to run first when a 504 alert fires and why.

The Loki config is where agents most often emit dead syntax. Insist on the current TSDB schema — boltdb-shipper, schema: v11, and the shared_store key are legacy and will fail to start on Loki 3.x:

# Loki 3.x — TSDB schema (v13). shared_store is removed; do not add it.
schema_config:
  configs:
    - from: 2024-04-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    s3: s3://endpoint/bucket
    s3forcepathstyle: true

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  # storage backend comes from storage_config; no shared_store here

Step 3: Generate alert rules that fire on the right signal

Alert rules are easy to get subtly wrong — the metric name drifts, the threshold is arbitrary, or the rule references a metric your exporter no longer emits. Ask for SLO-aware rules and pin the metrics to what kube-state-metrics actually produces today.

# Prometheus alert rules. Memory headroom uses kube_pod_container_resource_limits
# from kube-state-metrics rather than cAdvisor's container_spec_memory_limit_bytes.
groups:
  - name: service_health
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels: { severity: critical, team: platform }
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: PodMemoryUsageHigh
        expr: |
          sum(container_memory_working_set_bytes{container!="", pod!=""}) by (namespace, pod, container)
            / sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod, container) > 0.9
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage > 90%"
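An SLO-aware rule replaces an arbitrary threshold with a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The arithmetic can be sketched as follows. The function names are hypothetical, and the 30-day window and 99.9% target are example values.

```typescript
// Burn rate = observed error ratio divided by the error budget (1 - SLO target).
// A burn rate of 1 spends the budget exactly over the full SLO window.
function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget;
  if (budget <= 0) throw new RangeError('SLO target must be below 1');
  return errorRatio / budget;
}

// Fraction of the whole error budget consumed when a burn rate is sustained
// for `hours` out of an SLO window of `windowHours`.
function budgetConsumed(rate: number, hours: number, windowHours: number): number {
  return (rate * hours) / windowHours;
}

// Example: 1.44% of requests failing against a 99.9% SLO over a 30-day window.
const hourlyBurn = burnRate(0.0144, 0.999);                 // roughly 14.4
const spentInOneHour = budgetConsumed(hourlyBurn, 1, 30 * 24); // roughly 0.02
console.log(hourlyBurn.toFixed(1), (spentInOneHour * 100).toFixed(1) + '%');
```

By the same arithmetic, the fixed 5% threshold in HighErrorRate corresponds to a burn rate of about 50 against a 99.9% target, so a budget-based rule would page well before that point.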

Copy-paste prompt for correlating signals during an incident:

Build a Grafana dashboard (or Explore workflow) that ties our three signals together
for incident triage:
- A Tempo panel listing traces over 1s in the selected window
- A Loki panel filtered by the trace_id from the selected trace
- Prometheus panels for CPU, memory, and DB connection-pool saturation over the same window
- A panel listing alerts that fired in that window
Use Grafana's trace-to-logs and exemplars so I can jump trace -> logs in one click.

MCP servers for monitoring

Two MCP servers turn your agent into an on-call assistant that can read live error and dashboard data. Both are vendor-official — be careful, because the npm registry has stub and third-party look-alikes for both.

Sentry

The canonical install is the remote server over HTTP with OAuth — no token to manage, and it works identically in Cursor, Claude Code, and Codex:

claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

If you need a local stdio server (air-gapped, or self-hosted Sentry), use the official package @sentry/mcp-server — not sentry-mcp, which is an unrelated 0.0.1 stub. The env var is SENTRY_ACCESS_TOKEN, and self-hosted instances pass --host with the hostname only:

{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": [
        "-y",
        "@sentry/mcp-server",
        "--access-token=${SENTRY_ACCESS_TOKEN}",
        "--host=sentry.example.com"
      ]
    }
  }
}

Copy-paste prompt once the Sentry MCP is connected:

Using the Sentry MCP: pull the top 5 unresolved issues for the checkout project in the
last 24h, ranked by users affected. For the worst one, summarize the stack trace, tell me
which release introduced it, and draft a fix as a diff against the offending file.

Grafana

The official server is grafana/mcp-grafana, a Go binary distributed via uvx (or Docker/Helm) — not the third-party @leval/mcp-grafana npm wrapper. Authenticate with a service-account token (GRAFANA_SERVICE_ACCOUNT_TOKEN); Grafana API keys are deprecated:

{
  "mcpServers": {
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}

When this breaks

Three failure modes account for most wasted hours. Hand the symptom to your agent with the specific query it should run.

High cardinality: Prometheus slows to a crawl and memory balloons, usually because a label like user_id or a raw URL is exploding the time-series count.

Analyze our Prometheus instance for high-cardinality metrics:
- Run topk(10, count by (__name__)({__name__=~".+"})) to find the worst metrics
- Identify which labels are unbounded (user IDs, request IDs, raw paths)
- Propose recording rules to pre-aggregate, and relabel_config to drop the offending labels
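Dropping labels at scrape time treats the symptom; the source-side fix is to keep unbounded values out of labels in the first place. A minimal sketch, assuming the raw request path is what ends up in a route label: normalizeRoute is a hypothetical helper, and the two patterns (numeric IDs and UUIDs) are illustrative rather than exhaustive.

```typescript
// Replace unbounded path segments with fixed placeholders so the `route`
// label has a bounded set of values. The patterns here are illustrative.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

function normalizeRoute(rawUrl: string): string {
  // Drop the query string and fragment; they never belong in a label.
  const path = rawUrl.split(/[?#]/, 1)[0] ?? '';
  return path
    .split('/')
    .map((segment) => {
      if (NUMERIC_SEGMENT.test(segment)) return ':id';
      if (UUID_SEGMENT.test(segment)) return ':uuid';
      return segment;
    })
    .join('/');
}

console.log(normalizeRoute('/users/12345/orders/987?expand=items'));
console.log(normalizeRoute('/carts/3f2b8c1e-9a4d-4e6f-8b7a-1c2d3e4f5a6b/items'));
```

In an Express handler, the matched route template (req.route.path) is usually a better label source than a regex over the URL; the normalizer is a fallback for paths that never matched a route.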

Missing traces: spans appear for some services but vanish at a boundary, almost always because of broken context propagation or an over-aggressive sampler.

Debug why traces stop at our API gateway:
- Confirm W3C traceparent headers are forwarded on outbound calls
- Check the sampler config (is a head sampler dropping child spans, or is tail-based sampling in the Collector dropping the whole trace?)
- Verify the OTLP exporter endpoint and TLS for the downstream service
- Show me how to force-sample one request end-to-end to prove the path
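When checking propagation by hand, it helps to decode the traceparent header the gateway actually forwards. The sketch below parses the four-field layout that W3C Trace Context defines for version 00 and reports the sampled flag. parseTraceParent is a hypothetical helper, not an OpenTelemetry API, and it does not attempt forward compatibility with future header versions.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Parses a W3C Trace Context `traceparent` header; returns null when malformed.
// Layout: version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex).
function parseTraceParent(header: string): TraceParent | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version = '', traceId = '', parentId = '', flags = ''] = match;
  if (version === 'ff') return null; // version ff is forbidden by the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  // The least significant bit of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentId, sampled };
}

// Example header taken from the W3C Trace Context specification.
console.log(parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'));
```

If the trace ID changes between the gateway's inbound and outbound requests, propagation is broken. If the ID is stable but the sampled flag is off, the sampler is the more likely culprit.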

Alert storms: one incident fires fifty alerts and PagerDuty melts. The fix is grouping and inhibition, not more thresholds.

Our Alertmanager floods during incidents. Rewrite the routing config to:
- Group by alertname, cluster, and service with sane group_wait/group_interval
- Add an inhibit_rule so a critical alert silences related warnings
- Route critical alerts to PagerDuty at any hour, and warnings to Slack during business hours only
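To reason about what the rewritten config should do, here is a simplified model of the two mechanisms in the prompt. It is an illustration of the semantics, not Alertmanager's implementation: groupAlerts mimics group_by, and inhibit mimics one inhibit rule in which critical alerts suppress warnings that share the listed labels. The Alert type and both function names are hypothetical.

```typescript
interface Alert {
  labels: Record<string, string>;
}

// Groups alerts by the values of the `groupBy` labels, like Alertmanager's group_by.
// Each group would produce one notification instead of one per alert.
function groupAlerts(alerts: Alert[], groupBy: string[]): Map<string, Alert[]> {
  const groups = new Map<string, Alert[]>();
  for (const alert of alerts) {
    const key = groupBy.map((l) => `${l}=${alert.labels[l] ?? ''}`).join(',');
    const bucket = groups.get(key) ?? [];
    bucket.push(alert);
    groups.set(key, bucket);
  }
  return groups;
}

// Drops every warning for which some critical alert has the same value
// on all labels listed in `equal`.
function inhibit(alerts: Alert[], equal: string[]): Alert[] {
  const sameOn = (a: Alert, b: Alert) => equal.every((l) => a.labels[l] === b.labels[l]);
  const criticals = alerts.filter((a) => a.labels['severity'] === 'critical');
  return alerts.filter(
    (a) => a.labels['severity'] !== 'warning' || !criticals.some((c) => sameOn(c, a)),
  );
}
```

With equal set to cluster and service, a critical HighErrorRate on checkout hides the checkout memory warnings while leaving warnings for unrelated services visible.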

What’s next

CI/CD Pipelines — wire dashboard and alert-rule validation into your deploy pipeline.
Incident Response — use the signals you just instrumented to run a faster postmortem.
Debugging Patterns — the trace-to-logs workflow when an alert finally fires.