Monitoring and Observability
Your checkout service is throwing intermittent 504s. The traces stop at the gateway span, the dashboards show “everything green,” and your PM wants an ETA. You suspect the payment provider, but you can’t prove it because half your services aren’t instrumented and the ones that are export to a Jaeger instance nobody looks at. This is the gap AI coding agents close fastest: turning a half-wired observability stack into traces, metrics, dashboards, and alerts that actually answer “what broke and why.”
What you’ll walk away with
- A working OpenTelemetry setup for Node.js using the current @opentelemetry/resources 2.x API (no deprecated Resource class).
- Three tool-specific workflows (Cursor, Claude Code, Codex) for instrumenting a service and standing up a Grafana stack.
- Copy-paste prompts for RED-metric dashboards, signal correlation, and bottleneck hunting that work with minimal editing.
- The correct, verified MCP configs for Sentry and Grafana — the two that actually exist and ship from the vendors.
- A failure-mode playbook for the three observability problems that waste the most time: high cardinality, missing traces, and alert storms.
The workflow
Step 1: Instrument the service with OpenTelemetry
Start by asking your agent to wire up OTel. The instrumentation workflow is largely identical across all three tools; the difference is how you invoke them and how each handles the multi-file edit. Point the agent at the current 2.x API explicitly, because left to its own devices it will reach for the removed Resource class and the deprecated SemanticResourceAttributes.
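Because agents tend to drift back to the 1.x API, a cheap text check on the generated files can catch the regression before review. The following is a minimal sketch, not part of any OpenTelemetry package: `findLegacyOtelUsage` and its pattern list are hypothetical names, and the regexes only cover the two legacy identifiers named above.

```typescript
// Hypothetical helper: flags OpenTelemetry 1.x patterns in a source string.
// The pattern list is an assumption covering only the identifiers discussed here.
const LEGACY_OTEL_PATTERNS: ReadonlyArray<{ label: string; pattern: RegExp }> = [
  { label: "new Resource(...)", pattern: /\bnew\s+Resource\s*\(/ },
  { label: "SemanticResourceAttributes", pattern: /\bSemanticResourceAttributes\b/ },
];

// Returns the label of every legacy pattern found in the given source text.
function findLegacyOtelUsage(source: string): string[] {
  return LEGACY_OTEL_PATTERNS
    .filter(({ pattern }) => pattern.test(source))
    .map(({ label }) => label);
}
```

Run it over the files the agent touched and fail the check if the returned list is non-empty.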
In Cursor, open Agent mode (Cmd/Ctrl+I) and reference the codebase so it picks up your existing entry point and dependencies:
    @codebase Set up OpenTelemetry for our Node.js Express service.

    Requirements:
    - Use @opentelemetry/sdk-node with getNodeAutoInstrumentations
    - Build the Resource with resourceFromAttributes() from @opentelemetry/resources (the Resource class is removed in resources 2.x) and ATTR_SERVICE_NAME / ATTR_SERVICE_VERSION from @opentelemetry/semantic-conventions
    - OTLP/gRPC exporters for traces and metrics
    - Disable the fs instrumentation (too noisy)
    - Load it via node --require before app code, not inline

Review the diff in the multi-file editor before accepting. Cursor will usually touch package.json, a new instrumentation.ts, and your start script.
In Claude Code, run the prompt from the repo root so Claude can install packages and create the bootstrap file in one pass:
    claude "Set up OpenTelemetry for this Express service. Use @opentelemetry/sdk-node,
    resourceFromAttributes() from @opentelemetry/resources (the Resource class is gone in 2.x),
    ATTR_SERVICE_NAME from @opentelemetry/semantic-conventions, and OTLP/gRPC exporters.
    Disable fs instrumentation. Wire it via --require in the start script."

Claude installs the packages, writes instrumentation.ts, and patches your package.json start script. For a CI check, run it non-interactively with claude -p and add --output-format json (the output format flag only applies in print mode) to get machine-readable output.
Use the Codex CLI (running on GPT-5.5) from the repo. Sandbox and approvals keep package installs reviewable:
    codex --ask-for-approval on-request \
      "Set up OpenTelemetry for this Express service: @opentelemetry/sdk-node with
      getNodeAutoInstrumentations, resourceFromAttributes() from @opentelemetry/resources
      (Resource class removed in 2.x), ATTR_SERVICE_NAME from @opentelemetry/semantic-conventions,
      OTLP/gRPC exporters, fs instrumentation disabled, loaded via --require."

For a hands-off run inside a clean checkout, codex --full-auto lets it install and edit without prompting. In the Codex IDE extension or Codex Cloud, paste the same instruction as a task. Cloud is handy when you want the change to land as a PR you review later.
The agent should produce something close to this. Note the 2.x API — resourceFromAttributes instead of new Resource(...), and ATTR_-prefixed constants instead of SemanticResourceAttributes:
    // instrumentation.ts: compile to JavaScript first, then load the output via
    //   node --require ./instrumentation.js app.js
    import { NodeSDK } from '@opentelemetry/sdk-node';
    import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
    import { resourceFromAttributes } from '@opentelemetry/resources';
    import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
    import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
    import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
    import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

    // deployment.environment.name is still an incubating convention, so it is exported from
    // '@opentelemetry/semantic-conventions/incubating' rather than the stable entry point.
    // Defining the key locally avoids depending on the incubating entry point.
    const ATTR_DEPLOYMENT_ENVIRONMENT_NAME = 'deployment.environment.name';

    const sdk = new NodeSDK({
      resource: resourceFromAttributes({
        [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'checkout',
        [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
        [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: process.env.NODE_ENV ?? 'development',
      }),
      traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
      metricReader: new PeriodicExportingMetricReader({
        exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
        exportIntervalMillis: 30_000,
      }),
      instrumentations: [
        getNodeAutoInstrumentations({
          // The fs instrumentation emits a span per file operation, which is too noisy here.
          '@opentelemetry/instrumentation-fs': { enabled: false },
        }),
      ],
    });

    sdk.start();

For business-level spans, import the API explicitly (trace and metrics come from @opentelemetry/api, which auto-instrumentation does not import for you):
    import { trace } from '@opentelemetry/api';

    // Starts a span for a business operation. The caller must call span.end() on the result.
    export function startBusinessSpan(operation: string, attrs: Record<string, string>) {
      const tracer = trace.getTracer('business-operations');
      return tracer.startSpan(operation, {
        attributes: { 'business.operation': operation, ...attrs },
      });
    }

Step 2: Stand up the stack and dashboards
Once spans and metrics are flowing, get the storage and visualization layer up. A typical stack is an OTel Collector to fan out, Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana on top. Ask your agent for the Compose file, then ask it for the dashboard.
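When you ask for the dashboard, it helps to know what a RED panel set (rate, errors, duration) should compute so you can sanity-check the generated queries. The following is a rough sketch under stated assumptions: the `RequestSample` shape, the `summarizeRed` name, the 5xx-as-error rule, and the nearest-rank p95 are illustrative choices, not what Prometheus or Grafana run internally.

```typescript
// Illustrative request record; field names are assumptions for this sketch.
interface RequestSample {
  status: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSecond: number;
  errorRatio: number;
  p95DurationMs: number;
}

// Summarizes one window of requests into the three RED signals.
function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  if (samples.length === 0 || windowSeconds <= 0) {
    return { ratePerSecond: 0, errorRatio: 0, p95DurationMs: 0 };
  }
  const errorCount = samples.filter((s) => s.status >= 500).length;
  const sortedDurations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
  const rank = Math.ceil((95 * sortedDurations.length) / 100);
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: errorCount / samples.length,
    p95DurationMs: sortedDurations[rank - 1] ?? 0,
  };
}
```

In a real dashboard these numbers come from queries over counters and histograms, not from raw samples. The sketch is only a reference for what each panel should mean.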
The Loki config is where agents most often emit dead syntax. Insist on the current TSDB schema — boltdb-shipper, schema: v11, and the shared_store key are legacy and will fail to start on Loki 3.x:
    # Loki 3.x: TSDB schema (v13). shared_store is removed; do not add it.
    schema_config:
      configs:
        - from: 2024-04-01
          store: tsdb
          object_store: s3
          schema: v13
          index:
            prefix: index_
            period: 24h

    storage_config:
      tsdb_shipper:
        active_index_directory: /loki/tsdb-index
        cache_location: /loki/tsdb-cache
      aws:
        s3: s3://endpoint/bucket
        s3forcepathstyle: true

    compactor:
      working_directory: /loki/compactor
      compaction_interval: 10m
      # storage backend comes from storage_config; no shared_store here

Step 3: Generate alert rules that fire on the right signal
Alert rules are easy to get subtly wrong: the metric name drifts, the threshold is arbitrary, or the rule references a metric your exporter no longer emits. Ask for SLO-aware rules and pin the metrics to what kube-state-metrics actually produces today.
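One way to make a threshold less arbitrary is to derive it from the SLO's error budget instead of picking a round number. The sketch below shows the arithmetic only; the function names and the example figures (a 99.9% target, a 30-day period, a burn rate of 14.4) are assumptions for illustration, not values taken from this stack.

```typescript
// Error ratio at which a given burn rate is reached for an availability SLO.
// A burn rate of 1 spends the budget exactly over the SLO period.
function errorRatioThreshold(sloTarget: number, burnRate: number): number {
  if (sloTarget <= 0 || sloTarget >= 1) {
    throw new RangeError("sloTarget must be strictly between 0 and 1");
  }
  const errorBudget = 1 - sloTarget;
  return burnRate * errorBudget;
}

// Fraction of the period's error budget spent if the burn rate holds for the alert window.
function budgetConsumed(burnRate: number, alertWindowHours: number, sloPeriodDays: number): number {
  return (burnRate * alertWindowHours) / (sloPeriodDays * 24);
}
```

For example, `errorRatioThreshold(0.999, 14.4)` is about 0.0144, and holding that rate for one hour of a 30-day period spends about 2% of the budget. By the same arithmetic, a fixed 5% error-rate threshold against a 99.9% target corresponds to a burn rate of 50.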
    # Prometheus alert rules. Memory headroom uses kube_pod_container_resource_limits,
    # not the removed cAdvisor container_spec_memory_limit_bytes metric.
    groups:
      - name: service_health
        interval: 30s
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
              /
              sum(rate(http_requests_total[5m])) by (service) > 0.05
            for: 5m
            labels: { severity: critical, team: platform }
            annotations:
              summary: "High error rate on {{ $labels.service }}"
              runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

          - alert: PodMemoryUsageHigh
            # namespace is part of the grouping so same-named pods in different namespaces stay separate
            expr: |
              sum(container_memory_working_set_bytes{pod!=""}) by (namespace, pod, container)
              /
              sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod, container) > 0.9
            for: 5m
            labels: { severity: warning }
            annotations:
              summary: "Pod {{ $labels.pod }} memory usage > 90%"

MCP servers for monitoring
Two MCP servers turn your agent into an on-call assistant that can read live error and dashboard data. Both are vendor-official. Be careful, because the npm registry has stub and third-party look-alikes for both.
Sentry
The canonical install is the remote server over HTTP with OAuth. There is no token to manage, and the same endpoint works in Cursor, Claude Code, and Codex. In Claude Code:
    claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

If you need a local stdio server (air-gapped, or self-hosted Sentry), use the official package @sentry/mcp-server, not sentry-mcp, which is an unrelated 0.0.1 stub. The env var is SENTRY_ACCESS_TOKEN, and self-hosted instances pass --host with the hostname only:
    {
      "mcpServers": {
        "sentry": {
          "command": "npx",
          "args": [
            "-y",
            "@sentry/mcp-server",
            "--access-token=${SENTRY_ACCESS_TOKEN}",
            "--host=your-org.sentry.io"
          ]
        }
      }
    }

Grafana
The official server is grafana/mcp-grafana, a Go binary distributed via uvx (or Docker/Helm), not the third-party @leval/mcp-grafana npm wrapper. Authenticate with a service-account token (GRAFANA_SERVICE_ACCOUNT_TOKEN); Grafana API keys are deprecated:
    {
      "mcpServers": {
        "grafana": {
          "command": "uvx",
          "args": ["mcp-grafana"],
          "env": {
            "GRAFANA_URL": "https://grafana.example.com",
            "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
          }
        }
      }
    }

When this breaks
Three failure modes account for most wasted hours. Hand the symptom to your agent with the specific query it should run.
Prometheus slows to a crawl and memory balloons — usually a label like user_id or a raw URL exploding the time-series count.
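The arithmetic behind that explosion is simple: the worst-case series count for a metric is the product of the distinct values of each label. The sketch below illustrates it, along with one common mitigation: collapsing ID-like path segments into a placeholder before using a URL as a label. Both function names, the `:id` placeholder, and the numeric/UUID heuristics are assumptions for illustration.

```typescript
// Upper bound on time series for one metric: the product of distinct values per label.
function maxSeriesCount(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((product, count) => product * count, 1);
}

// Collapses numeric and UUID-shaped path segments so a raw URL becomes a bounded route label.
function normalizeRouteLabel(rawPath: string): string {
  const pathOnly = rawPath.split("?")[0] ?? "";
  return pathOnly
    .split("/")
    .map((segment) => {
      const isNumericId = /^\d+$/.test(segment);
      const isUuid = /^[0-9a-f]{8}-[0-9a-f-]{27}$/i.test(segment);
      return isNumericId || isUuid ? ":id" : segment;
    })
    .join("/");
}
```

With 5 methods, 10 status codes, and 20 routes the bound is 1,000 series. Swap the route label for a `user_id` with 100,000 values and it becomes 5,000,000.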
    Analyze our Prometheus instance for high-cardinality metrics:
    - Run topk(10, count by (__name__)({__name__=~".+"})) to find the worst metrics
    - Identify which labels are unbounded (user IDs, request IDs, raw paths)
    - Propose recording rules to pre-aggregate, and relabel_config to drop the offending labels

Spans appear for some services but vanish at a boundary. The cause is almost always broken context propagation or an over-aggressive sampler.
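When checking propagation by hand, it helps to confirm that the header arriving at the downstream service is well formed and carries the sampled flag. The sketch below parses the version-00 layout of the W3C traceparent header (version, trace ID, parent ID, flags). The `parseTraceparent` name and `TraceParent` shape are hypothetical, and in a real service the OpenTelemetry propagator does this for you.

```typescript
// Parsed fields of a version-00 traceparent header.
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// Returns the parsed header, or null when it is missing or malformed.
// Only the four-field version-00 layout is accepted in this sketch.
function parseTraceparent(header: string | undefined): TraceParent | null {
  if (!header) return null;
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the Trace Context spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest bit of the flags byte is the sampled flag.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

If the gateway logs show a valid header going out but the downstream service sees `null` here, the header is being stripped in between. If it parses but `sampled` is false, look at the sampler before the exporter.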
    Debug why traces stop at our API gateway:
    - Confirm W3C traceparent headers are forwarded on outbound calls
    - Check the sampler config (is it tail-based dropping the children?)
    - Verify the OTLP exporter endpoint and TLS for the downstream service
    - Show me how to force-sample one request end-to-end to prove the path

One incident fires fifty alerts and PagerDuty melts. The fix is grouping and inhibition, not more thresholds.
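To see why those two mechanisms cut the noise, it can help to model them on plain data. The sketch below is a simplified illustration, not Alertmanager's implementation: `groupAlerts` buckets alerts by a label list the way a group_by setting would, and `inhibitWarnings` drops warnings that share a set of labels with a firing critical alert. All names and the `FiringAlert` shape are assumptions.

```typescript
// Simplified alert: only the labels matter for grouping and inhibition.
interface FiringAlert {
  labels: Record<string, string>;
}

// Buckets alerts by the values of the chosen labels, so each bucket yields one notification.
function groupAlerts(alerts: FiringAlert[], groupBy: string[]): Map<string, FiringAlert[]> {
  const groups = new Map<string, FiringAlert[]>();
  for (const alert of alerts) {
    const key = groupBy.map((label) => `${label}=${alert.labels[label] ?? ""}`).join(",");
    const bucket = groups.get(key) ?? [];
    bucket.push(alert);
    groups.set(key, bucket);
  }
  return groups;
}

// Drops warnings that match a firing critical alert on every label in `equal`.
function inhibitWarnings(alerts: FiringAlert[], equal: string[]): FiringAlert[] {
  const criticals = alerts.filter((a) => a.labels["severity"] === "critical");
  return alerts.filter((alert) => {
    if (alert.labels["severity"] !== "warning") return true;
    return !criticals.some((critical) =>
      equal.every((label) => critical.labels[label] === alert.labels[label]),
    );
  });
}
```

Applied in that order (inhibit, then group), fifty alerts for one broken service collapse into a single critical notification, while warnings for unrelated services still get through.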
    Our Alertmanager floods during incidents. Rewrite the routing config to:
    - Group by alertname, cluster, and service with sane group_wait/group_interval
    - Add an inhibit_rule so a critical alert silences related warnings
    - Route critical to PagerDuty and warnings to Slack only during business hours

What’s next
- CI/CD Pipelines: wire dashboard and alert-rule validation into your deploy pipeline.
- Incident Response: use the signals you just instrumented to run a faster postmortem.
- Debugging Patterns: the trace-to-logs workflow when an alert finally fires.