Monitoring and Observability

Your checkout service is throwing intermittent 504s. The traces stop at the gateway span, the dashboards show “everything green,” and your PM wants an ETA. You suspect the payment provider, but you can’t prove it because half your services aren’t instrumented and the ones that are export to a Jaeger instance nobody looks at. This is the gap AI coding agents close fastest: turning a half-wired observability stack into traces, metrics, dashboards, and alerts that actually answer “what broke and why.”

  • A working OpenTelemetry setup for Node.js using the current @opentelemetry/resources 2.x API (no deprecated Resource class).
  • Three tool-specific workflows (Cursor, Claude Code, Codex) for instrumenting a service and standing up a Grafana stack.
  • Copy-paste prompts for RED-metric dashboards, signal correlation, and bottleneck hunting that work with minimal editing.
  • The correct, verified MCP configs for Sentry and Grafana — the two that actually exist and ship from the vendors.
  • A failure-mode playbook for the three observability problems that waste the most time: high cardinality, missing traces, and alert storms.

Step 1: Instrument the service with OpenTelemetry

Start by asking your agent to wire up OTel. The instrumentation workflow is largely identical across all three tools; what differs is how you invoke each one and how it handles the multi-file edit. Point the agent at the current 2.x API explicitly. Without that instruction, agents often reach for the removed Resource class and the deprecated SemanticResourceAttributes.

In Cursor, open Agent mode (Cmd/Ctrl+I) and reference the codebase so it picks up your existing entry point and dependencies:

@codebase Set up OpenTelemetry for our Node.js Express service.
Requirements:
- Use @opentelemetry/sdk-node with getNodeAutoInstrumentations
- Build the Resource with resourceFromAttributes() from @opentelemetry/resources
(the Resource class is removed in resources 2.x) and ATTR_SERVICE_NAME /
ATTR_SERVICE_VERSION from @opentelemetry/semantic-conventions
- OTLP/gRPC exporters for traces and metrics
- Disable the fs instrumentation (too noisy)
- Load it via node --require before app code, not inline

Review the diff in the multi-file editor before accepting — Cursor will usually touch package.json, a new instrumentation.ts, and your start script.

The agent should produce something close to this. Note the 2.x API — resourceFromAttributes instead of new Resource(...), and ATTR_-prefixed constants instead of SemanticResourceAttributes:

// instrumentation.ts: compile to JS (or run it through a TypeScript loader) and
// preload it before app code, e.g. node --require ./instrumentation.js app.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { resourceFromAttributes } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } from '@opentelemetry/semantic-conventions';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

// deployment.environment.name is not a stable convention yet, so its constant is
// exported only from the /incubating entry point. The literal key is used here.
const ATTR_DEPLOYMENT_ENVIRONMENT_NAME = 'deployment.environment.name';

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'checkout',
    [ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION ?? '0.0.0',
    [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
    exportIntervalMillis: 30_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // fs spans are too noisy to be useful
      '@opentelemetry/instrumentation-fs': { enabled: false },
    }),
  ],
});

sdk.start();

For business-level spans, import the API explicitly — trace and metrics come from @opentelemetry/api, which auto-instrumentation does not import for you:

import { trace } from '@opentelemetry/api';

// Starts a span for a business operation. The caller is responsible for calling span.end().
export function startBusinessSpan(operation: string, attrs: Record<string, string>) {
  const tracer = trace.getTracer('business-operations');
  return tracer.startSpan(operation, { attributes: { 'business.operation': operation, ...attrs } });
}
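
The helper above hands back a span that the caller has to end. If a code path throws first, the span is never ended and so never exported. A minimal sketch of a wrapper that guards against this follows. `SpanLike`, `STATUS_ERROR`, and `withSpan` are hypothetical names, not part of OpenTelemetry; the interface only assumes the `recordException`, `setStatus`, and `end` methods that the `Span` type in @opentelemetry/api provides.

```typescript
// Hypothetical helper. SpanLike mirrors the subset of the OpenTelemetry
// Span interface used here, so a real span can be passed in unchanged.
interface SpanLike {
  recordException(error: Error): void;
  setStatus(status: { code: number; message?: string }): void;
  end(): void;
}

// Assumption: same numeric value as SpanStatusCode.ERROR in @opentelemetry/api.
const STATUS_ERROR = 2;

// Runs `work`, records any failure on the span, and always ends the span.
async function withSpan<T>(span: SpanLike, work: () => Promise<T>): Promise<T> {
  try {
    return await work();
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw err; // rethrow so callers still see the original failure
  } finally {
    span.end();
  }
}
```

A call site would pass the span returned by startBusinessSpan together with the async work to wrap. The error is rethrown rather than swallowed so existing error handling keeps working.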

Once spans and metrics are flowing, get the storage and visualization layer up. A typical stack is OTel Collector to fan out, Prometheus for metrics, Loki for logs, Tempo for traces, and Grafana on top. Ask your agent for the Compose file, then ask it for the dashboard.

The Loki config is where agents most often emit dead syntax. Insist on the current TSDB schema — boltdb-shipper, schema: v11, and the shared_store key are legacy and will fail to start on Loki 3.x:

# Loki 3.x: TSDB schema (v13). shared_store is removed; do not add it.
schema_config:
  configs:
    - from: 2024-04-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/tsdb-index
    cache_location: /loki/tsdb-cache
  aws:
    s3: s3://endpoint/bucket
    s3forcepathstyle: true
compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  # storage backend comes from storage_config; no shared_store here
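
Because agents tend to regenerate the legacy keys, it can help to check generated configs before they reach a running Loki. The sketch below is one possible guard, assuming the config is available as raw text. `findLegacyLokiSettings` and `LEGACY_LOKI_PATTERNS` are hypothetical names, and the check is a line-based regex scan, not a YAML parser, so treat it as a first-pass filter only.

```typescript
// Hypothetical lint helper: flags the legacy Loki settings named above.
// Patterns are anchored at line start so mentions inside comments are ignored.
const LEGACY_LOKI_PATTERNS: Array<{ pattern: RegExp; reason: string }> = [
  { pattern: /^[\s-]*store:\s*boltdb-shipper\b/m, reason: 'store: boltdb-shipper (use tsdb)' },
  { pattern: /^[\s-]*schema:\s*v1[0-2]\b/m, reason: 'schema v10-v12 (use v13)' },
  { pattern: /^[\s-]*shared_store:/m, reason: 'shared_store key (removed)' },
];

// Returns one human-readable reason per legacy setting found in the text.
function findLegacyLokiSettings(configText: string): string[] {
  return LEGACY_LOKI_PATTERNS
    .filter(({ pattern }) => pattern.test(configText))
    .map(({ reason }) => reason);
}
```

An empty result means none of the three patterns matched; it does not prove the config is valid for your Loki version.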

Step 3: Generate alert rules that fire on the right signal

Alert rules are easy to get subtly wrong — the metric name drifts, the threshold is arbitrary, or the rule references a metric your exporter no longer emits. Ask for SLO-aware rules and pin the metrics to what kube-state-metrics actually produces today.

# Prometheus alert rules. Memory headroom uses kube_pod_container_resource_limits,
# not the removed cAdvisor container_spec_memory_limit_bytes metric.
groups:
  - name: service_health
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service) > 0.05
        for: 5m
        labels: { severity: critical, team: platform }
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"
      - alert: PodMemoryUsageHigh
        # container!="" skips the pod-level cgroup series; grouping by namespace
        # keeps same-named pods in different namespaces apart.
        expr: |
          sum(container_memory_working_set_bytes{pod!="", container!=""}) by (namespace, pod, container)
            / sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod, container) > 0.9
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage > 90%"
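
The 0.05 threshold in HighErrorRate is a fixed ratio. To judge whether it is SLO-aware, it helps to express it as a burn rate: the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below works through that arithmetic. `errorRatio` and `burnRate` are hypothetical helper names, and the 99.9% target in the usage note is an assumed example, not a value from this setup.

```typescript
// Hypothetical helpers that mirror the HighErrorRate expression for offline checks.

// Fraction of requests that failed. PromQL yields NaN for 0/0, which never
// satisfies "> 0.05"; returning 0 here has the same effect of staying silent.
function errorRatio(errorRequests: number, totalRequests: number): number {
  return totalRequests === 0 ? 0 : errorRequests / totalRequests;
}

// Burn rate: how many times faster than "exactly on budget" errors are accruing.
function burnRate(ratio: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  if (errorBudget <= 0) {
    throw new RangeError('sloTarget must be below 1');
  }
  return ratio / errorBudget;
}
```

For an assumed 99.9% availability target, 50 failures in 1000 requests is a 5% error ratio and a burn rate of about 50, so a rule that waits for 5% fires only once the budget is being spent roughly fifty times too fast.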

Two MCP servers turn your agent into an on-call assistant that can read live error and dashboard data. Both are vendor-official — be careful, because the npm registry has stub and third-party look-alikes for both.

For Sentry, the canonical install is the remote server over HTTP with OAuth, so there is no token to manage. The same endpoint works in Cursor, Claude Code, and Codex; the Claude Code command is:

claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

If you need a local stdio server (air-gapped, or self-hosted Sentry), use the official package @sentry/mcp-server, not sentry-mcp, which is an unrelated 0.0.1 stub. The env var is SENTRY_ACCESS_TOKEN, and self-hosted instances pass --host with the hostname only:

{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": [
        "-y",
        "@sentry/mcp-server",
        "--access-token=${SENTRY_ACCESS_TOKEN}",
        "--host=your-org.sentry.io"
      ]
    }
  }
}

The official server is grafana/mcp-grafana, a Go binary distributed via uvx (or Docker/Helm) — not the third-party @leval/mcp-grafana npm wrapper. Authenticate with a service-account token (GRAFANA_SERVICE_ACCOUNT_TOKEN); Grafana API keys are deprecated:

{
  "mcpServers": {
    "grafana": {
      "command": "uvx",
      "args": ["mcp-grafana"],
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_SERVICE_ACCOUNT_TOKEN": "${GRAFANA_SERVICE_ACCOUNT_TOKEN}"
      }
    }
  }
}
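
Since look-alike packages are the main risk with both servers, a small check over the MCP config can catch them before an agent is given access. The sketch below assumes the `mcpServers` shape shown above and flags only the two look-alike names mentioned in this section. `findLookalikeServers`, `stripVersion`, and `LOOKALIKE_PACKAGES` are hypothetical names, and the list is illustrative, not exhaustive.

```typescript
// Hypothetical guard: flags MCP server entries whose args reference a known look-alike package.
interface McpServerEntry {
  command: string;
  args?: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Only the two names called out in the text; extend as needed.
const LOOKALIKE_PACKAGES = new Set(['sentry-mcp', '@leval/mcp-grafana']);

// Drops a trailing "@version" while keeping a leading "@scope".
function stripVersion(arg: string): string {
  const at = arg.lastIndexOf('@');
  return at > 0 ? arg.slice(0, at) : arg;
}

// Returns the names of configured servers that reference a look-alike package.
function findLookalikeServers(config: McpConfig): string[] {
  return Object.entries(config.mcpServers)
    .filter(([, entry]) =>
      (entry.args ?? []).some((arg) => LOOKALIKE_PACKAGES.has(stripVersion(arg))),
    )
    .map(([name]) => name);
}
```

Matching on exact package names keeps the official @sentry/mcp-server and mcp-grafana entries from being flagged by accident.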

Three failure modes account for most wasted hours. Hand the symptom to your agent with the specific query it should run.

Prometheus slows to a crawl and memory balloons — usually a label like user_id or a raw URL exploding the time-series count.

Analyze our Prometheus instance for high-cardinality metrics:
- Run topk(10, count by (__name__)({__name__=~".+"})) to find the worst metrics
- Identify which labels are unbounded (user IDs, request IDs, raw paths)
- Propose recording rules to pre-aggregate, and metric_relabel_configs to drop the offending labels
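
Dropping labels at scrape time treats the symptom; the longer-term fix is usually to stop emitting unbounded values in the first place. As an illustration for the raw-path case, the sketch below collapses numeric and UUID path segments into a placeholder before the path is used as a label. `normalizePathLabel` is a hypothetical helper and the `:id` placeholder is an arbitrary choice; where the framework exposes the matched route template, that is generally a better source.

```typescript
// Hypothetical helper: bounds the cardinality of a path-based metric label.
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC_SEGMENT = /^\d+$/;

function normalizePathLabel(rawUrl: string): string {
  // Query strings and fragments are unbounded too, so drop them first.
  const [path = ''] = rawUrl.split(/[?#]/);
  return path
    .split('/')
    .map((segment) =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment) ? ':id' : segment,
    )
    .join('/');
}
```

With this in place, /users/12345/orders/42?page=2 and /users/678/orders/9 both map to the single label value /users/:id/orders/:id.
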

  • CI/CD Pipelines: wire dashboard and alert-rule validation into your deploy pipeline.
  • Incident Response: use the signals you just instrumented to run a faster postmortem.
  • Debugging Patterns: the trace-to-logs workflow when an alert finally fires.