
Observability Setup via CLI

It is 2 AM and PagerDuty fires. The alert says “high error rate on API.” You open Grafana and see a spike, but the dashboard only shows HTTP status codes — no trace IDs, no log correlation, no way to tell which endpoint or which downstream service is the culprit. You SSH into the box, grep through unstructured logs, and spend forty minutes narrowing it down to a timeout in the payment provider’s API. The fix takes two minutes. The investigation took twenty times longer because observability was an afterthought.

This guide gives you:

  • A Claude Code workflow for generating OpenTelemetry instrumentation, structured logging, and Prometheus alert rules from your existing codebase
  • Copy-paste prompts that produce Grafana dashboard JSON, Alertmanager routing configs, and SLO definitions
  • A systematic approach to adding observability after the fact without rewriting your application

Bootstrapping OpenTelemetry in an Existing App


The hardest part of observability is the initial setup. You have a Node.js (or Python, or Go) application already in production, and adding tracing feels like surgery on a running patient. Claude Code makes this manageable because it can read your actual application entry point and produce instrumentation that fits.

Claude Code reads your package.json to determine your framework (Express, Fastify, Hono, etc.) and your database library (Prisma, Drizzle, pg, Mongoose), then generates the exact instrumentation packages you need. No guessing which @opentelemetry/instrumentation-* package covers your stack.
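For a typical Express + Prisma stack, the generated entry point looks something like the sketch below. Treat it as a shape, not the exact output: the package choices, the `serviceName`, and the file name `telemetry.js` are assumptions that will vary with your stack.

```javascript
// telemetry.js, preloaded via `node -r ./telemetry.js`
// Sketch only: assumes @opentelemetry/sdk-node and
// @opentelemetry/auto-instrumentations-node are installed.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter(), // honors OTEL_EXPORTER_OTLP_ENDPOINT
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans before the process exits
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));
```

The point of having Claude Code generate this rather than copying a gist is that it picks the instrumentation list to match what is actually in your package.json.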

After generating the setup, verify traces are flowing:

# Start the OTEL collector locally
docker run -p 4318:4318 otel/opentelemetry-collector-contrib
# Start your app with telemetry
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 node -r ./telemetry.js dist/index.js
# Hit an endpoint and check the collector logs
curl http://localhost:3000/api/health

Then iterate with Claude Code:

The traces are showing up but database queries don't include the SQL statement. Update the instrumentation config to capture the db.statement attribute for Prisma queries, but redact any parameter values to avoid leaking PII.

Retrofitting Structured Logging

Unstructured logs are noise. Structured logs are data. Claude Code can retrofit structured logging into an existing codebase in one session.

This is a big change, and Claude Code handles it methodically. It creates the logger module first, then walks through each file replacing console.log calls. The key benefit is the trace correlation — every log entry can be linked back to a distributed trace, which turns your grep-based debugging into a searchable, correlated investigation.

After Claude makes the changes, verify with a quick sanity check:

Run a request through the app and show me a sample log entry to confirm the traceId and correlationId are present.

Metrics and Alert Rules

Metrics without alerts are dashboards nobody watches. Claude Code can generate both the instrumentation and the alert rules in one pass.

Look at our API routes and generate Prometheus metrics for: 1) request count by method, route, and status code, 2) request duration histogram with buckets at 50ms, 100ms, 250ms, 500ms, 1s, 5s, 3) active database connection pool gauge, 4) business metrics for orders_created and payments_processed counters. Then generate a prometheus-rules.yml with alerts for: error rate above 1% for 5 minutes (critical), p95 latency above 2 seconds for 10 minutes (warning), and database connection pool above 80% utilization (warning).
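The error-rate rule from that prompt comes back looking roughly like this. The metric name `http_requests_total` and its labels are assumptions about your instrumentation; match them to whatever your metrics endpoint actually exports.

```yaml
# prometheus-rules.yml (excerpt, illustrative names)
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 5 minutes"
```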

Dashboards as Code

Grafana dashboards created by clicking in the UI tend to rot. Dashboards defined as JSON and stored in version control stay accurate. Claude Code can generate dashboard JSON that you commit alongside your application code.

Commit the JSON to monitoring/dashboards/api-overview.json and provision it via Grafana’s dashboard provisioning. This way, dashboard changes go through code review just like application changes.
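A minimal provisioning file for that setup looks like this; the paths and provider name are illustrative, and Grafana loads the file from its own provisioning/dashboards directory:

```yaml
# grafana/provisioning/dashboards/api.yaml (illustrative path)
apiVersion: 1
providers:
  - name: api-dashboards
    type: file
    allowUiUpdates: false   # dashboards change via git, not the UI
    options:
      path: /var/lib/grafana/dashboards
```

Setting `allowUiUpdates: false` is the enforcement half of "dashboards as code": edits made in the UI cannot silently diverge from what is committed.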

SLOs and Burn-Rate Alerts

Service Level Objectives turn vague “we should be fast” goals into measurable targets with burn-rate alerts.

Define SLOs for our API: 99.9% availability (successful responses / total responses) and 95% of requests under 500ms. Generate Prometheus recording rules for the SLO ratios, error budget remaining (over a 30-day window), and multi-window burn-rate alerts. Use the standard Google SRE approach with 1h/6h fast-burn and 3d/30d slow-burn windows. Also generate the Alertmanager routing that sends fast-burn alerts to PagerDuty and slow-burn alerts to Slack.

This produces a set of recording rules and alert rules that would take hours to write by hand, especially the multi-window burn-rate math. Claude Code is reliable on the PromQL structure, but review the burn-rate factors against the SRE Workbook's tables before you trust the rules to page someone at 2 AM.
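The fast-burn half of those rules follows this shape. The metric name `http_requests_total` is an assumption, and 0.001 is the error budget implied by the 99.9% availability target from the prompt:

```yaml
# slo-rules.yml (excerpt, illustrative names)
groups:
  - name: slo-api-availability
    rules:
      - record: slo:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      - record: slo:error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate spends 2% of a 30-day budget in one hour
        # (Google SRE Workbook). Requiring both windows suppresses blips.
        expr: |
          slo:error_ratio:rate1h > 14.4 * 0.001
          and slo:error_ratio:rate6h > 14.4 * 0.001
        labels:
          severity: critical
```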

Monitoring Claude Code Itself

If your team uses Claude Code in CI (via claude-code-action for PR reviews or automated fixes), you should track that usage too. Claude Code supports OpenTelemetry export for its own operations.

# Enable Claude Code telemetry export
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
export OTEL_RESOURCE_ATTRIBUTES="team=platform,cost_center=eng-42"

Then ask Claude Code to generate a dashboard for its own metrics:

Generate a Grafana dashboard for Claude Code team usage. Use the claude_code.* metric namespace. Include panels for: sessions per day by user, average tokens per session, tool acceptance rate (accepted edits / total edits), cost per team, and most-used languages.

Troubleshooting

Traces are missing spans for certain routes. Auto-instrumentation covers standard HTTP handlers but misses custom middleware or queue consumers. Ask Claude Code: “Add manual span creation for our Bull queue job processors in workers/.” — it will generate the tracer.startActiveSpan wrapper around each job handler.
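That wrapper typically follows this shape; it is a sketch assuming @opentelemetry/api and a Bull-style processor signature, and the name `traced` is illustrative:

```javascript
// Sketch: wrap a queue job handler in a manually created span.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('workers');

function traced(jobName, handler) {
  return (job) =>
    tracer.startActiveSpan(`process ${jobName}`, async (span) => {
      span.setAttribute('job.id', String(job.id));
      try {
        return await handler(job);
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    });
}

// Usage: queue.process('send-email', traced('send-email', sendEmailHandler));
```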

Log volume is exploding after the structured logging migration. Structured logs are bigger than console.log strings. Add log sampling for high-frequency, low-value paths: “Add log sampling at 10% for GET /api/health and GET /api/readiness routes. Log all errors at 100% regardless of sampling.”
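The sampling logic itself is a few lines. This sketch hard-codes the route list and rate from the prompt above; a generated version would read them from config:

```javascript
// Path-based log sampling: keep ~10% of noisy health-check logs,
// everything else (and all errors) at 100%. Routes are illustrative.
const SAMPLED_ROUTES = new Set(['GET /api/health', 'GET /api/readiness']);
const SAMPLE_RATE = 0.1;

function shouldLog(level, route) {
  if (level === 'error') return true;           // errors are never sampled out
  if (!SAMPLED_ROUTES.has(route)) return true;  // only listed routes are sampled
  return Math.random() < SAMPLE_RATE;
}
```

Sampling at the logger keeps the decision in one place, so adding a new low-value route is a one-line change.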

Prometheus scrape targets are not discovered. If you use Kubernetes, service discovery relies on pod annotations. Ask Claude Code: “Add the prometheus.io/scrape, prometheus.io/port, and prometheus.io/path annotations to our Kubernetes deployment manifest so Prometheus discovers our metrics endpoint at /metrics on port 9090.”
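The annotations land on the pod template, not the Deployment's top-level metadata, which is an easy place to get them wrong:

```yaml
# deployment.yaml (excerpt)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
```

Note the values must be strings, so the port needs quotes; an unquoted `9090` fails Kubernetes validation because annotation values are typed as strings.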

Grafana dashboard shows “No data” after deployment. Metric names or label values changed. Claude Code can help: “Compare our old Prometheus metric names with the new ones from the OpenTelemetry migration and generate a recording rules file that maps old names to new names for backward compatibility.”
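The compatibility file is a set of recording rules that re-publish new series under the old names. Both metric names below are illustrative, and note the caveat that rules recorded from counters can glitch briefly across counter resets:

```yaml
# metric-compat-rules.yml (excerpt, illustrative names)
groups:
  - name: metric-name-compat
    rules:
      # Old dashboards query http_requests_total; the OTel SDK now
      # exports http_server_requests_total.
      - record: http_requests_total
        expr: sum by (method, route, status) (http_server_requests_total)
```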

Alert fatigue from too many warnings. The first version of alert rules is always too noisy. Iterate: “Review our current alert rules and increase the ‘for’ duration on warning-level alerts from 5 minutes to 15 minutes. Group related alerts so we get one notification per incident, not one per pod.”
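The grouping half of that change lands in Alertmanager's route config, roughly like this; the receiver names are assumptions:

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: slack-warnings
  group_by: ['alertname', 'service']  # one notification per incident, not per pod
  group_wait: 30s        # wait to batch alerts that fire together
  group_interval: 5m     # delay before notifying about additions to a group
  repeat_interval: 4h    # re-notify on still-firing alerts
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
```

Grouping by `alertname` and `service` rather than the default per-label-set behavior is what collapses twenty pod-level alerts into a single page.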