
Observability Setup via CLI

It is 2 AM and PagerDuty fires. The alert says “high error rate on API.” You open Grafana and see a spike, but the dashboard only shows HTTP status codes — no trace IDs, no log correlation, no way to tell which endpoint or which downstream service is the culprit. You SSH into the box, grep through unstructured logs, and spend forty minutes narrowing it down to a timeout in the payment provider’s API. The fix takes two minutes. The investigation took twenty times longer because observability was an afterthought.

This guide gives you:

  • A Claude Code workflow for generating OpenTelemetry instrumentation, structured logging, and Prometheus alert rules from your existing codebase
  • Copy-paste prompts that produce Grafana dashboard JSON, Alertmanager routing configs, and SLO definitions
  • A systematic approach to adding observability after the fact without rewriting your application

Bootstrapping OpenTelemetry in an Existing App


The hardest part of observability is the initial setup. You have a Node.js (or Python, or Go) application already in production, and adding tracing feels like surgery on a running patient. Claude Code makes this manageable because it can read your actual application entry point and produce instrumentation that fits.

Claude Code reads your package.json to determine your framework (Express, Fastify, Hono, etc.) and your database library (Prisma, Drizzle, pg, Mongoose), then generates the exact instrumentation packages you need. No guessing which @opentelemetry/instrumentation-* package covers your stack.
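For a typical Express + Prisma stack, the generated entry point looks something like the sketch below. Treat it as a shape, not the exact output: the package choices, the `serviceName`, and the file name `telemetry.js` are assumptions that will vary with your stack.

```javascript
// telemetry.js, preloaded via `node -r ./telemetry.js`
// Sketch only: assumes @opentelemetry/sdk-node and
// @opentelemetry/auto-instrumentations-node are installed.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({
  serviceName: 'api',
  traceExporter: new OTLPTraceExporter(), // honors OTEL_EXPORTER_OTLP_ENDPOINT
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

// Flush pending spans before the process exits
process.on('SIGTERM', () => sdk.shutdown().finally(() => process.exit(0)));
```

The point of having Claude Code generate this rather than copying a gist is that it picks the instrumentation list to match what is actually in your package.json.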

After generating the setup, verify traces are flowing:

# Start the OTEL collector locally
docker run -p 4318:4318 otel/opentelemetry-collector-contrib
# Start your app with telemetry
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 node -r ./telemetry.js dist/index.js
# Hit an endpoint and check the collector logs
curl http://localhost:3000/api/health

Then iterate with Claude Code:

The traces are showing up but database queries don't include the SQL statement. Update the instrumentation config to capture the db.statement attribute for Prisma queries, but redact any parameter values to avoid leaking PII.

Retrofitting Structured Logging

Unstructured logs are noise. Structured logs are data. Claude Code can retrofit structured logging into an existing codebase in one session.

This is a big change, and Claude Code handles it methodically. It creates the logger module first, then walks through each file replacing console.log calls. The key benefit is the trace correlation — every log entry can be linked back to a distributed trace, which turns your grep-based debugging into a searchable, correlated investigation.

After Claude makes the changes, verify with a quick sanity check:

Run a request through the app and show me a sample log entry to confirm the traceId and correlationId are present.

Metrics and Alert Rules

Metrics without alerts are dashboards nobody watches. Claude Code can generate both the instrumentation and the alert rules in one pass.

Look at our API routes and generate Prometheus metrics for: 1) request count by method, route, and status code, 2) request duration histogram with buckets at 50ms, 100ms, 250ms, 500ms, 1s, 5s, 3) active database connection pool gauge, 4) business metrics for orders_created and payments_processed counters. Then generate a prometheus-rules.yml with alerts for: error rate above 1% for 5 minutes (critical), p95 latency above 2 seconds for 10 minutes (warning), and database connection pool above 80% utilization (warning).
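The error-rate rule from that prompt comes back looking roughly like this. The metric name `http_requests_total` and its labels are assumptions about your instrumentation; match them to whatever your metrics endpoint actually exports.

```yaml
# prometheus-rules.yml (excerpt, illustrative names)
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 1% for 5 minutes"
```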

Dashboards as Code

Grafana dashboards created by clicking in the UI tend to rot. Dashboards defined as JSON and stored in version control stay accurate. Claude Code can generate dashboard JSON that you commit alongside your application code.

Commit the JSON to monitoring/dashboards/api-overview.json and provision it via Grafana’s dashboard provisioning. This way, dashboard changes go through code review just like application changes.
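A minimal provisioning file for that setup looks like this; the paths and provider name are illustrative, and Grafana loads the file from its own provisioning/dashboards directory:

```yaml
# grafana/provisioning/dashboards/api.yaml (illustrative path)
apiVersion: 1
providers:
  - name: api-dashboards
    type: file
    allowUiUpdates: false   # dashboards change via git, not the UI
    options:
      path: /var/lib/grafana/dashboards
```

Setting `allowUiUpdates: false` is the enforcement half of "dashboards as code": edits made in the UI cannot silently diverge from what is committed.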

SLOs and Burn-Rate Alerts

Service Level Objectives turn vague “we should be fast” goals into measurable targets with burn-rate alerts.

Define SLOs for our API: 99.9% availability (successful responses / total responses) and 95% of requests under 500ms. Generate Prometheus recording rules for the SLO ratios, error budget remaining (over a 30-day window), and multi-window burn-rate alerts. Use the standard Google SRE approach with 1h/6h fast-burn and 3d/30d slow-burn windows. Also generate the Alertmanager routing that sends fast-burn alerts to PagerDuty and slow-burn alerts to Slack.

This produces a set of recording rules and alert rules that would take hours to write by hand, especially the multi-window burn-rate math. Claude Code is reliable on the PromQL structure, but review the burn-rate factors against the SRE Workbook's tables before you trust the rules to page someone at 2 AM.
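The fast-burn half of those rules follows this shape. The metric name `http_requests_total` is an assumption, and 0.001 is the error budget implied by the 99.9% availability target from the prompt:

```yaml
# slo-rules.yml (excerpt, illustrative names)
groups:
  - name: slo-api-availability
    rules:
      - record: slo:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
      - record: slo:error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
            / sum(rate(http_requests_total[6h]))
      - alert: ErrorBudgetFastBurn
        # A 14.4x burn rate spends 2% of a 30-day budget in one hour
        # (Google SRE Workbook). Requiring both windows suppresses blips.
        expr: |
          slo:error_ratio:rate1h > 14.4 * 0.001
          and slo:error_ratio:rate6h > 14.4 * 0.001
        labels:
          severity: critical
```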

Monitoring Claude Code Itself

If your team uses Claude Code in CI (via claude-code-action for PR reviews or automated fixes), you should track that usage too. Claude Code supports OpenTelemetry export for its own operations.

# Enable Claude Code telemetry export
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
export OTEL_RESOURCE_ATTRIBUTES="team=platform,cost_center=eng-42"

Then ask Claude Code to generate a dashboard for its own metrics:

Generate a Grafana dashboard for Claude Code team usage. Use the claude_code.* metric namespace. Include panels for: sessions per day by user, average tokens per session, tool acceptance rate (accepted edits / total edits), cost per team, and most-used languages.

Troubleshooting

Traces are missing spans for certain routes. Auto-instrumentation covers standard HTTP handlers but misses custom middleware or queue consumers. Ask Claude Code: “Add manual span creation for our Bull queue job processors in workers/.” — it will generate the tracer.startActiveSpan wrapper around each job handler.
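That wrapper typically follows this shape; it is a sketch assuming @opentelemetry/api and a Bull-style processor signature, and the name `traced` is illustrative:

```javascript
// Sketch: wrap a queue job handler in a manually created span.
const { trace, SpanStatusCode } = require('@opentelemetry/api');
const tracer = trace.getTracer('workers');

function traced(jobName, handler) {
  return (job) =>
    tracer.startActiveSpan(`process ${jobName}`, async (span) => {
      span.setAttribute('job.id', String(job.id));
      try {
        return await handler(job);
      } catch (err) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    });
}

// Usage: queue.process('send-email', traced('send-email', sendEmailHandler));
```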

Log volume is exploding after the structured logging migration. Structured logs are bigger than console.log strings. Add log sampling for high-frequency, low-value paths: “Add log sampling at 10% for GET /api/health and GET /api/readiness routes. Log all errors at 100% regardless of sampling.”
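The sampling logic itself is a few lines. This sketch hard-codes the route list and rate from the prompt above; a generated version would read them from config:

```javascript
// Path-based log sampling: keep ~10% of noisy health-check logs,
// everything else (and all errors) at 100%. Routes are illustrative.
const SAMPLED_ROUTES = new Set(['GET /api/health', 'GET /api/readiness']);
const SAMPLE_RATE = 0.1;

function shouldLog(level, route) {
  if (level === 'error') return true;           // errors are never sampled out
  if (!SAMPLED_ROUTES.has(route)) return true;  // only listed routes are sampled
  return Math.random() < SAMPLE_RATE;
}
```

Sampling at the logger keeps the decision in one place, so adding a new low-value route is a one-line change.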

Prometheus scrape targets are not discovered. If you use Kubernetes, service discovery relies on pod annotations. Ask Claude Code: “Add the prometheus.io/scrape, prometheus.io/port, and prometheus.io/path annotations to our Kubernetes deployment manifest so Prometheus discovers our metrics endpoint at /metrics on port 9090.”
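The annotations land on the pod template, not the Deployment's top-level metadata, which is an easy place to get them wrong:

```yaml
# deployment.yaml (excerpt)
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
```

Note the values must be strings, so the port needs quotes; an unquoted `9090` fails Kubernetes validation because annotation values are typed as strings.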

Grafana dashboard shows “No data” after deployment. Metric names or label values changed. Claude Code can help: “Compare our old Prometheus metric names with the new ones from the OpenTelemetry migration and generate a recording rules file that maps old names to new names for backward compatibility.”
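The compatibility file is a set of recording rules that re-publish new series under the old names. Both metric names below are illustrative, and note the caveat that rules recorded from counters can glitch briefly across counter resets:

```yaml
# metric-compat-rules.yml (excerpt, illustrative names)
groups:
  - name: metric-name-compat
    rules:
      # Old dashboards query http_requests_total; the OTel SDK now
      # exports http_server_requests_total.
      - record: http_requests_total
        expr: sum by (method, route, status) (http_server_requests_total)
```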

Alert fatigue from too many warnings. The first version of alert rules is always too noisy. Iterate: “Review our current alert rules and increase the ‘for’ duration on warning-level alerts from 5 minutes to 15 minutes. Group related alerts so we get one notification per incident, not one per pod.”
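The grouping half of that change lands in Alertmanager's route config, roughly like this; the receiver names are assumptions:

```yaml
# alertmanager.yml (excerpt)
route:
  receiver: slack-warnings
  group_by: ['alertname', 'service']  # one notification per incident, not per pod
  group_wait: 30s        # wait to batch alerts that fire together
  group_interval: 5m     # delay before notifying about additions to a group
  repeat_interval: 4h    # re-notify on still-firing alerts
  routes:
    - matchers: ['severity="critical"']
      receiver: pagerduty
```

Grouping by `alertname` and `service` rather than the default per-label-set behavior is what collapses twenty pod-level alerts into a single page.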