Observability Setup
Observability is the difference between guessing and knowing what’s happening in your systems. Whether you’re tracking performance, debugging issues, or optimizing costs, Claude Code transforms monitoring from an afterthought into a first-class development practice. This lesson explores how to leverage AI assistance for comprehensive observability.
The Observability Revolution
Scenario: Your application serves millions of requests daily across dozens of microservices. When something goes wrong, you need answers in seconds, not hours. Traditional approach: grep through logs, guess at correlations, hope for the best. With Claude Code: intelligent monitoring that tells you what’s wrong before users notice.
Traditional vs AI-Powered Observability
The traditional path:

Week 1: Basic Logging
- console.log everywhere
- No structure or consistency
- Logs lost on container restart

Week 2: Metrics Setup
- Manual metric instrumentation
- Basic CPU/memory graphs
- No custom business metrics

Week 3: Debugging Crisis
- Production issue at 3 AM
- Grep through gigabytes of logs
- No correlation between services

Week 4: Post-Mortem
- "We need better monitoring"
- Start over with new tools

With Claude Code:

Day 1: Complete Setup
> Set up OpenTelemetry with:
> - Structured logging
> - Distributed tracing
> - Custom metrics
> - Error tracking

Day 2: Dashboards & Alerts
> Create Grafana dashboards
> Configure intelligent alerts
> Set up anomaly detection

Hour 3: Incident Response
> Correlate logs, metrics, traces
> Identify root cause quickly
> Generate runbook automatically

Ongoing: Continuous Improvement
> Suggest new metrics
> Optimize performance
> Reduce monitoring costs
OpenTelemetry Integration
Complete OpenTelemetry Setup
> Set up OpenTelemetry for our Node.js application:
> - Automatic instrumentation
> - Custom spans and metrics
> - Log correlation
> - Export to multiple backends
> - Environment-specific configuration
Claude generates a comprehensive setup:
- Install Dependencies
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-metrics-otlp-http \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/instrumentation-http \
  @opentelemetry/instrumentation-express \
  @opentelemetry/instrumentation-winston
- Create Telemetry Configuration
// telemetry.js
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { OTLPMetricExporter } = require('@opentelemetry/exporter-metrics-otlp-http');
const { PeriodicExportingMetricReader } = require('@opentelemetry/sdk-metrics');

// Configure resource attributes
const resource = Resource.default().merge(
  new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME || 'myapp',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION || '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || 'development',
    'team.name': 'platform',
    'team.owner': 'engineering'
  })
);

// Configure trace exporter
const traceExporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
  headers: {
    'api-key': process.env.OTEL_EXPORTER_API_KEY
  }
});

// Configure metric exporter
const metricExporter = new OTLPMetricExporter({
  url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 'http://localhost:4318/v1/metrics',
  headers: {
    'api-key': process.env.OTEL_EXPORTER_API_KEY
  }
});

// Initialize SDK
const sdk = new NodeSDK({
  resource,
  traceExporter,
  metricReader: new PeriodicExportingMetricReader({
    exporter: metricExporter,
    exportIntervalMillis: 10000
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false // Disable noisy fs instrumentation
      }
    })
  ]
});

// Initialize and export
sdk.start()
  .then(() => console.log('OpenTelemetry initialized'))
  .catch((error) => console.error('Failed to initialize OpenTelemetry', error));

module.exports = { sdk };
- Add to Application Entry Point
// index.js - Must be first import!
require('./telemetry');

const express = require('express');
const { trace, metrics } = require('@opentelemetry/api');

// Get tracer and meter
const tracer = trace.getTracer('myapp');
const meter = metrics.getMeter('myapp');

// Create custom metrics
const requestCounter = meter.createCounter('http_requests', {
  description: 'Count of HTTP requests',
  unit: '1'
});

const requestDuration = meter.createHistogram('http_request_duration', {
  description: 'Duration of HTTP requests',
  unit: 'ms'
});

const app = express();

// Middleware for custom metrics
app.use((req, res, next) => {
  const start = Date.now();

  res.on('finish', () => {
    const duration = Date.now() - start;
    const labels = {
      method: req.method,
      route: req.route?.path || 'unknown',
      status_code: res.statusCode.toString()
    };

    requestCounter.add(1, labels);
    requestDuration.record(duration, labels);
  });

  next();
});
- Add Custom Spans
// Custom instrumentation example
// SpanStatusCode also needs to be imported from the API package
const { SpanStatusCode } = require('@opentelemetry/api');

async function processOrder(orderId) {
  // Start a new span
  return tracer.startActiveSpan('process_order', async (span) => {
    try {
      // Add span attributes
      span.setAttributes({
        'order.id': orderId,
        'order.processing_step': 'validation'
      });

      // Validate order
      const order = await validateOrder(orderId);

      // Create child span for payment
      await tracer.startActiveSpan('process_payment', async (paymentSpan) => {
        paymentSpan.setAttributes({
          'payment.amount': order.total,
          'payment.currency': order.currency
        });

        await processPayment(order);
        paymentSpan.setStatus({ code: SpanStatusCode.OK });
        paymentSpan.end(); // child spans must be ended explicitly
      });

      // Record custom metric (orderProcessingCounter is created elsewhere with meter.createCounter)
      orderProcessingCounter.add(1, {
        status: 'success',
        payment_method: order.paymentMethod
      });

      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.recordException(error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error.message
      });
      throw error;
    } finally {
      span.end();
    }
  });
}
Distributed Tracing
> Set up distributed tracing across our microservices:
> - Trace context propagation
> - Service dependency mapping
> - Critical path analysis
> - Performance bottleneck identification
// Trace context propagation
const { propagation, trace, context } = require('@opentelemetry/api');

// HTTP client with trace propagation
async function callDownstreamService(url, data) {
  const span = trace.getActiveSpan();

  // Create headers with trace context
  const headers = {};
  propagation.inject(context.active(), headers);

  try {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        ...headers,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify(data)
    });

    span?.addEvent('downstream_service_called', {
      'http.url': url,
      'http.status_code': response.status
    });

    return response.json();
  } catch (error) {
    span?.recordException(error);
    throw error;
  }
}

// Extract trace context in the receiving service
app.use((req, res, next) => {
  // Extract trace context from the incoming request headers
  const extractedContext = propagation.extract(
    context.active(),
    req.headers
  );

  // Continue the trace with the extracted context
  context.with(extractedContext, () => {
    next();
  });
});
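A quick usage sketch of the helper above (the downstream URL and payload shape are placeholders): the traceparent header written by propagation.inject is what stitches the caller's and callee's spans into a single trace.

// Hypothetical order service calling a payments service
app.post('/orders', async (req, res) => {
  // The downstream call carries the current trace context via the injected headers
  const result = await callDownstreamService(
    'http://payments.internal/charge', // placeholder URL
    { orderId: req.body.orderId, amount: req.body.amount }
  );
  res.json(result);
});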
Structured Logging
Implementing Structured Logs
> Set up structured logging with:
> - JSON format for easy parsing
> - Correlation IDs
> - Log levels and sampling
> - Integration with OpenTelemetry
> - Sensitive data masking
const winston = require('winston');
const { trace, context, createContextKey } = require('@opentelemetry/api');

// OpenTelemetry context values are keyed by symbols; this assumes the request
// middleware stores the request ID under the same key
const REQUEST_ID_KEY = createContextKey('requestId');

// Custom format that includes trace information
const traceFormat = winston.format((info) => {
  const span = trace.getActiveSpan();
  if (span) {
    const spanContext = span.spanContext();
    info.traceId = spanContext.traceId;
    info.spanId = spanContext.spanId;
  }

  // Add request ID if available
  const requestId = context.active().getValue(REQUEST_ID_KEY);
  if (requestId) {
    info.requestId = requestId;
  }

  return info;
});

// Sensitive data masking
const maskSensitive = winston.format((info) => {
  // Lowercase entries so the case-insensitive comparison below works
  const sensitive = ['password', 'token', 'apikey', 'ssn', 'creditcard'];

  const mask = (obj) => {
    if (typeof obj !== 'object' || obj === null) return obj;

    const masked = Array.isArray(obj) ? [] : {};

    for (const [key, value] of Object.entries(obj)) {
      if (sensitive.some(s => key.toLowerCase().includes(s))) {
        masked[key] = '***REDACTED***';
      } else if (typeof value === 'object' && value !== null) {
        masked[key] = mask(value);
      } else {
        masked[key] = value;
      }
    }

    return masked;
  };

  if (info.meta) {
    info.meta = mask(info.meta);
  }

  return info;
});

// Create logger instance
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    traceFormat(),
    maskSensitive(),
    winston.format.json()
  ),
  defaultMeta: {
    service: process.env.SERVICE_NAME || 'myapp',
    environment: process.env.NODE_ENV || 'development'
  },
  transports: [
    new winston.transports.Console({
      format: process.env.NODE_ENV === 'development'
        ? winston.format.combine(
            winston.format.colorize(),
            winston.format.simple()
          )
        : winston.format.json()
    })
  ]
});

// Add log sampling for high-volume logs
logger.sample = (rate = 0.1) => {
  return {
    log: (level, message, meta) => {
      if (Math.random() < rate) {
        logger.log(level, message, { ...meta, sampled: true });
      }
    }
  };
};

module.exports = logger;
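A brief usage sketch, assuming the module above is saved as logger.js (the field names and values are illustrative): anything logged under meta passes through the masking format, trace and span IDs are attached automatically when a span is active, and the sample helper keeps only a fraction of very chatty log calls.

// Illustrative usage of the logger module above
const logger = require('./logger');

logger.info('Order created', {
  meta: {
    orderId: 'ord_123',          // placeholder value, logged as-is
    creditCardNumber: '4111...', // key matches the sensitive list, logged as ***REDACTED***
  }
});

// Sampled logging for high-frequency code paths (keeps roughly 1% of entries)
const sampled = logger.sample(0.01);
sampled.log('debug', 'cache miss', { key: 'user:42' });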
Log Aggregation & Analysis
// Log aggregation patterns
class LogAggregator {
  constructor() {
    this.buffers = new Map();
    this.flushInterval = 5000; // 5 seconds

    setInterval(() => this.flush(), this.flushInterval);
  }

  aggregate(key, data) {
    if (!this.buffers.has(key)) {
      this.buffers.set(key, {
        count: 0,
        firstSeen: Date.now(),
        lastSeen: Date.now(),
        samples: []
      });
    }

    const buffer = this.buffers.get(key);
    buffer.count++;
    buffer.lastSeen = Date.now();

    // Keep only the first 5 samples
    if (buffer.samples.length < 5) {
      buffer.samples.push(data);
    }
  }

  flush() {
    for (const [key, buffer] of this.buffers) {
      if (buffer.count > 0) {
        logger.info('Aggregated log entry', {
          key,
          count: buffer.count,
          duration: buffer.lastSeen - buffer.firstSeen,
          samples: buffer.samples
        });
      }
    }

    this.buffers.clear();
  }
}

// Usage for high-frequency events
const aggregator = new LogAggregator();

// Instead of logging every request
app.use((req, res, next) => {
  aggregator.aggregate(`request:${req.method}:${req.path}`, {
    userAgent: req.headers['user-agent'],
    ip: req.ip
  });
  next();
});
Custom Metrics & KPIs
Business Metrics Implementation
> Implement custom business metrics:
> - Order processing time
> - Revenue per minute
> - Cart abandonment rate
> - API success rates
> - Feature adoption metrics
const { metrics } = require('@opentelemetry/api');
const meter = metrics.getMeter('business-metrics');

// Business metric definitions
const orderCounter = meter.createCounter('orders_total', {
  description: 'Total number of orders',
  unit: '1'
});

const revenueCounter = meter.createCounter('revenue_total', {
  description: 'Total revenue',
  unit: 'USD'
});

const cartAbandonmentGauge = meter.createUpDownCounter('cart_abandonment', {
  description: 'Number of abandoned carts',
  unit: '1'
});

const apiLatencyHistogram = meter.createHistogram('api_latency', {
  description: 'API endpoint latency',
  unit: 'ms'
});

const activeUsersGauge = meter.createObservableGauge('active_users', {
  description: 'Number of active users'
});

// Set up observable gauge callback
activeUsersGauge.addCallback(async (observableResult) => {
  const count = await getActiveUserCount();
  observableResult.observe(count, { period: '5m' });
});
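// getActiveUserCount() is assumed to exist elsewhere in your app; a minimal
// sketch might count sessions active in the last five minutes from whatever
// store you use (sessionStore here is hypothetical):
async function getActiveUserCount() {
  return sessionStore.countActiveSince(Date.now() - 5 * 60 * 1000);
}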
// Business metric helpers
class BusinessMetrics {
  static recordOrder(order) {
    orderCounter.add(1, {
      status: order.status,
      payment_method: order.paymentMethod,
      customer_type: order.isNewCustomer ? 'new' : 'returning'
    });

    revenueCounter.add(order.total, {
      currency: order.currency,
      region: order.region
    });
  }

  static recordCartAbandonment(cart) {
    cartAbandonmentGauge.add(1, {
      value: cart.total,
      items_count: cart.items.length,
      reason: cart.abandonmentReason || 'unknown'
    });
  }

  static recordApiCall(endpoint, method, duration, success) {
    apiLatencyHistogram.record(duration, {
      endpoint,
      method,
      success: success.toString()
    });
  }

  static async recordFeatureUsage(feature, userId) {
    const featureCounter = meter.createCounter(`feature_usage_${feature}`, {
      description: `Usage of ${feature} feature`
    });

    featureCounter.add(1, {
      user_segment: await getUserSegment(userId),
      first_time: await isFirstTimeUsage(userId, feature)
    });
  }
}

module.exports = BusinessMetrics;
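These helpers drop into existing request handlers. A hedged sketch, assuming the module above is saved as business-metrics.js and that a createOrder function exists in your application (the route and order shape are placeholders):

// Illustrative usage inside an Express checkout route
const BusinessMetrics = require('./business-metrics');

app.post('/checkout', async (req, res) => {
  const started = Date.now();
  try {
    const order = await createOrder(req.body); // assumed to exist in your app
    BusinessMetrics.recordOrder(order);
    BusinessMetrics.recordApiCall('/checkout', 'POST', Date.now() - started, true);
    res.status(201).json(order);
  } catch (err) {
    BusinessMetrics.recordApiCall('/checkout', 'POST', Date.now() - started, false);
    res.status(500).json({ error: 'checkout failed' });
  }
});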
Performance Metrics
// Performance monitoring
// (in Node.js, PerformanceObserver and performance come from the perf_hooks module)
const { PerformanceObserver, performance } = require('perf_hooks');

const performanceObserver = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Record performance metrics
    const histogram = meter.createHistogram(`performance_${entry.entryType}`, {
      description: `Performance timing for ${entry.entryType}`,
      unit: 'ms'
    });

    histogram.record(entry.duration, {
      name: entry.name,
      type: entry.entryType
    });
  }
});

performanceObserver.observe({
  entryTypes: ['measure', 'navigation', 'resource']
});

// Database query performance
const dbQueryHistogram = meter.createHistogram('db_query_duration', {
  description: 'Database query execution time',
  unit: 'ms'
});

// Wrap database queries
async function instrumentedQuery(sql, params) {
  const startTime = performance.now();
  const labels = {
    operation: sql.split(' ')[0].toUpperCase(),
    table: extractTableName(sql)
  };

  try {
    const result = await db.query(sql, params);
    labels.success = 'true';
    return result;
  } catch (error) {
    labels.success = 'false';
    labels.error_type = error.constructor.name;
    throw error;
  } finally {
    const duration = performance.now() - startTime;
    dbQueryHistogram.record(duration, labels);
  }
}
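The extractTableName helper referenced above isn't defined in the listing; a minimal sketch, assuming simple single-table statements, might look like this:

// Naive table-name extraction for simple queries (assumption: no joins or subqueries)
function extractTableName(sql) {
  const match = sql.match(/\b(?:from|into|update|join)\s+([`"']?)([\w.]+)\1/i);
  return match ? match[2] : 'unknown';
}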
Monitoring Infrastructure
Prometheus Setup
> Set up Prometheus for metrics collection:
> - Scrape configuration
> - Recording rules
> - Alerting rules
> - Federation setup
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-east-1'

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Recording rules
rule_files:
  - 'recording_rules.yml'
  - 'alerting_rules.yml'

# Scrape configurations
scrape_configs:
  # Application metrics
  - job_name: 'myapp'
    static_configs:
      - targets: ['myapp:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Kubernetes service discovery
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
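For the kubernetes-pods job above to pick up a pod, the pod needs the prometheus.io annotations those relabel rules look for. A sketch (the pod name, image, and port are placeholders):

# Example pod annotations matching the relabel rules above
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  annotations:
    prometheus.io/scrape: "true"   # keeps the target (action: keep)
    prometheus.io/path: "/metrics" # rewrites __metrics_path__
    prometheus.io/port: "9090"     # rewrites the scrape address port
spec:
  containers:
    - name: myapp
      image: myapp:latest
      ports:
        - containerPort: 9090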
# recording_rules.yml
groups:
  - name: myapp_recording_rules
    interval: 30s
    rules:
      # Request rate
      - record: myapp:http_requests:rate5m
        expr: rate(http_requests_total[5m])

      # Error rate
      - record: myapp:http_errors:rate5m
        expr: rate(http_requests_total{status=~"5.."}[5m])

      # P95 latency
      - record: myapp:http_latency:p95
        expr: histogram_quantile(0.95, rate(http_request_duration_bucket[5m]))

      # Business metrics
      - record: myapp:orders:rate1h
        expr: rate(orders_total[1h])

      - record: myapp:revenue:rate1h
        expr: rate(revenue_total[1h])
# alerting_rules.yml
groups:
  - name: myapp_alerts
    rules:
      - alert: HighErrorRate
        expr: myapp:http_errors:rate5m > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.instance }}"

      - alert: HighLatency
        expr: myapp:http_latency:p95 > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "P95 latency is {{ $value }}ms"

      - alert: LowOrderRate
        # rate() is per-second, so convert to orders/hour before comparing
        expr: myapp:orders:rate1h * 3600 < 10
        for: 30m
        labels:
          severity: warning
          team: business
        annotations:
          summary: "Low order rate"
          description: "Order rate dropped to {{ $value }} orders/hour"
Grafana Dashboards
> Create comprehensive Grafana dashboards:
> - System overview
> - Business metrics
> - Performance analysis
> - Error tracking
> - SLA monitoring
{ "dashboard": { "title": "MyApp Production Dashboard", "panels": [ { "title": "Request Rate", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total[5m])) by (method)", "legendFormat": "{{method}}" } ], "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 } }, { "title": "Error Rate", "type": "graph", "targets": [ { "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100", "legendFormat": "Error %" } ], "alert": { "conditions": [ { "evaluator": { "params": [5], "type": "gt" }, "operator": { "type": "and" }, "query": { "params": ["A", "5m", "now"] }, "reducer": { "params": [], "type": "avg" } } ] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 } }, { "title": "Response Time Percentiles", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_bucket[5m])) by (le))", "legendFormat": "p99" }, { "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_bucket[5m])) by (le))", "legendFormat": "p95" }, { "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_bucket[5m])) by (le))", "legendFormat": "p50" } ], "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 } }, { "title": "Business KPIs", "type": "stat", "targets": [ { "expr": "sum(rate(orders_total[1h])) * 3600", "legendFormat": "Orders/Hour" }, { "expr": "sum(rate(revenue_total[1h])) * 3600", "legendFormat": "Revenue/Hour" }, { "expr": "sum(active_users)", "legendFormat": "Active Users" } ], "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 } } ] }}
Error Tracking & Alerting
Comprehensive Error Tracking
> Set up error tracking with:
> - Automatic error capture
> - Stack trace collection
> - User context
> - Release tracking
> - Error grouping
const Sentry = require('@sentry/node');
const { ProfilingIntegration } = require('@sentry/profiling-node');

// Initialize Sentry
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV,
  release: process.env.SERVICE_VERSION,
  integrations: [
    new Sentry.Integrations.Http({ tracing: true }),
    new Sentry.Integrations.Express({ app }),
    new ProfilingIntegration()
  ],
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
  profilesSampleRate: 0.1,
  beforeSend(event, hint) {
    // Filter out known issues
    if (event.exception?.values?.[0]?.type === 'NetworkError') {
      return null;
    }

    // Add custom context
    event.extra = {
      ...event.extra,
      nodeVersion: process.version,
      memory: process.memoryUsage()
    };

    return event;
  }
});

// Error handler middleware
app.use((err, req, res, next) => {
  // Log to our logger
  logger.error('Unhandled error', {
    error: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    ip: req.ip,
    userAgent: req.get('user-agent')
  });

  // Send to Sentry with context
  Sentry.withScope((scope) => {
    scope.setContext('request', {
      url: req.url,
      method: req.method,
      headers: req.headers,
      query: req.query,
      body: req.body
    });

    scope.setUser({
      id: req.user?.id,
      email: req.user?.email,
      ip_address: req.ip
    });

    scope.setTag('endpoint', req.route?.path || 'unknown');
    scope.setLevel('error');

    Sentry.captureException(err);
  });

  // Send error response
  res.status(err.status || 500).json({
    error: {
      message: process.env.NODE_ENV === 'production'
        ? 'Internal server error'
        : err.message,
      id: res.sentry
    }
  });
});
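For res.sentry to be populated and for request-scoped tracing to work, Sentry's own Express middleware typically has to be registered around your routes as well. A sketch based on the @sentry/node v7 Express handlers (verify against the SDK version you actually use):

// Register Sentry's Express handlers (API shown is the @sentry/node v7 style)
app.use(Sentry.Handlers.requestHandler()); // creates a per-request scope
app.use(Sentry.Handlers.tracingHandler()); // starts a transaction per request

// ...your routes go here...

// Sentry's error handler sets res.sentry; register it before the custom handler above
app.use(Sentry.Handlers.errorHandler());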
Intelligent Alerting
Section titled “Intelligent Alerting”global: resolve_timeout: 5m slack_api_url: 'YOUR_SLACK_WEBHOOK'
route: group_by: ['alertname', 'cluster', 'service'] group_wait: 10s group_interval: 10s repeat_interval: 12h receiver: 'default' routes: - match: severity: critical receiver: pagerduty continue: true - match: team: platform receiver: platform-slack - match: team: business receiver: business-alerts
receivers: - name: 'default' slack_configs: - channel: '#alerts' title: '{{ .GroupLabels.alertname }}' text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'pagerduty' pagerduty_configs: - service_key: 'YOUR_PAGERDUTY_KEY' description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
- name: 'platform-slack' slack_configs: - channel: '#platform-alerts' send_resolved: true title: '🚨 {{ .GroupLabels.alertname }}' text: | *Alert:* {{ .GroupLabels.alertname }} *Severity:* {{ .CommonLabels.severity }} *Description:* {{ .CommonAnnotations.description }} *Runbook:* <{{ .CommonAnnotations.runbook_url }}|View Runbook>
inhibit_rules: - source_match: severity: 'critical' target_match: severity: 'warning' equal: ['alertname', 'cluster', 'service']
Claude Code Telemetry
Monitoring Claude Code Usage
> Set up Claude Code telemetry monitoring:
> - Usage metrics
> - Cost tracking
> - Performance analysis
> - Team adoption metrics
# Enable Claude Code telemetry
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer your-token"

# Add custom attributes for team tracking
export OTEL_RESOURCE_ATTRIBUTES="department=engineering,team.id=platform,cost_center=eng-123"
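Something has to be listening at the OTLP endpoint above. A minimal OpenTelemetry Collector configuration that accepts these metrics over gRPC and exposes them for the Prometheus setup shown earlier might look like this; exporter names vary between collector versions, so treat it as a sketch:

# otel-collector-config.yaml (sketch)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889   # scrape this from Prometheus
  debug: {}                  # prints received logs for quick verification

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [debug]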
// Claude Code metrics dashboard queries
const claudeMetrics = {
  // Developer productivity
  linesOfCode: `
    sum(rate(claude_code.lines_of_code.count[1d])) by (user_account_uuid, type)
  `,

  // Cost tracking
  costPerTeam: `
    sum(claude_code.cost.usage) by (department, team_id)
  `,

  // Tool acceptance rate
  acceptanceRate: `
    sum(rate(claude_code.code_edit_tool.decision{decision="accept"}[1h]))
    / sum(rate(claude_code.code_edit_tool.decision[1h])) * 100
  `,

  // Active developers
  activeDevelopers: `
    count(count by (user_account_uuid)(claude_code.session.count))
  `,

  // Language usage
  languageDistribution: `
    sum(claude_code.code_edit_tool.decision) by (language)
  `
};
Performance Monitoring
Application Performance Monitoring (APM)
> Set up comprehensive APM:
> - Transaction tracing
> - Database query analysis
> - External service monitoring
> - Resource utilization
// APM integration
const apm = require('elastic-apm-node').start({
  serviceName: process.env.SERVICE_NAME,
  secretToken: process.env.ELASTIC_APM_SECRET_TOKEN,
  serverUrl: process.env.ELASTIC_APM_SERVER_URL,
  environment: process.env.NODE_ENV,
  transactionSampleRate: 0.1,
  captureBody: 'errors',
  errorOnAbortedRequests: true,
  captureErrorLogStackTraces: 'always',
  usePathAsTransactionName: false
});

// Custom transaction tracking
async function complexBusinessOperation(data) {
  const transaction = apm.startTransaction('process_order', 'business');

  try {
    // Track database operations
    const span = apm.startSpan('validate_inventory', 'db');
    const inventory = await checkInventory(data.items);
    span.end();

    // Track external API calls
    const paymentSpan = apm.startSpan('process_payment', 'external');
    const payment = await processPayment(data.payment);
    paymentSpan.end();

    // Track custom operations
    const fulfillmentSpan = apm.startSpan('create_fulfillment', 'custom');
    const order = await createFulfillmentOrder(data, payment);
    fulfillmentSpan.end();

    transaction.result = 'success';
    return order;
  } catch (error) {
    apm.captureError(error);
    transaction.result = 'error';
    throw error;
  } finally {
    transaction.end();
  }
}
SLO & SLA Monitoring
Service Level Monitoring
> Implement SLO monitoring:
> - Availability targets
> - Latency objectives
> - Error budgets
> - Burn rate alerts
# SLO recording and alerting rules
groups:
  - name: slo_rules
    interval: 30s
    rules:
      # Availability SLO - 99.9%
      - record: slo:availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # Error budget burn rate = observed error ratio / allowed error ratio (0.1%).
      # A burn rate of 1 means the 30-day budget lasts exactly 30 days.
      - record: slo:error_budget:burn_rate_1h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          ) / (1 - 0.999)

      - record: slo:error_budget:burn_rate_6h
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          ) / (1 - 0.999)

      # Latency SLO - 95% of requests under 500ms
      - record: slo:latency:ratio
        expr: |
          sum(rate(http_request_duration_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_count[5m]))

      # Multi-window multi-burn-rate alerts
      - alert: ErrorBudgetBurn
        expr: |
          (
            slo:error_budget:burn_rate_1h > 14.4
            and
            slo:error_budget:burn_rate_6h > 6
          )
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Error budget is burning too fast"
          description: "Error budget burn rate is {{ $value }} times normal"
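Why those two thresholds are paired: a burn rate of 14.4 means the 30-day error budget would be gone in 30 / 14.4 ≈ 2.1 days if the last hour's error ratio continued, while the slower 6x condition over six hours corresponds to exhausting the budget in roughly five days. Requiring both windows to fire at once, following the common multi-window multi-burn-rate pattern, keeps short error spikes from paging on-call for something with no lasting impact.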
Next Steps
You’ve learned how to leverage Claude Code for comprehensive observability, from instrumentation to visualization to alerting. The key is treating monitoring as a first-class citizen in your development process, not an afterthought.
Remember: You can’t fix what you can’t see. Use Claude Code to build observability into your applications from the start, creating systems that tell you what’s wrong before your users do. With proper monitoring, you’ll ship with confidence and sleep better at night.