Monitoring and Observability

Modern distributed systems require sophisticated monitoring and observability solutions. AI coding assistants excel at implementing OpenTelemetry, creating dashboards, writing queries, and setting up alerting, turning observability from a burden into a strategic advantage.

Traditional monitoring approaches struggle with:

  • Complex distributed architectures spanning multiple services
  • Massive data volumes from logs, metrics, and traces
  • Alert fatigue from poorly tuned thresholds
  • Manual correlation across different telemetry types
  • Reactive incident response rather than proactive prevention

AI assistants help by:

  • Generating instrumentation code with proper context propagation
  • Creating intelligent queries that surface insights
  • Building comprehensive dashboards tailored to your stack
  • Writing alert rules that reduce noise
  • Automating incident correlation across signals
@codebase "Set up OpenTelemetry for our Node.js microservices:
- Automatic instrumentation for Express, MongoDB, Redis
- Custom spans for business operations
- Trace context propagation across services
- Metrics for SLIs (latency, error rate, throughput)
- Structured logging with trace correlation
- Export to both Jaeger and Prometheus"
// AI-generated comprehensive OTel setup
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { trace } from '@opentelemetry/api';

// Initialize OpenTelemetry
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: process.env.SERVICE_NAME,
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.SERVICE_VERSION,
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT,
    }),
    exportIntervalMillis: 30000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': {
        enabled: false, // Too noisy
      },
    }),
  ],
});

sdk.start();

// Custom business instrumentation
// (getCurrentUserId and getTenantId are application-specific helpers)
export function createBusinessSpan(operation: string) {
  const tracer = trace.getTracer('business-operations');
  return tracer.startSpan(operation, {
    attributes: {
      'business.operation': operation,
      'business.user_id': getCurrentUserId(),
      'business.tenant': getTenantId(),
    },
  });
}

// Structured logging with trace correlation
export function getLogger(component: string) {
  return {
    info: (message: string, data?: any) => {
      const span = trace.getActiveSpan();
      const traceId = span?.spanContext().traceId;
      console.log(JSON.stringify({
        timestamp: new Date().toISOString(),
        level: 'info',
        component,
        message,
        traceId,
        ...data,
      }));
    },
    // Similar for error, warn, debug
  };
}
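A minimal usage sketch of the helpers above, wrapping a business operation in a span and emitting a correlated log line. The checkout flow and processCheckout function are hypothetical application code, not part of the generated setup.

// Illustrative usage of createBusinessSpan and getLogger (processCheckout is hypothetical)
const logger = getLogger('checkout');

export async function checkout(order: { id: string; total: number }) {
  const span = createBusinessSpan('checkout.process');
  try {
    logger.info('Processing checkout', { orderId: order.id });
    await processCheckout(order);
  } catch (err) {
    span.recordException(err as Error);
    throw err;
  } finally {
    span.end();
  }
}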
graph TB
  subgraph Applications
    A1[Service A]
    A2[Service B]
    A3[Service C]
  end
  subgraph Collection
    OC[OTel Collector]
    FA[Fluent Bit]
  end
  subgraph Storage
    P[Prometheus]
    L[Loki]
    T[Tempo]
  end
  subgraph Visualization
    G[Grafana]
    AM[Alert Manager]
  end
  A1 --> OC
  A2 --> OC
  A3 --> OC
  A1 --> FA
  A2 --> FA
  A3 --> FA
  OC --> P
  OC --> T
  FA --> L
  P --> G
  L --> G
  T --> G
  G --> AM

AI Prompt for Stack Setup

"Create a complete monitoring stack with:
- OpenTelemetry Collector with proper configuration
- Prometheus for metrics with retention policies
- Loki for logs with S3 backend
- Tempo for traces with search capabilities
- Grafana with pre-configured data sources
- Proper networking and persistence"

The AI will generate a comprehensive docker-compose.yml with all services properly configured and connected.
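A minimal sketch of what such a compose file might look like. Image tags, ports, and config file paths are assumptions to adapt, and persistence volumes are omitted for brevity.

# Sketch only: adjust images, ports, and config paths to your environment
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC from instrumented services
  prometheus:
    image: prom/prometheus:latest
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=30d
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki:latest
    volumes:
      - ./loki.yaml:/etc/loki/local-config.yaml
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - ./provisioning:/etc/grafana/provisioning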

@prometheus @grafana "Create a service health dashboard showing:
- Request rate, error rate, duration (RED metrics)
- P50, P95, P99 latencies by endpoint
- Error budget burn rate
- Active alerts and their status
- Resource utilization (CPU, memory, connections)
- Dependency health status
Use PromQL best practices and make it responsive"
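As a reference point, the PromQL behind such a dashboard often ends up as recording rules like the sketch below. The http_requests_total counter and the http_request_duration_seconds histogram (with service and endpoint labels) are assumed metric names.

# Sketch: recording rules for RED metrics (metric and label names are assumptions)
groups:
  - name: red_metrics
    rules:
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service)
      - record: service:http_errors:ratio5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      - record: service_endpoint:http_request_duration_seconds:p95
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service, endpoint)
          )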
  1. Define Key Metrics

    "Identify our key business metrics:
    - User engagement (DAU, MAU, retention)
    - Transaction success rates
    - Revenue impact of errors
    - Feature adoption rates"
  2. Generate Queries

    "Write PromQL queries for these business metrics,
    correlating technical metrics with business impact"
  3. Create Visualizations

    "Design Grafana panels that tell a story,
    with drill-down capabilities from business to technical metrics"
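For step 2, a business-facing query might look like this sketch; the business_checkout_* metric names are hypothetical and stand in for whatever counters your services emit.

# Hypothetical: checkout failure ratio as a business metric
sum(rate(business_checkout_failed_total[5m]))
/
sum(rate(business_checkout_total[5m]))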
// AI-generated trace propagation middleware
import {
  trace,
  context,
  SpanKind,
  SpanStatusCode,
  defaultTextMapGetter,
  defaultTextMapSetter,
} from '@opentelemetry/api';
import { W3CTraceContextPropagator } from '@opentelemetry/core';

export const traceMiddleware = (req, res, next) => {
  const tracer = trace.getTracer('http-middleware');
  const propagator = new W3CTraceContextPropagator();

  // Extract context from incoming request
  const parentContext = propagator.extract(
    context.active(),
    req.headers,
    defaultTextMapGetter
  );

  // Start new span as child of extracted context
  // (req.route is only set inside route handlers, so fall back to req.path)
  const span = tracer.startSpan(
    `${req.method} ${req.route?.path ?? req.path}`,
    {
      kind: SpanKind.SERVER,
      attributes: {
        'http.method': req.method,
        'http.url': req.url,
        'http.target': req.route?.path ?? req.path,
        'http.host': req.hostname,
        'http.scheme': req.protocol,
        'user.id': req.user?.id,
      },
    },
    parentContext
  );

  // Inject context for downstream services
  const headers = {};
  propagator.inject(
    trace.setSpan(context.active(), span),
    headers,
    defaultTextMapSetter
  );
  req.traceHeaders = headers;

  // Complete span on response
  res.on('finish', () => {
    span.setAttributes({
      'http.status_code': res.statusCode,
      'http.response.size': res.get('content-length'),
    });
    span.setStatus({
      code: res.statusCode >= 400 ? SpanStatusCode.ERROR : SpanStatusCode.OK,
    });
    span.end();
  });

  next();
};
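A sketch of how this middleware might be wired into an Express service and how the injected headers reach a downstream call. The inventory-service URL, the axios client, and the route are illustrative, not part of the generated middleware.

// Illustrative wiring; assumes traceMiddleware is imported from the module above
import express from 'express';
import axios from 'axios';

const app = express();
app.use(traceMiddleware);

app.get('/orders/:id', async (req: any, res) => {
  // traceMiddleware stored W3C trace headers on the request
  const stock = await axios.get(
    `http://inventory-service/stock/${req.params.id}`,
    { headers: req.traceHeaders }
  );
  res.json({ orderId: req.params.id, stock: stock.data });
});

app.listen(3000);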
# AI-generated Loki configuration
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2023-01-01
      store: boltdb-shipper
      object_store: s3
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  aws:
    s3: s3://access_key:secret_key@endpoint/bucket
    s3forcepathstyle: true
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    shared_store: s3

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  max_streams_per_user: 10000
  max_global_streams_per_user: 10000

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 10m
# AI-generated Prometheus alert rules
groups:
  - name: service_health
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $labels.service }} has {{ $value | humanizePercentage }} error rate"
          runbook_url: "https://wiki.example.com/runbooks/high-error-rate"

      - alert: SLOBurnRateTooHigh
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
              /
              sum(rate(http_requests_total[5m])) by (service)
            )
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "SLO burn rate too high for {{ $labels.service }}"
          description: "At this rate, {{ $labels.service }} will exhaust error budget in < 1 day"

      - alert: PodMemoryUsageHigh
        expr: |
          (
            container_memory_working_set_bytes{pod!=""}
            /
            container_spec_memory_limit_bytes{pod!=""} > 0
          ) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} memory usage > 90%"
          description: "Consider increasing memory limits or optimizing application"
# AI-generated AlertManager configuration
global:
  resolve_timeout: 5m
  slack_api_url: '{{ SLACK_API_URL }}'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: pagerduty
      continue: true
    # Business hours only for warnings
    - match:
        severity: warning
      receiver: slack-warnings
      active_time_intervals:
        - business-hours
    # Specific team routing
    - match_re:
        service: ^(payment|billing).*
      receiver: finance-team
    # Development environment alerts
    - match:
        environment: development
      receiver: dev-null

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '{{ PAGERDUTY_SERVICE_KEY }}'
        description: '{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}'
  - name: 'dev-null'
    # Silently drop

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'service']
"Create a Grafana dashboard that correlates:
- Traces showing slow requests
- Logs from those specific trace IDs
- Metrics showing system state during the incident
- Related alerts that fired
Use Grafana's Explore view with Loki/Tempo integration"
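One way Grafana supports the log-to-trace hop is Loki's derived fields. Below is a provisioning sketch; the Tempo datasource UID, the Loki URL, and the traceId regex (matching the JSON logger shown earlier) are assumptions for your environment.

# Sketch: Grafana datasource provisioning with log-to-trace linking
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        # Extract traceId from the JSON log line and link it to Tempo
        - name: TraceID
          matcherRegex: '"traceId":"(\w+)"'
          datasourceUid: tempo          # UID of the Tempo datasource (assumed)
          url: '$${__value.raw}'        # $$ escapes env interpolation in provisioning files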
  1. Identify Anomaly

    # AI-generated anomaly detection query
    (
      rate(http_request_duration_seconds_sum[5m])
      /
      rate(http_request_duration_seconds_count[5m])
    )
    >
    (
      avg_over_time(
        (
          rate(http_request_duration_seconds_sum[5m])
          /
          rate(http_request_duration_seconds_count[5m])
        )[1h:5m]
      ) * 2
    )
  2. Trace Analysis

    "For trace ID X, analyze:
    - Which span took longest
    - Any errors in child spans
    - Database query patterns
    - External service calls"
  3. Log Investigation

    # AI-generated LogQL query
    {service="api"}
    | json
    | trace_id="abc123"
    | line_format "{{.timestamp}} [{{.level}}] {{.message}}"
    | pattern "<_> error=<error> <_>"
  4. Metric Correlation

    "Show me CPU, memory, and connection pool metrics
    for the service during the incident window"
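For step 4, the underlying queries are typically along these lines; the container metric names assume cAdvisor-style exporters and the pod regex is illustrative.

# CPU usage per pod of the affected service
sum(rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m])) by (pod)

# Working-set memory per pod
container_memory_working_set_bytes{pod=~"api-.*"}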
@tempo @grafana "Create queries to find:
- Slowest endpoints by P95 latency
- Most frequent database queries
- Services with highest error rates
- Traces with most spans (complexity)
- Cache hit rates by service"

Automated Performance Analysis

// AI-generated performance analyzer
// (TimeRange, Trace, and the injected TempoClient search wrapper are application-level types)
export class PerformanceAnalyzer {
  constructor(private tempo: TempoClient) {}

  async analyzeTraces(timeRange: TimeRange) {
    const slowTraces = await this.tempo.search({
      query: 'duration > 1s',
      limit: 100,
      start: timeRange.start,
      end: timeRange.end,
    });

    const patterns = this.identifyPatterns(slowTraces);
    const recommendations = this.generateRecommendations(patterns);

    return {
      bottlenecks: patterns.bottlenecks,
      recommendations,
      estimatedImpact: this.calculateImpact(patterns),
    };
  }

  private identifyPatterns(traces: Trace[]) {
    // Group by operation and analyze
    const operationStats = traces.reduce((acc, trace) => {
      trace.spans.forEach(span => {
        if (!acc[span.operation]) {
          acc[span.operation] = {
            count: 0,
            totalDuration: 0,
            errors: 0,
          };
        }
        acc[span.operation].count++;
        acc[span.operation].totalDuration += span.duration;
        if (span.status === 'error') acc[span.operation].errors++;
      });
      return acc;
    }, {} as Record<string, { count: number; totalDuration: number; errors: number }>);

    // Find bottlenecks: operations averaging more than 100 ms
    const bottlenecks = Object.entries(operationStats)
      .filter(([_, stats]) => stats.totalDuration / stats.count > 100)
      .sort((a, b) => b[1].totalDuration - a[1].totalDuration)
      .slice(0, 5);

    return { operationStats, bottlenecks };
  }

  // generateRecommendations and calculateImpact omitted for brevity
}
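A usage sketch, assuming a Tempo search client has been constructed elsewhere and that TimeRange is an epoch-millisecond start/end pair; both are assumptions, since the analyzer leaves those types to the application.

// Illustrative usage; tempoClient construction depends on your Tempo API wrapper
async function runAnalysis(tempoClient: TempoClient) {
  const analyzer = new PerformanceAnalyzer(tempoClient);
  const report = await analyzer.analyzeTraces({
    start: Date.now() - 60 * 60 * 1000, // last hour
    end: Date.now(),
  });
  console.table(report.bottlenecks);
}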
# AI-generated retention policies
retention_policies:
  metrics:
    - selector: '{__name__=~"up|scrape_.*"}'
      retention: 30d   # Infrastructure metrics
    - selector: '{__name__=~"http_.*"}'
      retention: 90d   # Application metrics
    - selector: '{__name__=~"business_.*"}'
      retention: 400d  # Business metrics
  logs:
    - level: debug
      retention: 1d
    - level: info
      retention: 7d
    - level: warning
      retention: 30d
    - level: error
      retention: 90d
  traces:
    - sampling_rate: 0.1   # 10% for normal traffic
    - error_sampling: 1.0  # 100% for errors
    - slow_sampling: 0.5   # 50% for slow requests
"Analyze our observability stack and suggest optimizations:
- Cardinality reduction for Prometheus
- Log parsing to reduce storage
- Trace sampling strategies
- Dashboard query optimization
- Alert deduplication"
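For the recording-rule item in that prompt, a cardinality-reducing aggregation might look like the following sketch; the metric and label names are assumptions.

# Sketch: drop per-pod/instance labels before dashboards query the histogram
groups:
  - name: cardinality_reduction
    rules:
      - record: service:http_request_duration_seconds_bucket:rate5m
        expr: sum without (pod, instance) (rate(http_request_duration_seconds_bucket[5m]))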
{
  "mcpServers": {
    "sentry": {
      "command": "npx",
      "args": ["-y", "sentry-mcp"],
      "env": {
        "SENTRY_AUTH_TOKEN": "${SENTRY_AUTH_TOKEN}",
        "SENTRY_ORG": "your-org"
      }
    }
  }
}

Usage:

"Using Sentry MCP:
- Get error trends for the last 24h
- Find most frequent errors
- Analyze error impact by user count
- Create GitHub issues for critical errors"
{
  "mcpServers": {
    "grafana": {
      "command": "npx",
      "args": ["-y", "grafana-mcp"],
      "env": {
        "GRAFANA_URL": "https://grafana.example.com",
        "GRAFANA_API_KEY": "${GRAFANA_API_KEY}"
      }
    }
  }
}
  1. Version Control Everything

    • Dashboard definitions in JSON
    • Alert rules in YAML
    • Collector configurations
    • Retention policies
  2. Automate Deployment

    # AI-generated CI/CD for observability
    deploy-monitoring:
      steps:
        - name: Validate Configs
          run: |
            promtool check rules alerts/*.yml
            grafana-validator dashboards/*.json
        - name: Deploy Prometheus Rules
          run: |
            kubectl apply -f k8s/prometheus-rules.yaml
        - name: Import Dashboards
          run: |
            for dashboard in dashboards/*.json; do
              curl -X POST -H "Authorization: Bearer $GRAFANA_TOKEN" \
                -H "Content-Type: application/json" \
                -d @$dashboard \
                https://grafana.example.com/api/dashboards/db
            done
  3. Test Your Observability

    "Create tests for our monitoring:
    - Verify alerts fire correctly
    - Test dashboard queries return data
    - Validate trace propagation
    - Check log parsing rules"
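For the "verify alerts fire correctly" point, Prometheus ships promtool test rules. Below is a sketch of a unit test for the HighErrorRate rule above; the series values and file paths are illustrative.

# Sketch: promtool unit test for the HighErrorRate alert
rule_files:
  - alerts/service_health.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+60x10'   # 1 error/sec
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+60x10'   # 1 success/sec -> 50% error rate
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
              severity: critical
              team: platform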

Data Privacy

  • Mask sensitive data in logs
  • Implement field-level encryption
  • Use sampling for sensitive operations
  • Comply with retention policies
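For the masking point above, one lightweight approach is to redact known-sensitive fields before the structured logger serializes them. A sketch, where the field list is an assumption to adapt to your data model:

// Sketch: strip sensitive fields before JSON.stringify in getLogger
const SENSITIVE_KEYS = new Set(['password', 'email', 'ssn', 'cardNumber']); // assumed list

export function redact(data: Record<string, unknown> = {}) {
  return Object.fromEntries(
    Object.entries(data).map(([key, value]) =>
      SENSITIVE_KEYS.has(key) ? [key, '[REDACTED]'] : [key, value]
    )
  );
}

// In getLogger: spread redact(data) instead of data before logging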

Access Control

  • RBAC for dashboards
  • Separate prod/dev data
  • Audit trail for changes
  • Secure credential storage
"Analyze Prometheus metrics for high cardinality:
- Find metrics with most unique label combinations
- Suggest label reduction strategies
- Implement recording rules for aggregation
- Add cardinality enforcement"
"Implement ML-based anomaly detection:
- Train models on historical metrics
- Detect unusual patterns automatically
- Predict capacity needs
- Auto-generate incident reports"
  • OpenTelemetry Profiling - Coming soon for continuous profiling
  • eBPF Integration - Zero-instrumentation observability
  • OTTL - OpenTelemetry Transformation Language
  • Trace-based Testing - Validate behavior via traces
  1. Start with OpenTelemetry - It’s vendor-neutral and future-proof
  2. Instrument comprehensively - Traces, metrics, and logs together
  3. Automate everything - From setup to insights
  4. Focus on business impact - Connect technical metrics to outcomes
  5. Optimize continuously - Monitor your monitoring

The combination of AI assistance and modern observability tools makes comprehensive monitoring accessible to teams of any size. Start small, iterate quickly, and let AI handle the complexity while you focus on insights.