Effective monitoring transforms reactive firefighting into proactive system management. With AI assistance, you can build sophisticated monitoring systems that detect issues before users notice, predict failures, and automatically initiate recovery. This guide covers battle-tested monitoring patterns for production systems.
Successful monitoring follows these principles:
- Alert on Symptoms: focus on user-facing issues, not every metric spike.
- Golden Signals: latency, traffic, errors, and saturation tell the story.
- Context is King: link metrics to traces, logs, and business impact.
- Automate Response: turn insights into automated remediation (a minimal remediation sketch follows this list).
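The "Automate Response" principle is the least concrete of the four, so here is a hedged sketch of an alert-webhook handler that maps firing alerts to remediation actions. The webhook shape follows the common Alertmanager payload; the `scaleOut` and `rollbackLatest` helpers are purely illustrative placeholders:

```javascript
const express = require('express');

// Map alert names to remediation actions; the handlers are hypothetical placeholders.
const remediations = {
  high_cpu_usage: async labels => scaleOut(labels.service),        // hypothetical helper
  high_error_rate: async labels => rollbackLatest(labels.service)  // hypothetical helper
};

const app = express();
app.use(express.json());

// Receives alert webhooks (e.g. from Alertmanager) and runs the matching remediation.
app.post('/alerts', async (req, res) => {
  const alerts = req.body.alerts || [];
  for (const alert of alerts) {
    const action = remediations[alert.labels?.alertname];
    if (alert.status === 'firing' && action) {
      await action(alert.labels);
    }
  }
  res.sendStatus(200);
});

app.listen(8080);
```

Keeping remediation behind explicit, per-alert mappings like this avoids the trap of blanket auto-restarts that mask the underlying problem.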
Google’s SRE book popularized the four golden signals, the key metrics that indicate overall system health.
```yaml
# Latency - request duration
- alert: high_latency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (service, method, le)
    ) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High latency on {{ $labels.service }}"
    description: "95th percentile latency is {{ $value }}s"

# Traffic - request rate
- alert: traffic_spike
  expr: |
    sum(rate(http_requests_total[5m])) by (service)
      > 2 * avg_over_time(
        (sum(rate(http_requests_total[5m])) by (service))[1h:5m]
      )
  for: 10m
  labels:
    severity: info
  annotations:
    summary: "Traffic spike detected"

# Errors - error rate
- alert: high_error_rate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
      /
    sum(rate(http_requests_total[5m])) by (service)
      > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error rate above 5%"

# Saturation - resource usage
- alert: high_cpu_usage
  expr: |
    100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "CPU usage above 80%"
```
```javascript
// AWS CloudWatch implementation
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

class GoldenSignalsMonitor {
  async createAlarms(serviceName) {
    // Latency alarm
    await cloudwatch.putMetricAlarm({
      AlarmName: `${serviceName}-high-latency`,
      MetricName: 'Duration',
      Namespace: 'AWS/Lambda',
      Statistic: 'Average',
      Period: 300,
      EvaluationPeriods: 2,
      Threshold: 1000,
      ComparisonOperator: 'GreaterThanThreshold',
      AlarmActions: [process.env.SNS_TOPIC_ARN]
    }).promise();

    // Error rate alarm
    await cloudwatch.putMetricAlarm({
      AlarmName: `${serviceName}-high-errors`,
      MetricName: 'Errors',
      Namespace: 'AWS/Lambda',
      Statistic: 'Sum',
      Period: 300,
      EvaluationPeriods: 1,
      Threshold: 10,
      ComparisonOperator: 'GreaterThanThreshold'
    }).promise();

    // Custom metric for traffic
    const params = {
      Namespace: 'CustomApp',
      MetricData: [{
        MetricName: 'RequestCount',
        Value: 1,
        Unit: 'Count',
        Dimensions: [
          { Name: 'Service', Value: serviceName }
        ]
      }]
    };
    await cloudwatch.putMetricData(params).promise();
  }
}
```
Monitor Service Level Indicators (SLIs) against Service Level Objectives (SLOs).
```javascript
// SLO monitoring implementation
class SLOMonitor {
  constructor(prometheus) {
    this.prometheus = prometheus;
    this.slos = new Map();
  }

  defineSLO(name, config) {
    this.slos.set(name, {
      name,
      description: config.description,
      sli: config.sli,
      target: config.target,
      window: config.window || '30d',
      burnRate: config.burnRate || {
        '1h': 14.4,  // 14.4x burn rate = 2% of budget in 1h
        '6h': 6,     // 6x burn rate = 5% of budget in 6h
        '1d': 3,     // 3x burn rate = 10% of budget in 1d
        '3d': 1      // 1x burn rate = 10% of budget in 3d
      }
    });
  }

  generateAlerts() {
    const alerts = [];

    for (const [name, slo] of this.slos) {
      // Multi-window, multi-burn-rate alerts (simplified, illustrative expressions)
      for (const [shortWindow, shortBurn] of Object.entries({ '5m': 14.4, '30m': 6 })) {
        for (const [longWindow, longBurn] of Object.entries({ '1h': 14.4, '6h': 6 })) {
          alerts.push({
            alert: `${name}_burn_rate`,
            expr: `(
              ${slo.sli}[${shortWindow}] < ${slo.target}
              AND
              ${slo.sli}[${longWindow}] < ${slo.target}
            )`,
            labels: { severity: 'page', slo: name },
            annotations: {
              summary: `SLO ${name} burn rate exceeded`,
              description: `Error budget burn rate is above threshold`
            }
          });
        }
      }
    }

    return alerts;
  }

  calculateErrorBudget(sloName, timeRange = '30d') {
    const slo = this.slos.get(sloName);
    // Observed error ratio over the window; compare with (1 - target) to see budget consumption
    const query = `
      1 - (
        sum(increase(${slo.sli}[${timeRange}]))
        /
        sum(increase(requests_total[${timeRange}]))
      )
    `;

    return this.prometheus.query(query);
  }
}

// Usage
const monitor = new SLOMonitor(prometheusClient);

monitor.defineSLO('api-availability', {
  description: 'API availability SLO',
  sli: 'http_requests_total{status!~"5.."}',
  target: 0.999,  // 99.9% availability
  window: '30d'
});

monitor.defineSLO('api-latency', {
  description: 'API latency SLO',
  sli: 'http_request_duration_seconds{quantile="0.95"} < 0.3',
  target: 0.95,  // 95% of requests under 300ms
  window: '30d'
});
```
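The burn-rate multipliers in the table above follow directly from budget arithmetic: the burn rate is the fraction of error budget consumed divided by the fraction of the SLO window that has elapsed. A minimal sketch of that calculation, assuming a 30-day SLO window (the numbers are the standard multi-window values, not tied to any particular service):

```javascript
// Burn rate = (fraction of error budget consumed) / (fraction of the SLO window elapsed).
// With a 99.9% target, a 14.4x burn rate corresponds to an error ratio of 14.4 * 0.001 ≈ 1.44%.
function burnRate(budgetFractionConsumed, alertWindowHours, sloWindowDays = 30) {
  const windowFraction = alertWindowHours / (sloWindowDays * 24);
  return budgetFractionConsumed / windowFraction;
}

console.log(burnRate(0.02, 1));  // 14.4 -> 2% of budget burned in 1 hour
console.log(burnRate(0.05, 6));  // 6    -> 5% of budget burned in 6 hours
console.log(burnRate(0.10, 72)); // 1    -> 10% of budget burned in 3 days
```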
Use machine learning to predict issues before they occur.
AI-Powered Anomaly Detection
```javascript
class PredictiveMonitor {
  constructor(metricsStore, alertManager) {
    this.metricsStore = metricsStore;
    this.alertManager = alertManager;
    this.models = new Map();
  }

  async trainModel(metricName, options = {}) {
    const historicalData = await this.metricsStore.query({
      metric: metricName,
      start: '-30d',
      end: 'now',
      step: options.step || '5m'
    });

    // Feature engineering
    const features = this.extractFeatures(historicalData);

    // Train model (simplified - use a proper ML library in production)
    const model = {
      metric: metricName,
      seasonality: this.detectSeasonality(features),
      trend: this.calculateTrend(features),
      stdDev: this.calculateStdDev(features),
      updateTime: Date.now()
    };

    this.models.set(metricName, model);
    return model;
  }

  extractFeatures(data) {
    return {
      hourOfDay: data.map(d => new Date(d.timestamp).getHours()),
      dayOfWeek: data.map(d => new Date(d.timestamp).getDay()),
      values: data.map(d => d.value),
      differences: data.slice(1).map((d, i) => d.value - data[i].value),
      movingAvg: this.movingAverage(data.map(d => d.value), 12)
    };
  }

  async detectAnomalies(metricName, realtimeValue) {
    const model = this.models.get(metricName);
    if (!model) {
      throw new Error(`No model trained for ${metricName}`);
    }

    const now = new Date();
    const expectedValue = this.predict(model, now);
    const threshold = model.stdDev * 3; // 3-sigma rule

    const anomaly = Math.abs(realtimeValue - expectedValue) > threshold;
    let confidence = null;

    if (anomaly) {
      confidence = this.calculateConfidence(realtimeValue, expectedValue, threshold);

      await this.alertManager.createAlert({
        title: `Anomaly detected in ${metricName}`,
        severity: confidence > 0.9 ? 'critical' : 'warning',
        details: {
          expected: expectedValue,
          actual: realtimeValue,
          deviation: Math.abs(realtimeValue - expectedValue),
          confidence,
          model: {
            lastUpdated: new Date(model.updateTime),
            accuracy: model.accuracy
          }
        }
      });
    }

    return { anomaly, expectedValue, confidence };
  }

  predict(model, timestamp) {
    const hour = timestamp.getHours();
    const day = timestamp.getDay();

    // Simplified prediction combining trend and seasonality
    let prediction = model.trend.baseline;

    // Add hourly seasonality
    if (model.seasonality.hourly) {
      prediction += model.seasonality.hourly[hour];
    }

    // Add weekly seasonality
    if (model.seasonality.weekly) {
      prediction += model.seasonality.weekly[day];
    }

    return prediction;
  }

  async forecastCapacity(resource, days = 30) {
    const model = this.models.get(`${resource}_usage`);
    const currentUsage = await this.getCurrentUsage(resource);
    const growthRate = model.trend.rate;

    const forecast = [];
    for (let d = 0; d < days; d++) {
      const predictedUsage = currentUsage * Math.pow(1 + growthRate, d);
      forecast.push({
        date: new Date(Date.now() + d * 24 * 60 * 60 * 1000),
        usage: predictedUsage,
        percentOfCapacity: predictedUsage // usage is already expressed as a percentage of capacity
      });
    }

    // Alert if capacity will be exceeded
    const capacityBreach = forecast.find(f => f.percentOfCapacity > 80);
    if (capacityBreach) {
      await this.alertManager.createAlert({
        title: `${resource} capacity warning`,
        severity: 'warning',
        details: {
          message: `${resource} will reach 80% capacity on ${capacityBreach.date}`,
          currentUsage: `${currentUsage}%`,
          projectedUsage: `${capacityBreach.usage}%`,
          daysUntilBreach: forecast.indexOf(capacityBreach)
        }
      });
    }

    return forecast;
  }
}
```
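A minimal usage sketch, assuming the `metricsStore` and `alertManager` objects passed to the constructor are already wired up and that a metric named `api_request_rate` exists (both are illustrative, and `metricsStore.latest` is a hypothetical helper):

```javascript
// Train once (or on a schedule), then score incoming values every minute.
const predictor = new PredictiveMonitor(metricsStore, alertManager);

async function watchRequestRate() {
  await predictor.trainModel('api_request_rate', { step: '5m' });

  setInterval(async () => {
    const current = await metricsStore.latest('api_request_rate'); // hypothetical helper
    const { anomaly, expectedValue } = await predictor.detectAnomalies('api_request_rate', current);
    if (anomaly) {
      console.log(`Expected ~${expectedValue.toFixed(1)} req/s, observed ${current}`);
    }
  }, 60000);
}

watchRequestRate();
```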
Link technical metrics to business outcomes.
```javascript
class BusinessMetricsMonitor {
  constructor(analytics, monitoring) {
    this.analytics = analytics;
    this.monitoring = monitoring;
  }

  async setupBusinessDashboard() {
    // Revenue impact monitoring
    await this.monitoring.createMetric({
      name: 'revenue_per_minute',
      query: `
        sum(rate(order_total_amount[1m]))
        * avg(order_conversion_rate)
      `,
      unit: 'dollars',
      alerts: [{
        condition: 'decrease > 20%',
        severity: 'critical',
        message: 'Revenue drop detected'
      }]
    });

    // User experience metrics (Apdex: satisfied plus half of tolerating, over all requests)
    await this.monitoring.createMetric({
      name: 'apdex_score',
      query: `
        (
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          +
          sum(rate(http_request_duration_seconds_bucket{le="2.0"}[5m]))
        ) / 2
        /
        sum(rate(http_request_duration_seconds_count[5m]))
      `,
      unit: 'ratio',
      alerts: [{
        condition: '< 0.8',
        severity: 'warning',
        message: 'User experience degraded (Apdex < 0.8)'
      }]
    });

    // Conversion funnel monitoring
    const funnelSteps = [
      'page_view',
      'add_to_cart',
      'checkout_start',
      'payment_complete'
    ];

    for (let i = 1; i < funnelSteps.length; i++) {
      const fromStep = funnelSteps[i - 1];
      const toStep = funnelSteps[i];

      await this.monitoring.createMetric({
        name: `conversion_${fromStep}_to_${toStep}`,
        query: `
          sum(rate(events_total{event="${toStep}"}[5m]))
          /
          sum(rate(events_total{event="${fromStep}"}[5m]))
        `,
        unit: 'ratio',
        alerts: [{
          condition: 'decrease > 15%',
          severity: 'warning',
          message: `Conversion drop: ${fromStep} → ${toStep}`
        }]
      });
    }
  }

  async correlateWithTechnical(businessMetric, timeRange) {
    // Find technical metrics that correlate with the business metric
    const businessData = await this.analytics.getMetric(businessMetric, timeRange);
    const technicalMetrics = await this.monitoring.getAllMetrics();

    const correlations = [];

    for (const techMetric of technicalMetrics) {
      const techData = await this.monitoring.getMetric(techMetric.name, timeRange);
      const correlation = this.calculateCorrelation(businessData, techData);

      if (Math.abs(correlation) > 0.7) {
        correlations.push({
          technical: techMetric.name,
          business: businessMetric,
          correlation,
          impact: this.estimateImpact(correlation, techData, businessData)
        });
      }
    }

    return correlations.sort((a, b) => Math.abs(b.correlation) - Math.abs(a.correlation));
  }
}
```
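The `calculateCorrelation` helper is left undefined above. A straightforward Pearson correlation over two aligned numeric series would look roughly like this, assuming both series are arrays of values sampled on the same timestamps:

```javascript
// Pearson correlation coefficient between two equally sampled numeric series.
// Returns a value in [-1, 1]; 0 is returned for degenerate (constant) series.
function calculateCorrelation(seriesA, seriesB) {
  const n = Math.min(seriesA.length, seriesB.length);
  const a = seriesA.slice(0, n);
  const b = seriesB.slice(0, n);

  const mean = xs => xs.reduce((s, x) => s + x, 0) / xs.length;
  const meanA = mean(a);
  const meanB = mean(b);

  let cov = 0, varA = 0, varB = 0;
  for (let i = 0; i < n; i++) {
    const da = a[i] - meanA;
    const db = b[i] - meanB;
    cov += da * db;
    varA += da * da;
    varB += db * db;
  }

  const denom = Math.sqrt(varA * varB);
  return denom === 0 ? 0 : cov / denom;
}
```

In practice both series should be resampled onto a common time grid before correlating, since business and technical metrics rarely share the same scrape interval.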
Connect metrics with traces for deep insights.
```javascript
// OpenTelemetry integration
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');
const { trace } = require('@opentelemetry/api');

class TracingMetricsCollector {
  constructor() {
    this.exporter = new PrometheusExporter({ port: 9090 });
    this.meterProvider = new MeterProvider();
    this.meterProvider.addMetricReader(this.exporter);

    this.setupMetrics();
  }

  setupMetrics() {
    this.meter = this.meterProvider.getMeter('app-metrics');

    // Request duration histogram linked to traces
    this.requestDuration = this.meter.createHistogram('http_request_duration', {
      description: 'HTTP request duration in seconds',
      unit: 's'
    });

    // Active spans gauge
    this.activeSpans = this.meter.createUpDownCounter('active_spans', {
      description: 'Number of active spans'
    });

    // Error counter with trace context
    this.errors = this.meter.createCounter('errors_total', {
      description: 'Total number of errors with trace context'
    });
  }

  recordRequest(duration, attributes) {
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();

    this.requestDuration.record(duration, {
      ...attributes,
      trace_id: spanContext?.traceId,
      span_id: spanContext?.spanId,
      has_error: span?.status?.code === 2
    });
  }

  async monitorTraceHealth() {
    // Monitor trace sampling effectiveness (synchronous gauges need a recent @opentelemetry/api)
    const samplingRate = this.meter.createGauge('trace_sampling_rate', {
      description: 'Current trace sampling rate'
    });

    // Monitor trace completion
    const traceCompleteness = this.meter.createGauge('trace_completeness', {
      description: 'Percentage of complete traces'
    });

    setInterval(async () => {
      const stats = await this.calculateTraceStats();
      samplingRate.record(stats.samplingRate);
      traceCompleteness.record(stats.completeness);
    }, 60000);
  }
}
```
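To feed `recordRequest` with real traffic, the collector can be wired into an HTTP framework. A minimal sketch using Express middleware, where the route labels and port are illustrative:

```javascript
const express = require('express');

const app = express();
const collector = new TracingMetricsCollector();

// Record duration and trace context for every request once the response finishes.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const seconds = Number(process.hrtime.bigint() - start) / 1e9;
    collector.recordRequest(seconds, {
      method: req.method,
      route: req.route ? req.route.path : req.path,
      status_code: res.statusCode
    });
  });
  next();
});

app.get('/healthz', (req, res) => res.send('ok'));
app.listen(3000);
```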
Proactively test critical user journeys.
```javascript
class SyntheticMonitor {
  constructor(monitoring, alerting) {
    this.monitoring = monitoring;
    this.alerting = alerting;
    this.scenarios = new Map();
  }

  defineScenario(name, scenario) {
    this.scenarios.set(name, {
      name,
      interval: scenario.interval || 300000, // 5 minutes default
      timeout: scenario.timeout || 30000,
      steps: scenario.steps,
      assertions: scenario.assertions,
      regions: scenario.regions || ['us-east-1']
    });
  }

  async runScenario(name) {
    const scenario = this.scenarios.get(name);
    const results = new Map();

    for (const region of scenario.regions) {
      const startTime = Date.now();
      const stepResults = [];
      let success = true;

      try {
        for (const step of scenario.steps) {
          const stepStart = Date.now();
          const result = await this.executeStep(step, region);

          stepResults.push({
            name: step.name,
            duration: Date.now() - stepStart,
            success: result.success,
            details: result.details
          });

          if (!result.success) {
            success = false;
            break;
          }
        }

        // Run assertions
        if (success && scenario.assertions) {
          for (const assertion of scenario.assertions) {
            if (!await this.checkAssertion(assertion, stepResults)) {
              success = false;
              break;
            }
          }
        }
      } catch (error) {
        success = false;
        stepResults.push({
          name: 'error',
          error: error.message,
          stack: error.stack
        });
      }

      const totalDuration = Date.now() - startTime;

      results.set(region, {
        success,
        duration: totalDuration,
        steps: stepResults,
        timestamp: new Date()
      });

      // Record metrics
      await this.recordMetrics(name, region, {
        success,
        duration: totalDuration,
        steps: stepResults
      });
    }

    // Alert on failures
    await this.checkAlerts(name, results);

    return results;
  }

  async executeStep(step, region) {
    switch (step.type) {
      case 'http':
        return await this.executeHttpStep(step, region);
      case 'browser':
        return await this.executeBrowserStep(step, region);
      case 'api':
        return await this.executeApiStep(step, region);
      default:
        throw new Error(`Unknown step type: ${step.type}`);
    }
  }

  async recordMetrics(scenarioName, region, result) {
    // Success rate
    await this.monitoring.recordMetric({
      name: 'synthetic_success_rate',
      value: result.success ? 1 : 0,
      labels: { scenario: scenarioName, region, type: 'synthetic' }
    });

    // Duration
    await this.monitoring.recordMetric({
      name: 'synthetic_duration_seconds',
      value: result.duration / 1000,
      labels: { scenario: scenarioName, region, success: result.success }
    });

    // Step-level metrics
    for (const step of result.steps) {
      await this.monitoring.recordMetric({
        name: 'synthetic_step_duration_seconds',
        value: step.duration / 1000,
        labels: {
          scenario: scenarioName,
          step: step.name,
          region,
          success: step.success
        }
      });
    }
  }
}

// Example scenario
const monitor = new SyntheticMonitor(monitoring, alerting);

monitor.defineScenario('checkout-flow', {
  interval: 300000, // Run every 5 minutes
  timeout: 30000,   // 30 second timeout
  regions: ['us-east-1', 'eu-west-1', 'ap-southeast-1'],
  steps: [
    {
      name: 'load-homepage',
      type: 'http',
      url: 'https://example.com',
      expectedStatus: 200
    },
    {
      name: 'search-product',
      type: 'api',
      endpoint: '/api/search',
      method: 'GET',
      params: { q: 'test-product' },
      expectedStatus: 200,
      validateResponse: (res) => res.results.length > 0
    },
    {
      name: 'add-to-cart',
      type: 'api',
      endpoint: '/api/cart',
      method: 'POST',
      body: { productId: 'test-123', quantity: 1 },
      expectedStatus: 201
    },
    {
      name: 'checkout',
      type: 'browser',
      script: async (page) => {
        await page.goto('https://example.com/checkout');
        await page.fill('#email', 'test@example.com');
        await page.click('button[type="submit"]');
        await page.waitForSelector('.success-message');
      }
    }
  ],
  assertions: [
    {
      name: 'total-time-under-5s',
      check: (results) => {
        const totalTime = results.reduce((sum, r) => sum + r.duration, 0);
        return totalTime < 5000;
      }
    }
  ]
});
```
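The `interval` stored by `defineScenario` is not used anywhere above. A small scheduler sketch that runs each scenario on its configured interval might look like this (error handling kept minimal, and the scheduler class itself is an assumption, not part of the monitor):

```javascript
// Hypothetical scheduler: runs every defined scenario on its own interval.
class SyntheticScheduler {
  constructor(syntheticMonitor) {
    this.monitor = syntheticMonitor;
    this.timers = new Map();
  }

  start() {
    for (const [name, scenario] of this.monitor.scenarios) {
      const timer = setInterval(() => {
        this.monitor.runScenario(name).catch(err => {
          console.error(`Synthetic scenario ${name} failed to run:`, err.message);
        });
      }, scenario.interval);
      this.timers.set(name, timer);
    }
  }

  stop() {
    for (const timer of this.timers.values()) clearInterval(timer);
    this.timers.clear();
  }
}

// new SyntheticScheduler(monitor).start();
```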
Best practices cover alert fatigue prevention, dashboard design, metric naming (a small naming-helper sketch follows the dashboard generator below), and cost optimization.
Create alerts that are actionable and reduce noise:
```yaml
# Good alert example
groups:
  - name: api_alerts
    rules:
      - alert: APIHighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          ) > 0.05
        for: 5m
        labels:
          severity: critical
          team: platform
          pager: "true"
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: |
            Error rate is {{ $value | humanizePercentage }} for {{ $labels.service }}.

            Dashboard: https://grafana.example.com/d/api-errors
            Runbook: https://wiki.example.com/runbooks/api-errors
            Recent changes: https://github.com/example/{{ $labels.service }}/commits
          impact: "Users experiencing failures when using {{ $labels.service }}"
          action: |
            1. Check service logs for error details
            2. Verify upstream dependencies
            3. Consider rolling back recent deployments
            4. Scale up if load-related
```
```javascript
// Grafana dashboard as code
const dashboardConfig = {
  title: 'Service Health Overview',
  panels: [
    // Row 1: Key metrics at a glance
    {
      title: 'Service Status',
      type: 'stat',
      gridPos: { x: 0, y: 0, w: 6, h: 4 },
      targets: [{ expr: 'up{job="api"}', format: 'table' }],
      thresholds: {
        mode: 'absolute',
        steps: [
          { color: 'red', value: 0 },
          { color: 'green', value: 1 }
        ]
      }
    },
    {
      title: 'Current QPS',
      type: 'graph',
      gridPos: { x: 6, y: 0, w: 6, h: 4 },
      targets: [{
        expr: 'sum(rate(http_requests_total[1m]))',
        legendFormat: 'Requests/sec'
      }]
    },
    {
      title: 'Error Rate',
      type: 'gauge',
      gridPos: { x: 12, y: 0, w: 6, h: 4 },
      targets: [{
        expr: '100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
      }],
      thresholds: {
        mode: 'percentage',
        steps: [
          { color: 'green', value: 0 },
          { color: 'yellow', value: 1 },
          { color: 'red', value: 5 }
        ]
      }
    },
    {
      title: 'P95 Latency',
      type: 'stat',
      gridPos: { x: 18, y: 0, w: 6, h: 4 },
      targets: [{
        expr: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))'
      }],
      unit: 's',
      thresholds: {
        mode: 'absolute',
        steps: [
          { color: 'green', value: 0 },
          { color: 'yellow', value: 0.5 },
          { color: 'red', value: 1 }
        ]
      }
    },

    // Row 2: Detailed views
    {
      title: 'Request Rate by Endpoint',
      type: 'graph',
      gridPos: { x: 0, y: 4, w: 12, h: 8 },
      targets: [{
        expr: 'sum(rate(http_requests_total[5m])) by (handler)',
        legendFormat: '{{ handler }}'
      }]
    },
    {
      title: 'Latency Distribution',
      type: 'heatmap',
      gridPos: { x: 12, y: 4, w: 12, h: 8 },
      targets: [{
        expr: 'sum(rate(http_request_duration_seconds_bucket[5m])) by (le)',
        format: 'heatmap'
      }]
    }
  ]
};
```
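A dashboard defined as code still has to reach Grafana; one common route is the dashboard HTTP API. A hedged sketch, assuming Node 18+ (for the global fetch) and a Grafana base URL plus API token supplied via environment variables (the variable names are illustrative):

```javascript
// Push the dashboard definition to Grafana's dashboard HTTP API.
// GRAFANA_URL and GRAFANA_TOKEN are assumed environment variables.
async function provisionDashboard(dashboard) {
  const response = await fetch(`${process.env.GRAFANA_URL}/api/dashboards/db`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.GRAFANA_TOKEN}`
    },
    body: JSON.stringify({ dashboard, overwrite: true })
  });

  if (!response.ok) {
    throw new Error(`Dashboard provisioning failed: ${response.status} ${await response.text()}`);
  }
  return response.json();
}

// provisionDashboard(dashboardConfig);
```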
Automate monitoring tasks to reduce toil:
```javascript
// Automated dashboard generation
class DashboardGenerator {
  generateServiceDashboard(serviceName) {
    return {
      title: `${serviceName} Service Dashboard`,
      uid: `${serviceName}-overview`,
      panels: [
        // generateREDPanel returns an array of panels, so spread it in
        ...this.generateREDPanel(serviceName),
        this.generateResourcePanel(serviceName),
        this.generateDependencyPanel(serviceName)
      ],
      templating: {
        list: [{
          name: 'namespace',
          type: 'query',
          query: 'label_values(namespace)'
        }]
      }
    };
  }

  generateREDPanel(service) {
    // Rate, Errors, Duration panels
    return [
      {
        title: 'Request Rate',
        targets: [{ expr: `sum(rate(http_requests_total{service="${service}"}[5m]))` }]
      },
      {
        title: 'Error Rate',
        targets: [{ expr: `sum(rate(http_requests_total{service="${service}",status=~"5.."}[5m]))` }]
      },
      {
        title: 'Duration (P50/P95/P99)',
        targets: [
          {
            expr: `histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket{service="${service}"}[5m])) by (le))`,
            legendFormat: 'P50'
          },
          {
            expr: `histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service="${service}"}[5m])) by (le))`,
            legendFormat: 'P95'
          },
          {
            expr: `histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="${service}"}[5m])) by (le))`,
            legendFormat: 'P99'
          }
        ]
      }
    ];
  }
}
```
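For the metric-naming practice referenced above, one lightweight option is a small helper that enforces Prometheus-style names (snake_case, base units as a suffix, `_total` for counters). The exact convention here is a sketch, not a mandated standard:

```javascript
// Build metric names of the form <namespace>_<subsystem>_<name>_<unit>[_total].
// Follows common Prometheus conventions: snake_case, base units, _total for counters.
function metricName({ namespace, subsystem, name, unit, isCounter = false }) {
  const parts = [namespace, subsystem, name, unit]
    .filter(Boolean)
    .map(p => p.toLowerCase().replace(/[^a-z0-9]+/g, '_'));
  const base = parts.join('_');
  return isCounter ? `${base}_total` : base;
}

console.log(metricName({ namespace: 'shop', subsystem: 'checkout', name: 'request_duration', unit: 'seconds' }));
// -> shop_checkout_request_duration_seconds
console.log(metricName({ namespace: 'shop', subsystem: 'checkout', name: 'requests', isCounter: true }));
// -> shop_checkout_requests_total
```

Consistent names keep dashboards and alert expressions portable across services and also make it easier to spot high-cardinality metrics when optimizing cost.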
Master monitoring by alerting on symptoms, tracking the golden signals against SLOs, predicting issues before they surface, and tying every technical metric back to user and business impact.