Monitoring Patterns

Your checkout API started throwing 500s at 2am. The on-call dashboard was green the whole time, because you were alerting on CPU and memory, not on the user-facing error rate. By the time a customer tweeted, you’d lost an hour of orders. The fix is not “more metrics” — it’s the right alerts, written quickly, and reviewed for the failure modes that page you at 2am for nothing.

This recipe is about using an AI coding tool to do exactly that: turn a one-line spec into Prometheus rules, audit existing CloudWatch alarms for alert-fatigue traps, and scaffold the dashboards and OpenTelemetry wiring that make an alert actionable. The monitoring concepts (four golden signals, SLO burn rates) are standard SRE practice — the leverage is in how fast you can generate and verify them.

What You’ll Walk Away With

A reusable prompt that turns a golden-signals spec into a drop-in rules.yml for Prometheus
A prompt that turns an SLO target into a multi-window, multi-burn-rate alert (the pattern that pages on real budget burn, not noise)
A review prompt that finds alert-fatigue antipatterns in existing alarms (missing for:, no OK transition, histogram_quantile on un-bucketed metrics)
The three-tool split: when to use Cursor’s agent mode, when to script Claude Code headless in CI, and when to let Codex Cloud open the PR
A before/after on wiring the Sentry and Grafana MCP servers so the agent reasons over live incident data, not guesses

The Workflow: Spec to Alert Rules

Don’t hand-write PromQL from memory. Describe the signal in plain language and let the model produce the rule, then read it critically. The four golden signals (latency, traffic, errors, saturation) are the right default starting set for any HTTP service.

Copy-paste prompt for golden-signal alert rules:

Generate Prometheus alerting rules for the four golden signals for my
Express service "checkout-api":
- p95 latency > 500ms for 5m (warning)
- 5xx error ratio > 5% for 5m (critical)
- request rate drops below 50% of the 1h average for 10m (warning)
- CPU > 80% for 10m (warning)

Use http_request_duration_seconds_bucket and http_requests_total with a
`service` label, and node_cpu_seconds_total for CPU. Every rule must have a
`for:` window, a severity label, and an annotation with a summary and a
runbook_url placeholder. Output a single rules.yml I can drop into my
Prometheus config — no prose.

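A good answer looks like this (an excerpt showing the latency and error rules; the traffic and CPU rules follow the same shape). Note the for: on every rule and the ratio-based error expression, which alerts on the symptom users feel rather than on raw counts: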

groups:
  - name: checkout-api-golden-signals
    rules:
      - alert: CheckoutHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "p95 latency above 500ms for checkout-api"
          runbook_url: "https://runbooks.example.com/checkout-latency"

      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="checkout-api"}[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "5xx error ratio above 5% for checkout-api"
          runbook_url: "https://runbooks.example.com/checkout-errors"

Then evaluate the output before trusting it. Two checks catch most bad generations. First, does histogram_quantile operate on a _bucket metric with le in the grouping? If not, the query raises no error; it just returns NaN or an empty result. Second, does every alert have a for: window, so a single scrape blip doesn’t page anyone?
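
The two checks above can be automated as a rough pre-flight before promtool. The following is a minimal sketch, not a YAML parser: it assumes every rule starts with a "- alert:" line, that grouping uses by (...) rather than without (...), and the function name lintRules and the sample rules are hypothetical.

```javascript
// Text-based lint for a Prometheus rules file. A sketch under the assumptions
// stated above; promtool remains the authoritative syntax check.
function lintRules(rulesText) {
  const issues = [];
  // Split the file into chunks that each begin at an "- alert:" line.
  const blocks = rulesText
    .split(/^(?=[ \t]*- alert:)/m)
    .filter((block) => /^[ \t]*- alert:/.test(block));

  for (const block of blocks) {
    const match = block.match(/- alert:[ \t]*(\S+)/);
    const name = match ? match[1] : '(unnamed)';

    // Check 1: every alert needs a for: window.
    if (!/^[ \t]*for:[ \t]*\S/m.test(block)) {
      issues.push(`${name}: missing for: window`);
    }

    // Check 2: histogram_quantile needs a _bucket series grouped by le.
    if (block.includes('histogram_quantile')) {
      if (!block.includes('_bucket')) {
        issues.push(`${name}: histogram_quantile on a non-_bucket metric`);
      }
      const groupsByLe = [...block.matchAll(/\bby\s*\(([^)]*)\)/g)].some((m) =>
        m[1].split(',').map((label) => label.trim()).includes('le')
      );
      if (!groupsByLe) {
        issues.push(`${name}: histogram_quantile without le in by (...)`);
      }
    }
  }
  return issues;
}

// Hypothetical sample: one healthy rule and one that fails both checks.
const sampleRules = `
groups:
  - name: demo
    rules:
      - alert: GoodLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
      - alert: BadLatency
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_sum[5m])) by (service)) > 0.5
`;

console.log(lintRules(sampleRules));
```

Only BadLatency is reported here, once for each problem it has. A text scan like this can miss or misreport rules written in an unusual layout, so treat a clean result as a hint rather than proof.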

SLO burn-rate alerts

CPU and latency thresholds tell you something is off; an SLO burn-rate alert tells you you’re spending error budget faster than you can afford. The multi-window, multi-burn-rate pattern (from Google’s SRE workbook) fires only when both a short and a long window agree, which is what keeps it quiet during transient spikes.

Copy-paste prompt for SLO burn-rate alerts:

I run a 99.9% availability SLO over 30 days for "checkout-api", where a good
request is any non-5xx response. Generate Prometheus multi-window,
multi-burn-rate alert rules using the Google SRE workbook pattern: a fast-burn
page (1h + 5m windows, 14.4x burn rate), a slower-burn page (6h + 30m windows,
6x burn rate), and a ticket (3d + 6h windows, 1x burn rate). Use
http_requests_total with a status label. Output
rules.yml only, with a comment above each rule explaining the burn rate it
catches.
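
Before trusting the generated rules, it helps to know what numbers to expect. A burn rate is a multiple of the error budget (1 minus the SLO target), so the alert threshold on the error ratio is the burn rate times the budget. The sketch below works that arithmetic for the 99.9% target in the prompt; the function names are hypothetical.

```javascript
// Error-ratio threshold that corresponds to a given burn rate.
function burnRateThreshold(sloTarget, burnRate) {
  const errorBudget = 1 - sloTarget; // 0.001 for a 99.9% SLO
  return burnRate * errorBudget;
}

// Fraction of the whole SLO period's budget spent if the burn rate
// is sustained for the full alert window.
function budgetConsumed(burnRate, windowHours, sloPeriodDays) {
  return (burnRate * windowHours) / (sloPeriodDays * 24);
}

const sloTarget = 0.999;
const sloPeriodDays = 30;

for (const { name, burnRate, windowHours } of [
  { name: 'fast burn', burnRate: 14.4, windowHours: 1 },
  { name: 'slow burn', burnRate: 6, windowHours: 6 },
]) {
  const threshold = burnRateThreshold(sloTarget, burnRate);
  const consumed = budgetConsumed(burnRate, windowHours, sloPeriodDays);
  console.log(
    `${name}: error ratio > ${threshold.toFixed(4)}, ` +
      `${(consumed * 100).toFixed(1)}% of budget per window`
  );
}
```

For this SLO the fast-burn threshold works out to an error ratio of about 0.0144 (2% of the 30-day budget in one hour) and the slow-burn threshold to about 0.006 (5% of the budget in six hours). If the constants in a generated rule differ from these for the same target, the rule was derived for a different SLO.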

Three Tools, Three Workflows

The three tools fit different parts of monitoring work, so the workflows below are not interchangeable.

In Cursor, use agent mode to edit alert rules in place. Open monitoring/rules.yml, select the rule file as context, and ask the agent to add the SLO burn-rate block alongside the existing golden-signal rules. It edits inline, so you can diff every change against what’s already there. Cursor’s checkpoints let you accept the latency rule but roll back a noisy traffic rule without losing the rest.

Keep promtool in the loop: after the agent writes rules, run promtool check rules monitoring/rules.yml in Cursor’s terminal and paste failures back. The model fixes PromQL syntax and for:/labels: shape far faster from the validator’s exact error than from a vague “this looks wrong.”

Claude Code shines headless in CI as a lint gate on your alert rules. Run it non-interactively with -p so it reviews changed rule files on every PR:

claude -p "Review the alert rules in monitoring/rules.yml for alert-fatigue \
antipatterns: any rule missing a 'for:' window, histogram_quantile used on a \
non-_bucket metric, or a severity:critical alert with no runbook_url \
annotation. List each issue as 'file:line - problem'. Report only \
problems, no praise." \
  --allowedTools "Read,Grep" --output-format json

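Pipe the JSON into your CI step and fail the build if issues are returned. The command’s exit status reflects whether the run completed, not whether it found problems, so the CI step has to inspect the reply text itself. Because the review only needs Read and Grep, it’s safe to run unattended: no edits, no shell. This is the workflow that catches the missing for: before it pages someone.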
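
One way to turn that JSON into a pass/fail decision is sketched below. It assumes the CLI's JSON object carries the model's reply in a result string and that the model followed the file:line format requested in the prompt; both are assumptions to verify against your CLI version. The function name findIssues and the sample output are hypothetical.

```javascript
// Extract issue lines from the CLI's JSON output.
// Assumption: the reply text lives in a top-level `result` string.
function findIssues(cliJson) {
  const payload = JSON.parse(cliJson);
  const text = typeof payload.result === 'string' ? payload.result : '';
  // Match lines shaped like "path/file.yml:12 - problem". The separator may be
  // a hyphen, an en dash, or an em dash, and a list bullet may come first.
  const issueLine = /^(?:[-*]\s+)?\S+:\d+\s+[\u2014\u2013-]\s+\S/;
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => issueLine.test(line));
}

// Hypothetical output standing in for a real CLI run.
const sampleOutput = JSON.stringify({
  result: [
    'monitoring/rules.yml:12 - CheckoutTrafficDrop is missing a for: window',
    'monitoring/rules.yml:40 \u2014 critical alert has no runbook_url',
    'No other problems found.',
  ].join('\n'),
});

const issues = findIssues(sampleOutput);
console.log(issues.length > 0 ? 'fail the build' : 'pass');
```

In a real CI step you would read the CLI output from stdin or a file and set process.exitCode = 1 when the list is non-empty. Because the model's formatting is not guaranteed, it is worth logging the full reply as well, so a review that ignored the format is not silently counted as a pass.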

Use Codex Cloud to open a PR that adds monitoring you don’t want to write interactively. Point it at the repo with a task like “add SLO burn-rate alerts for checkout-api following the existing rules in monitoring/rules.yml, plus an Alertmanager route that sends severity: critical to PagerDuty.” Codex works in its own cloud worktree, runs promtool if it’s in your toolchain, and returns a reviewable PR with the diff and its reasoning.

For local work, the Codex CLI follows the same prompts above. Run codex --sandbox workspace-write -c approval_policy=on-request to keep writes inside the workspace and allow Codex to request approval before a command needs more access. Review the resulting diff; this policy does not require approval for every in-sandbox file edit.

Wiring the Alert to Real Data: MCP and Skills

A rule is only as good as the runbook behind it. Two extensibility paths make the agent reason over live incident data instead of inventing plausible-looking thresholds.

MCP servers give the agent a persistent, live connection. With the Sentry MCP (remote, hosted) connected, you can ask “draft an alert for the error spike Sentry saw on checkout-api this week” and the model pulls the actual issue, its frequency, and the linked release — so the threshold matches reality. The Grafana MCP lets it read your existing dashboards and datasources before proposing new panels, instead of duplicating ones you already have.

Add the remote Sentry MCP. The server URL is the same for all three tools; only where you register it differs:

Add to .cursor/mcp.json:

{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp"
    }
  }
}

For Claude Code, run:

claude mcp add --transport http sentry https://mcp.sentry.dev/mcp

Add to ~/.codex/config.toml:

[mcp_servers.sentry]
url = "https://mcp.sentry.dev/mcp"

Agent Skills are the lighter-weight option when you don’t need a live connection — just reusable expertise on how to write good monitoring config. Skills install via one universal CLI that works across Cursor, Claude Code, and Codex:

npx skills add wshobson/agents

That repo ships prometheus-configuration and grafana-dashboards skills (browse them on skills.sh). The trade-off is simple: a skill is a single-purpose augmentation that teaches the agent a pattern (great for “write idiomatic PromQL”); an MCP server is a persistent connection to live data (necessary for “what is actually firing right now”). For drafting rules from scratch, a skill is enough. For tuning thresholds against production, you want the MCP.

Modernizing Older Monitoring Code

A lot of monitoring code in real repos is years old and won’t run on current SDKs. AI tools are excellent at the mechanical migration — point them at the dead idiom and let them rewrite it.

Two migrations come up constantly. AWS SDK v2 reached end-of-support on September 8, 2025 — the global AWS namespace, new AWS.CloudWatch(), and .promise() are end-of-life and should not ship. And the OpenTelemetry JS metrics SDK moved: @opentelemetry/metrics was abandoned at 0.24.0, replaced by @opentelemetry/sdk-metrics (2.x), and MeterProvider.addMetricReader() was removed in SDK 2.0 in favor of passing readers via the constructor.
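
To find which files need the migration before prompting, a plain pattern scan over the source is usually enough. The sketch below flags the dead idioms named above, line by line. The pattern list and the name findDeadIdioms are hypothetical, and the .promise() pattern in particular can match unrelated libraries, so treat hits as candidates to review.

```javascript
// Patterns for the v2-era idioms described above (a heuristic list, not exhaustive).
const DEAD_IDIOMS = [
  { name: 'AWS SDK v2 import', pattern: /require\(['"]aws-sdk['"]\)|from ['"]aws-sdk['"]/ },
  { name: 'global AWS namespace client', pattern: /\bnew AWS\.\w+\(/ },
  { name: '.promise() call', pattern: /\.promise\(\)/ },
  { name: '@opentelemetry/metrics import', pattern: /['"]@opentelemetry\/metrics['"]/ },
  { name: 'addMetricReader() call', pattern: /\.addMetricReader\(/ },
];

// Return one hit per (line, idiom) pair, with 1-based line numbers.
function findDeadIdioms(source) {
  const hits = [];
  source.split('\n').forEach((line, index) => {
    for (const { name, pattern } of DEAD_IDIOMS) {
      if (pattern.test(line)) hits.push({ line: index + 1, name });
    }
  });
  return hits;
}

// Hypothetical legacy module used only to exercise the scan.
const legacySource = [
  "const AWS = require('aws-sdk');",
  'const cw = new AWS.CloudWatch();',
  'await cw.putMetricAlarm(params).promise();',
  'meterProvider.addMetricReader(reader);',
].join('\n');

console.log(findDeadIdioms(legacySource));
```

Pasting the hit list into the migration prompt gives the model exact locations to rewrite instead of leaving it to rediscover them.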

Copy-paste prompt to migrate CloudWatch + OpenTelemetry code:

Migrate this monitoring module to current SDKs, keeping behavior identical:
1. AWS SDK v2 -> v3 modular clients: replace require('aws-sdk') and
   new AWS.CloudWatch() with @aws-sdk/client-cloudwatch
   (CloudWatchClient, PutMetricAlarmCommand, PutMetricDataCommand);
   drop .promise() since send() already returns a promise.
2. OpenTelemetry: replace @opentelemetry/metrics with
   @opentelemetry/sdk-metrics, and instead of meterProvider.addMetricReader(),
   pass readers via `new MeterProvider({ readers: [reader] })`.
Output the migrated file plus the exact npm install/uninstall commands.

What the migrated code should look like — the v3 CloudWatch client and the OTel reader passed via the constructor:

const {
  CloudWatchClient,
  PutMetricAlarmCommand,
} = require('@aws-sdk/client-cloudwatch');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } =
  require('@opentelemetry/exporter-prometheus');

const cw = new CloudWatchClient({});

// CommonJS has no top-level await, so the alarm setup lives in an async function.
async function putHighErrorAlarm() {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: 'checkout-api-high-errors',
    Namespace: 'AWS/ApplicationELB',
    MetricName: 'HTTPCode_Target_5XX_Count',
    // This metric is published per load balancer, so a real alarm also needs
    // Dimensions (for example, LoadBalancer) to match any data.
    Statistic: 'Sum',
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: 10,
    ComparisonOperator: 'GreaterThanThreshold',
    // OK transition matters: without OKActions nobody is notified on recovery
    AlarmActions: [process.env.SNS_TOPIC_ARN],
    OKActions: [process.env.SNS_TOPIC_ARN],
  }));
}

// PrometheusExporter is a MetricReader: pass it via `readers`, not addMetricReader()
const meterProvider = new MeterProvider({
  readers: [new PrometheusExporter({ port: 9090 })],
});
const meter = meterProvider.getMeter('checkout-api');

putHighErrorAlarm().catch((err) => {
  console.error('failed to create alarm', err);
  process.exitCode = 1;
});

After any AI-generated migration, run the install commands it suggested, then run the type checker (tsc, or checkJs for plain JavaScript) and your tests. The model will occasionally keep a v2-only option name; the type checker flags most of those immediately, and feeding the error back usually gets a correct fix in one round.

When This Breaks

Failure modes that bite even when the prompt was good:

Alert storms from a missing for: window. A rule with no for: fires on a single bad scrape. If on-call is drowning, grep your rules for alerts that have an expr but no for: — that’s almost always the culprit.
histogram_quantile on an un-bucketed metric. If a latency panel reads NaN, the query is running histogram_quantile over a metric that isn’t a _bucket series, or le is missing from the by (...) clause. Models get this wrong when they don’t know your metric is a histogram.
CloudWatch alarms with no OK transition. Without OKActions, an alarm fires once and never tells you it recovered, so it looks “stuck firing.” Always pair AlarmActions with OKActions.
The model invents thresholds. Without live data (no Sentry/Grafana MCP), an AI will produce confident-looking numbers that don’t match your traffic. Treat generated thresholds as a starting point and tune against real percentiles before they page anyone.
Burn-rate constants copied from the wrong SLO. If the generated rule’s windows or multipliers don’t line up with your stated target, the page is either deaf or constant. Re-derive one constant by hand to confirm.
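
For the invented-thresholds failure mode, a quick sanity check is to compare the generated number against percentiles of real measurements. The sketch below computes a nearest-rank percentile over a list of latency samples. The sample values are made up for illustration, and in production you would normally read percentiles from Prometheus itself rather than from an exported list.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}

// Hypothetical latency samples in milliseconds.
const latenciesMs = [120, 95, 310, 180, 150, 640, 210, 130, 170, 140];

console.log('p50:', percentile(latenciesMs, 50));
console.log('p95:', percentile(latenciesMs, 95));
```

If the p95 of real traffic already sits above a generated 500ms threshold, the alert would fire constantly; if it sits far below, the alert would stay silent until the service is badly degraded.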

What’s Next

Debugging Patterns — turn the alert that just fired into a root cause
Logging Patterns — correlate logs with the metrics that triggered the alert
Recovery Patterns — automate the runbook your alert links to