Monitoring Patterns

Your checkout API started throwing 500s at 2am. The on-call dashboard was green the whole time, because you were alerting on CPU and memory, not on the user-facing error rate. By the time a customer tweeted, you’d lost an hour of orders. The fix is not “more metrics” — it’s the right alerts, written quickly, and reviewed for the failure modes that page you at 2am for nothing.

This recipe is about using an AI coding tool to do exactly that: turn a one-line spec into Prometheus rules, audit existing CloudWatch alarms for alert-fatigue traps, and scaffold the dashboards and OpenTelemetry wiring that make an alert actionable. The monitoring concepts (four golden signals, SLO burn rates) are standard SRE practice — the leverage is in how fast you can generate and verify them.

  • A reusable prompt that turns a golden-signals spec into a drop-in rules.yml for Prometheus
  • A prompt that turns an SLO target into a multi-window, multi-burn-rate alert (the pattern that pages on real budget burn, not noise)
  • A review prompt that finds alert-fatigue antipatterns in existing alarms (missing for:, no OK transition, histogram_quantile on un-bucketed metrics)
  • The three-tool split: when to use Cursor’s agent mode, when to script Claude Code headless in CI, and when to let Codex Cloud open the PR
  • A before/after on wiring the Sentry and Grafana MCP servers so the agent reasons over live incident data, not guesses

Don’t hand-write PromQL from memory. Describe the signal in plain language and let the model produce the rule, then read it critically. The four golden signals (latency, traffic, errors, saturation) are the right default starting set for any HTTP service.

A good answer looks like this — note the for: on every rule and the symptom-based error expression rather than alerting on raw counts:

groups:
  - name: checkout-api-golden-signals
    rules:
      - alert: CheckoutHighLatencyP95
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{service="checkout-api"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels: { severity: warning }
        annotations:
          summary: "p95 latency above 500ms for checkout-api"
          runbook_url: "https://runbooks.example.com/checkout-latency"
      - alert: CheckoutHighErrorRate
        expr: |
          sum(rate(http_requests_total{service="checkout-api",status=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="checkout-api"}[5m])) > 0.05
        for: 5m
        labels: { severity: critical }
        annotations:
          summary: "5xx error ratio above 5% for checkout-api"
          runbook_url: "https://runbooks.example.com/checkout-errors"

Then evaluate the output before trusting it. Two checks catch most bad generations: does histogram_quantile operate on a _bucket metric with le in the grouping (otherwise it yields NaN or no series at all, and the alert silently never fires), and does every alert have a for: window so a single scrape blip doesn’t page anyone.
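Those two checks are mechanical enough to script. The following is a minimal sketch, assuming the rules file has already been parsed into plain objects by a YAML parser of your choice; the function name lintAlertRules and the issue wording are hypothetical, and the le check is a text heuristic rather than a real PromQL parse.

```javascript
// Heuristic lint for Prometheus alerting rules that are already parsed into
// plain objects. lintAlertRules is a hypothetical helper, not part of promtool.
function lintAlertRules(groups) {
  const issues = [];
  for (const group of groups) {
    for (const rule of group.rules ?? []) {
      if (!rule.alert) continue; // recording rules have no `alert` key
      const expr = String(rule.expr ?? '');

      // Check 1: every alert needs a `for:` window.
      if (!rule.for) {
        issues.push(`${rule.alert}: missing for:`);
      }

      // Check 2: histogram_quantile needs a _bucket metric, and any
      // aggregation around it must keep the `le` label.
      if (expr.includes('histogram_quantile')) {
        if (!/_bucket\b/.test(expr)) {
          issues.push(`${rule.alert}: histogram_quantile without a _bucket metric`);
        }
        const aggregates = /\b(sum|avg|min|max|count)\b/.test(expr);
        const keepsLe = [...expr.matchAll(/\bby\s*\(([^)]*)\)/g)]
          .some((m) => m[1].split(',').map((s) => s.trim()).includes('le'));
        if (aggregates && !keepsLe) {
          issues.push(`${rule.alert}: aggregation drops the le label`);
        }
      }
    }
  }
  return issues;
}

// Usage: one deliberately broken rule (no for:, no _bucket, no `by (le)`).
const sampleGroups = [{
  name: 'checkout-api-golden-signals',
  rules: [{
    alert: 'CheckoutHighLatencyP95',
    expr: 'histogram_quantile(0.95, sum(rate(http_request_duration_seconds{service="checkout-api"}[5m]))) > 0.5',
  }],
}];
console.log(lintAlertRules(sampleGroups));
```

A script like this complements promtool rather than replacing it: promtool validates syntax, while these checks look for rules that are syntactically valid but will never fire or will fire on noise.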

CPU and latency thresholds tell you something is off; an SLO burn-rate alert tells you you’re spending error budget faster than you can afford. The multi-window, multi-burn-rate pattern (from Google’s SRE workbook) fires only when both a short and a long window agree, which is what keeps it quiet during transient spikes.
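As a rough illustration of what a generated burn-rate alert should contain, the sketch below builds the PromQL expression for one window pair. It assumes the http_requests_total metric and service label from the golden-signal rules, and it uses the burn rates and window pairs the SRE workbook recommends for a 30-day SLO (14.4 over 1h and 5m, 6 over 6h and 30m). The helper names errorRatio and burnRateExpr are hypothetical.

```javascript
// 5xx error ratio over a window, using the metric and label names
// from the golden-signal rules (an assumption about your instrumentation).
function errorRatio(service, window) {
  const sel = `service="${service}"`;
  return `sum(rate(http_requests_total{${sel},status=~"5.."}[${window}]))` +
    ` / sum(rate(http_requests_total{${sel}}[${window}]))`;
}

// Fires only when BOTH the long and the short window exceed
// burnRate * error budget, where error budget = 1 - SLO target.
// The arithmetic is left in the PromQL text so it stays readable in review.
function burnRateExpr({ service, slo, burnRate, longWindow, shortWindow }) {
  const threshold = `(${burnRate} * (1 - ${slo}))`;
  return `(${errorRatio(service, longWindow)} > ${threshold})` +
    `\nand\n` +
    `(${errorRatio(service, shortWindow)} > ${threshold})`;
}

// Paging window pairs suggested by the SRE workbook for a 30-day SLO.
const PAGE_WINDOWS = [
  { burnRate: 14.4, longWindow: '1h', shortWindow: '5m' },
  { burnRate: 6, longWindow: '6h', shortWindow: '30m' },
];

// Usage: print both paging expressions for a 99.9% availability SLO.
for (const w of PAGE_WINDOWS) {
  console.log(burnRateExpr({ service: 'checkout-api', slo: 0.999, ...w }));
  console.log('---');
}
```

When reviewing model output, the thing to look for is the same shape: two error ratios over different windows, joined by and, compared against the same multiple of the error budget.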

Each tool earns its place differently for monitoring work, so the workflow is not identical across tools.

Use agent mode to edit alert rules in place. Open monitoring/rules.yml, select the rule file as context, and ask the agent to add the SLO burn-rate block alongside the existing golden-signal rules — it edits inline so you can diff every change against what’s already there. Cursor’s checkpoints let you accept the latency rule but roll back a noisy traffic rule without losing the rest.

Keep promtool in the loop: after the agent writes rules, run promtool check rules monitoring/rules.yml in Cursor’s terminal and paste failures back. The model fixes PromQL syntax and for:/labels: shape far faster from the validator’s exact error than from a vague “this looks wrong.”

Wiring the Alert to Real Data: MCP and Skills

A rule is only as good as the runbook behind it. Two extensibility paths make the agent reason over live incident data instead of inventing plausible-looking thresholds.

MCP servers give the agent a persistent, live connection. With the Sentry MCP (remote, hosted) connected, you can ask “draft an alert for the error spike Sentry saw on checkout-api this week” and the model pulls the actual issue, its frequency, and the linked release — so the threshold matches reality. The Grafana MCP lets it read your existing dashboards and datasources before proposing new panels, instead of duplicating ones you already have.

Add the remote Sentry MCP. The server URL is identical across all three tools; only the place you register it differs:

Add to .cursor/mcp.json:

{
  "mcpServers": {
    "sentry": {
      "url": "https://mcp.sentry.dev/mcp"
    }
  }
}

Agent Skills are the lighter-weight option when you don’t need a live connection — just reusable expertise on how to write good monitoring config. Skills install via one universal CLI that works across Cursor, Claude Code, and Codex:

npx skills add wshobson/agents

That repo ships prometheus-configuration and grafana-dashboards skills (browse them on skills.sh). The trade-off is simple: a skill is a single-purpose augmentation that teaches the agent a pattern (great for “write idiomatic PromQL”); an MCP server is a persistent connection to live data (necessary for “what is actually firing right now”). For drafting rules from scratch, a skill is enough. For tuning thresholds against production, you want the MCP.

A lot of monitoring code in real repos is years old and won’t run on current SDKs. AI tools are excellent at the mechanical migration — point them at the dead idiom and let them rewrite it.

Two migrations come up constantly. AWS SDK v2 reached end-of-support on September 8, 2025 — the global AWS namespace, new AWS.CloudWatch(), and .promise() are end-of-life and should not ship. And the OpenTelemetry JS metrics SDK moved: @opentelemetry/metrics was abandoned at 0.24.0, replaced by @opentelemetry/sdk-metrics (2.x), and MeterProvider.addMetricReader() was removed in SDK 2.0 in favor of passing readers via the constructor.
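Before handing files to the model, it can help to know where the dead idioms live. The sketch below is a plain regex scan over source text for the patterns named above; the pattern list and the findLegacyIdioms name are hypothetical, and a regex scan will miss aliased imports or unusual formatting, so treat the result as a starting list rather than proof of absence.

```javascript
// Regexes for the legacy idioms named above. No `g` flag, so .test()
// carries no state between lines.
const LEGACY_PATTERNS = [
  { name: 'aws-sdk v2 import', pattern: /require\(\s*['"]aws-sdk['"]\s*\)|from\s+['"]aws-sdk['"]/ },
  { name: 'global AWS.CloudWatch client', pattern: /new\s+AWS\.CloudWatch\s*\(/ },
  { name: '.promise() call', pattern: /\.promise\s*\(\s*\)/ },
  { name: '@opentelemetry/metrics package', pattern: /['"]@opentelemetry\/metrics['"]/ },
  { name: 'addMetricReader() call', pattern: /\.addMetricReader\s*\(/ },
];

// Returns one { line, name } entry per legacy idiom found in the source text.
function findLegacyIdioms(source) {
  const hits = [];
  source.split('\n').forEach((text, index) => {
    for (const { name, pattern } of LEGACY_PATTERNS) {
      if (pattern.test(text)) hits.push({ line: index + 1, name });
    }
  });
  return hits;
}

// Usage: a v2-style snippet held in a string, not executed.
const legacySnippet = [
  "const AWS = require('aws-sdk');",
  'const cw = new AWS.CloudWatch();',
  'await cw.putMetricAlarm(params).promise();',
].join('\n');
console.log(findLegacyIdioms(legacySnippet));
```

The hit list doubles as context for the migration prompt: naming the exact lines and idioms tends to produce a tighter rewrite than asking for a general upgrade.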

What the migrated code should look like — the v3 CloudWatch client and the OTel reader passed via the constructor:

const {
  CloudWatchClient,
  PutMetricAlarmCommand,
} = require('@aws-sdk/client-cloudwatch');
const { MeterProvider } = require('@opentelemetry/sdk-metrics');
const { PrometheusExporter } = require('@opentelemetry/exporter-prometheus');

const cw = new CloudWatchClient({});

// CommonJS has no top-level await, so the alarm call lives in an async function
async function putHighErrorAlarm() {
  await cw.send(new PutMetricAlarmCommand({
    AlarmName: 'checkout-api-high-errors',
    Namespace: 'AWS/ApplicationELB',
    MetricName: 'HTTPCode_Target_5XX_Count',
    // ALB metrics are published per load balancer: add a Dimensions entry
    // for your LoadBalancer, or the alarm will have no data to evaluate
    Statistic: 'Sum',
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: 10,
    ComparisonOperator: 'GreaterThanThreshold',
    // OKActions matters: without it nobody is notified when the alarm recovers
    AlarmActions: [process.env.SNS_TOPIC_ARN],
    OKActions: [process.env.SNS_TOPIC_ARN],
  }));
}

// PrometheusExporter is a MetricReader: pass it via `readers`, not addMetricReader()
const meterProvider = new MeterProvider({
  readers: [new PrometheusExporter({ port: 9090 })],
});
const meter = meterProvider.getMeter('checkout-api');

putHighErrorAlarm().catch((err) => {
  console.error('failed to create alarm', err);
  process.exitCode = 1;
});

After any AI-generated migration, run the install commands it suggested and let the type checker run. The model will occasionally keep a v2-only option name; the compiler catches those instantly, and feeding the error back gets a correct fix in one round.