Cost monitoring — alerts, hard caps, and cheap-model routing

Q21 · Operations & hygiene How do you monitor token usage / agent cost?

Max-score answer: “I have alerts / hard caps and deliberately use cheaper models for expensive loops.”

Why it matters: Without monitoring you can’t reason about model routing or session length. The cheapest cost-control lever is visibility.

Why this matters in 2026

The FinOps Foundation’s 2026 survey shows AI-cost management has become a mainstream concern, but a survey headline does not determine an individual developer’s risk. Agent cost varies with model, context, tool calls, retries, and parallelism; measure representative sessions and configure limits rather than relying on unsourced per-task or industry-wide loss estimates.

For an individual developer the math is more brutal, not less. A Claude Code session can chew through $30–60 in Opus 4.8 tokens on a single afternoon if you wander into a multi-hour debug loop. A “small” subagent fan-out — five parallel reviewers reading a 500-file repo — costs cents on Haiku 4.5 and $40+ on Opus. A misconfigured cron agent that loops on an error and keeps re-prompting itself can vaporise a month’s budget overnight.

The cheapest control lever is visibility. You cannot reason about model routing (Q3), session length, or which subagents to push to Haiku if you don’t know what each loop costs. Cost monitoring is not a finance task — it’s the feedback loop that makes every other operational decision honest. Without it you’re guessing, and the bill will tell you you guessed wrong four weeks later.

What “max score” actually looks like

You get full marks on Q21 only when all four are true:

You have a dashboard you actually look at. Not “there is a usage page in the Console” — you’ve bookmarked it, check it weekly, and can tell me from memory roughly what you spent last month and which model dominated. For Claude Code that’s the Anthropic Console plus a local tool like ccusage. For Cursor it’s the per-user dashboard. For raw API or OpenRouter it’s the provider console plus per-key spend filters.
You have hard caps configured at the provider. Not “I’ll watch it” — an actual workspace spend limit in the Anthropic Console (or OpenAI usage hard limit, or Cursor team spend cap) that cuts you off before a runaway loop drains the card. The point of a hard cap is that it fires when you’re asleep.
You have alerts that fire before the cap. Soft alerts at 50% and 80% of your daily or weekly budget — email, Slack, or webhook — so you adjust before hitting the wall. Setting a hard cap with no alerting means your first signal of trouble is your agent grinding to a halt mid-task.
You route expensive loops to cheaper models on purpose. Five parallel code-reviewer subagents run on Haiku 4.5, not Opus. A 200-file codemod runs on Sonnet or Haiku. You don’t run “smart” models on dumb loops, and you can name two loops last week where you specifically chose Haiku to save money.

Anything less — “I check the bill at month-end”, “I have a feeling Opus is expensive”, “I keep meaning to set a limit” — is mid-tier on Q21.

Current landscape (web-search-verified)

Anthropic Console usage view

The Anthropic Console (console.anthropic.com) is the primary surface for anyone on the Claude API or a paid Claude Code plan. The usage view breaks spend down by workspace, API key, model, and day — exactly the granularity you need to spot when one subagent is doing all the damage. Workspace spend limits let you set a hard ceiling per workspace; once you cross it, requests in that workspace fail until you raise the cap or the period resets. Pair that with usage alerts (email at a configurable percentage of the cap) and you have both the soft signal and the kill-switch.

For Claude Code subscribers, the /usage slash command shows your current 5-hour billing window, remaining quota, and a rough cost estimate. /context shows context fill — long-context sessions are where Opus runs become silently expensive. Team and Enterprise plans add a team-aggregate view in the same console.anthropic.com usage surface.

`ccusage` and similar CLI cost trackers

ccusage is the de-facto open-source tracker for Claude Code in 2026. It reads the JSONL session files Claude Code writes locally to ~/.claude/projects/, parses every tool call, and gives per-session, per-day, and per-model cost estimates without hitting the Anthropic API. It works offline, surfaces per-session cost (so you can spot the one 4-hour Opus debug that cost $28), and can be aliased into your shell prompt or statusline.

Neighbouring tools worth knowing:

Claude Code Usage Monitor (Maciek-roboblog/Claude-Code-Usage-Monitor) — terminal dashboard with burn-rate predictions and “time to limit” warnings.
claude-usage (phuryn/claude-usage) — local dashboard with charts; Pro/Max subscribers get a progress bar.
ccflare — alternative session-file parser, sometimes faster on large histories.
cccost (badlogic/cccost) — runs Claude Code through a thin wrapper that instruments every call and writes a per-session .usage.json with real token counts and cost, ideal for a per-session ceiling or statusline widget.

None solve team-wide spend well; they’re per-developer tools reading local files. For team-level visibility you still need the Anthropic Console (or an LLM-observability layer — see below).

Cursor cost dashboard

Cursor’s current individual plans are Free, Pro ($20), Pro+ ($60), and Ultra ($200). Agent usage is metered against plan allowances and model pricing rather than the retired Fast/Slow request buckets. Use the live Settings → Usage view for allowance, model-level consumption, additional usage, and any team spending controls available to your account.

OpenAI usage dashboard

For anyone using GPT-5.x as a cross-vendor fallback (Q3), OpenAI’s Usage dashboard (platform.openai.com/usage) is the equivalent: spend by day, model, and API key, plus configurable hard limits (cuts you off) and soft limits (email alert). OpenAI’s hard limits apply at the organisation level — if you have multiple projects sharing one org, set per-project budgets for isolation.

Setting hard caps + alerts (provider-side and self-hosted)

The pattern is the same everywhere; only the UI differs.

Anthropic — Console → Settings → Workspaces → Spend limit. Set a monthly hard cap and an email alert at 80% (and optionally 50%). For Claude Code subscribers, the subscription is a soft cap via the 5-hour window, but any API-key usage should still have a workspace limit.
OpenAI — Settings → Billing → Usage limits. Set hard and soft limits; toggle per-project so the cap applies where you need it.
Cursor — Settings → Team → Spending controls. Per-user caps for admins; individuals can set their own in personal settings.
OpenRouter / AI.cc — per-key spend limits plus webhook notifications. Useful when you’ve abstracted across providers.
Self-hosted observability — Langfuse, Helicone, Portkey, Vantage, and Traceloop are the dominant 2026 stack for trace-level cost attribution. They sit in front of your LLM calls (proxy or SDK middleware), tag every request with metadata (session, user, agent, PR number), and let you alert on derived metrics — e.g. “alert if any single agent session crosses $5”. Overkill for solo developers, the right answer once a team has 3–4+ agent users.

Routing expensive loops to Haiku/Sonnet

Cost monitoring without routing is only observation. Current Anthropic rates are Fable 5 at $10/$50, Opus 4.8 at $5/$25, Sonnet 5 at the introductory $2/$10 through August 31 (then $3/$15), and Haiku 4.5 at $1/$5 per MTok. These rate-card ratios assume the same token mix; actual task cost also depends on verbosity, caching, tools, and retries.

Almost always on Haiku:

Code review subagents. Reading a diff and flagging issues is well within Haiku 4.5’s competence, and you want to fan out 3–5 reviewers anyway.
Codemods and mechanical refactors. “Rename this function across 200 files.” Haiku does this fine; Opus is a 5× tax for zero quality gain.
Triage, classification, summarisation. Sorting tickets, classifying log lines, summarising changelogs.
Subagent fan-out. If the orchestrator is Sonnet, the leaves should usually be Haiku.

Justifies Sonnet, Opus, or Fable 5:

Fable 5 (claude-fable-5) — evaluate it for complex, long-running work when its measured result justifies the $10/$50 rate card. Subagents may inherit Fable unless explicitly pinned, so set their model deliberately and monitor total task cost. See model comparison for the current breakdown.
Plan-mode planning for non-trivial changes — use Sonnet 5 for cost-efficient planning; move to Opus 4.8 or Fable 5 when the extra reasoning quality justifies its cost.
Cross-file refactors holding 30+ files in coherent context — Opus 4.8 or Fable 5 when complexity is highest.
Normal coding loops (read, edit, test, fix) — Sonnet 5.
Anything where Haiku has failed in A/B against Sonnet. Evidence-based, not vibes-based.

Step-by-step: building cost visibility

Open the Anthropic Console and look at last month. console.anthropic.com → Usage. Filter by model. Note how much went to Opus vs Sonnet vs Haiku, which day was the most expensive, and how that compares to your plan budget. First pass is purely diagnostic — don’t change anything yet.
Install ccusage locally. npm install -g ccusage. Run it after a typical session to see per-session cost. Once you trust the numbers, alias it into your shell prompt or statusline so you see live cost during work, not at month-end.
Set a workspace spend hard cap in the Anthropic Console. Pick ~1.3× your typical month — high enough not to fire in normal use, low enough to catch a runaway loop overnight. Console → Settings → Workspaces → Spend limit.
Set soft alerts at 50% and 80% of the cap. Same screen, “Usage alerts” — add email notifications. Verify by temporarily lowering the threshold until it fires once, then resetting.
Do the same for OpenAI if you use GPT-5.x as a fallback. platform.openai.com/usage → Limits. If you have multiple projects in one org, set per-project, not just at org level.
Identify your three most expensive loops from the dashboard. Last 30 days, pick the three model-loop combinations that ate the most tokens. For each: did this need Opus, or would Sonnet/Haiku do the same job?
Re-route the top loop to a cheaper tier. Subagent on Opus → Haiku frontmatter change. Bulk codemod on Sonnet → run it via Haiku. Commit the change, don’t just remember to do it.
Add a per-PR cost line to your workflow. Either a Langfuse / Helicone proxy tagging by branch, or ccusage --since <branch-start> at the end of each PR. Make cost legible at the unit of work you ship.
Make it a weekly habit. Every Friday: open the dashboard, look at the per-model split, ask “did anything surprise me?” Surprises are where the next routing decision lives.
Re-check after every model release. When a new tier ships, re-rank your top three loops against the new price card. The whole point of visibility is that re-routing becomes a one-line config change.

Common pitfalls

No per-PR or per-task cost. You know your monthly bill but can’t say what shipping last Tuesday’s feature cost. Without per-unit cost you can’t tell which features are expensive to build, and can’t decide whether an agent is worth running on that class of task. Fix: tag traces by PR or branch (Langfuse / Helicone) or read ccusage --since at the end of each branch.
Runaway loops with no kill-switch. Retry loops can consume budget quickly. Pair provider limits with a measured per-session threshold, a max-token limit, or an observability alert that cancels anomalous sessions.
Ignoring outliers in the monthly total. If 70% of last month’s spend came from three sessions, that’s where the next optimisation lives — not in shaving 5% off every other session. Sort by cost descending and read the top three.
Optimising the wrong loop. Spending an afternoon migrating subagents to Haiku to save $4/month while the real Opus debug loop quietly costs $200/month. Cost-rank loops before optimising; the dashboard tells you which fish are big.
Hard cap with no alerting. The cap fires, your agent dies mid-PR, and you triage the wrong problem. Always pair a hard cap with a soft alert at 80%.
Trusting the model’s self-reported usage instead of the dashboard. Claude Code’s /usage is fine for the current session; for monthly accounting trust the Console. Local estimators (including ccusage) can drift when the model card or tokenizer changes.
Assuming token count tracks the published rate card. The per-token price is only half the bill — the token count is the other half, and tokenizer changes between model versions can shift how many tokens the same input text produces. Same prompt, same rate card, different bill. Trust the dashboard total, not a back-of-envelope from the published per-token rate.

Cost monitoring — alerts, hard caps, and cheap-model routing

Why this matters in 2026

What “max score” actually looks like

Current landscape (web-search-verified)

Anthropic Console usage view

`ccusage` and similar CLI cost trackers

Cursor cost dashboard

OpenAI usage dashboard

Setting hard caps + alerts (provider-side and self-hosted)

Routing expensive loops to Haiku/Sonnet

Step-by-step: building cost visibility

Common pitfalls

How to verify you’re there

Further reading

Cost monitoring — alerts, hard caps, and cheap-model routing

Why this matters in 2026

What “max score” actually looks like

Current landscape (web-search-verified)

Anthropic Console usage view

ccusage and similar CLI cost trackers

Cursor cost dashboard

OpenAI usage dashboard

Setting hard caps + alerts (provider-side and self-hosted)

Routing expensive loops to Haiku/Sonnet

Step-by-step: building cost visibility

Common pitfalls

How to verify you’re there

Further reading

`ccusage` and similar CLI cost trackers