Cost monitoring — alerts, hard caps, and cheap-model routing
Q21 · Operations & hygiene: How do you monitor token usage / agent cost?
Max-score answer: “I have alerts / hard caps and deliberately use cheaper models for expensive loops.”
Why it matters: Without monitoring you can’t reason about model routing or session length. The cheapest cost-control lever is visibility.
Why this matters in 2026
The FinOps Foundation’s 2026 survey shows 98% of organisations now actively manage AI spend (up from 63% a year earlier), yet only 44% have any financial guardrails wired in. The gap is visible in real numbers: an unconstrained coding agent solving one SWE-bench-class task burns $5–8 in API fees, agents make 3–10× more LLM calls than chatbots, and in April 2026 a public Fortune-500 audit attributed a $400M collective overspend to agent sessions running without per-session ceilings.
For an individual developer the math is more brutal, not less. A Claude Code session can chew through $30–60 in Opus 4.7 tokens in a single afternoon if you wander into a multi-hour debug loop. A “small” subagent fan-out (five parallel reviewers reading a 500-file repo) can cost $40+ on Opus and roughly a fifth of that on Haiku 4.5 at the published rates. A misconfigured cron agent that loops on an error and keeps re-prompting itself can vaporise a month’s budget overnight.
The cheapest control lever is visibility. You cannot reason about model routing (Q3), session length, or which subagents to push to Haiku if you don’t know what each loop costs. Cost monitoring is not a finance task — it’s the feedback loop that makes every other operational decision honest. Without it you’re guessing, and the bill will tell you you guessed wrong four weeks later.
What “max score” actually looks like
You get full marks on Q21 only when all four are true:
- You have a dashboard you actually look at. Not “there is a usage page in the Console”: you’ve bookmarked it, check it weekly, and can tell me from memory roughly what you spent last month and which model dominated. For Claude Code that’s the Anthropic Console plus a local tool like ccusage. For Cursor it’s the per-user dashboard. For raw API or OpenRouter it’s the provider console plus per-key spend filters.
- You have hard caps configured at the provider. Not “I’ll watch it”, but an actual workspace spend limit in the Anthropic Console (or OpenAI usage hard limit, or Cursor team spend cap) that cuts you off before a runaway loop drains the card. The point of a hard cap is that it fires when you’re asleep.
- You have alerts that fire before the cap. Soft alerts at 50% and 80% of your daily or weekly budget — email, Slack, or webhook — so you adjust before hitting the wall. Setting a hard cap with no alerting means your first signal of trouble is your agent grinding to a halt mid-task.
- You route expensive loops to cheaper models on purpose. Five parallel code-reviewer subagents run on Haiku 4.5, not Opus. A 200-file codemod runs on Sonnet or Haiku. You don’t run “smart” models on dumb loops, and you can name two loops last week where you specifically chose Haiku to save money.
Anything less — “I check the bill at month-end”, “I have a feeling Opus is expensive”, “I keep meaning to set a limit” — is mid-tier on Q21.
Current landscape (web-search-verified)
Anthropic Console usage view
The Anthropic Console (console.anthropic.com) is the primary surface for anyone on the Claude API or a paid Claude Code plan. The usage view breaks spend down by workspace, API key, model, and day, which is exactly the granularity you need to spot when one subagent is doing all the damage. Workspace spend limits let you set a hard ceiling per workspace; once you cross it, requests in that workspace fail until you raise the cap or the period resets. Pair that with usage alerts (email at a configurable percentage of the cap) and you have both the soft signal and the kill-switch.
For Claude Code subscribers, the /usage slash command shows your current 5-hour billing window, remaining quota, and a rough cost estimate. /context shows context fill — long-context sessions are where Opus runs become silently expensive. Team and Enterprise plans add a team-aggregate view at claude.com/console.
ccusage and similar CLI cost trackers
ccusage is the de-facto open-source tracker for Claude Code in 2026. It reads the JSONL session files Claude Code writes locally to ~/.claude/projects/, parses every tool call, and gives per-session, per-day, and per-model cost estimates without hitting the Anthropic API. It works offline, surfaces per-session cost (so you can spot the one 4-hour Opus debug that cost $28), and can be aliased into your shell prompt or statusline.
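To make the mechanism concrete, here is a minimal sketch of the kind of aggregation a local tracker performs over those session files. It is not ccusage’s actual implementation. The record layout (a `message` object carrying `model` and a `usage` dict with `input_tokens` and `output_tokens`) is an assumption about the JSONL format, and `summarise_usage` and `summarise_directory` are hypothetical helper names.

```python
import json
from collections import defaultdict
from pathlib import Path


def summarise_usage(lines):
    """Sum input/output tokens per model from JSONL text lines.

    Assumed record shape (not a documented contract):
    {"message": {"model": "...", "usage": {"input_tokens": 1, "output_tokens": 2}}}
    """
    totals = defaultdict(lambda: {"input_tokens": 0, "output_tokens": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partially written or corrupt lines
        if not isinstance(record, dict):
            continue
        message = record.get("message")
        if not isinstance(message, dict):
            continue  # records without a message carry no usage data
        usage = message.get("usage")
        model = message.get("model")
        if not isinstance(usage, dict) or not model:
            continue
        totals[model]["input_tokens"] += int(usage.get("input_tokens") or 0)
        totals[model]["output_tokens"] += int(usage.get("output_tokens") or 0)
    return dict(totals)


def summarise_directory(root):
    """Aggregate every *.jsonl file under root, e.g. ~/.claude/projects/."""
    def all_lines():
        for path in sorted(Path(root).expanduser().rglob("*.jsonl")):
            with path.open(encoding="utf-8", errors="replace") as handle:
                yield from handle
    return summarise_usage(all_lines())
```

Token totals only become dollar figures once you multiply them by a rate card, which is one reason local estimates can differ from the provider’s bill.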
Neighbouring tools worth knowing:
- Claude Code Usage Monitor (Maciek-roboblog/Claude-Code-Usage-Monitor): terminal dashboard with burn-rate predictions and “time to limit” warnings.
- claude-usage (phuryn/claude-usage): local dashboard with charts; Pro/Max subscribers get a progress bar.
- ccflare: alternative session-file parser, sometimes faster on large histories.
- cc-budget: community enforcer that wraps Claude Code with per-session ceilings.
None solve team-wide spend well; they’re per-developer tools reading local files. For team-level visibility you still need the Anthropic Console (or an LLM-observability layer — see below).
Cursor cost dashboard
Cursor’s 2026 pricing is request-based with a tier-dependent monthly cap and overage billing. The dashboard at cursor.com/dashboard (also under Settings → Account) shows requests used, remaining, overage rate, and a per-model breakdown. The team plan adds per-seat breakdowns. Cursor added per-user hard caps for team admins in 2026: set a ceiling per seat and Cursor will charge the user directly (or block them) once they cross it.
OpenAI usage dashboard
For anyone using GPT-5.x as a cross-vendor fallback (Q3), OpenAI’s Usage dashboard (platform.openai.com/usage) is the equivalent: spend by day, model, and API key, plus configurable hard limits (cuts you off) and soft limits (email alert). OpenAI’s hard limits apply at the organisation level, so if you have multiple projects sharing one org, set per-project budgets for isolation.
Setting hard caps + alerts (provider-side and self-hosted)
The pattern is the same everywhere; only the UI differs.
- Anthropic — Console → Settings → Workspaces → Spend limit. Set a monthly hard cap and an email alert at 80% (and optionally 50%). For Claude Code subscribers, the subscription is a soft cap via the 5-hour window, but any API-key usage should still have a workspace limit.
- OpenAI — Settings → Billing → Usage limits. Set hard and soft limits; toggle per-project so the cap applies where you need it.
- Cursor — Settings → Team → Spending controls. Per-user caps for admins; individuals can set their own in personal settings.
- OpenRouter / AI.cc — per-key spend limits plus webhook notifications. Useful when you’ve abstracted across providers.
- Self-hosted observability — Langfuse, Helicone, Portkey, Vantage, and Traceloop are the dominant 2026 stack for trace-level cost attribution. They sit in front of your LLM calls (proxy or SDK middleware), tag every request with metadata (session, user, agent, PR number), and let you alert on derived metrics — e.g. “alert if any single agent session crosses $5”. Overkill for solo developers, the right answer once a team has 3–4+ agent users.
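Whichever provider you use, the cap-plus-alerts pattern reduces to a small piece of threshold logic. The sketch below is illustrative only: `budget_status` and its return labels are hypothetical names, and real providers evaluate these thresholds on their own side rather than in your code.

```python
def budget_status(spend, hard_cap, alert_fractions=(0.5, 0.8)):
    """Classify current spend against a hard cap and soft alert thresholds.

    Returns (state, fired): state is "ok", "alert", or "blocked", and fired
    lists the soft-alert fractions that have already been crossed.
    """
    if hard_cap <= 0:
        raise ValueError("hard_cap must be positive")
    fired = [f for f in sorted(alert_fractions) if spend >= f * hard_cap]
    if spend >= hard_cap:
        return "blocked", fired  # the kill-switch: requests would now fail
    if fired:
        return "alert", fired    # the soft signal: time to adjust
    return "ok", fired
```

Keeping the soft thresholds separate from the cap mirrors the advice above: the alerts give you time to react, and the cap only has to work when you are not watching.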
Routing expensive loops to Haiku/Sonnet
Cost monitoring without routing is just observation. The real saving shows up when you push high-volume, low-judgment loops onto cheaper models. The 2026 ladder, in dollars per million input/output tokens: Opus 4.7 at $5/$25, Sonnet 4.6 at $3/$15, Haiku 4.5 at $1/$5. That is a 5× spread between Haiku and Opus on both input and output. Five parallel Haiku subagents, each processing the same token volume, cost the same as a single Opus pass, and less whenever each subagent reads only a slice of the input.
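A worked calculation makes the ladder easier to reason about. This is a minimal sketch that assumes the rates quoted above are dollars per million tokens; the model keys are illustrative labels rather than API model IDs, and the workload size is hypothetical.

```python
# Dollars per million tokens (input, output), copied from the ladder above.
RATES = {
    "opus-4.7": (5.00, 25.00),
    "sonnet-4.6": (3.00, 15.00),
    "haiku-4.5": (1.00, 5.00),
}


def run_cost(model, input_tokens, output_tokens):
    """Estimated dollar cost of one run at the listed per-million-token rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Hypothetical workload: 400k input tokens and 20k output tokens per pass.
opus_pass = run_cost("opus-4.7", 400_000, 20_000)
haiku_fanout = 5 * run_cost("haiku-4.5", 400_000, 20_000)
```

At identical token volumes the two come out equal ($2.50 each here). The fan-out is cheaper whenever each Haiku subagent handles a smaller slice than a full Opus pass would.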
Almost always on Haiku:
- Code review subagents. Reading a diff and flagging issues is well within Haiku 4.5’s competence, and you want to fan out 3–5 reviewers anyway.
- Codemods and mechanical refactors. “Rename this function across 200 files.” Haiku does this fine; Opus is a 5× tax for zero quality gain.
- Triage, classification, summarisation. Sorting tickets, classifying log lines, summarising changelogs.
- Subagent fan-out. If the orchestrator is Sonnet, the leaves should usually be Haiku.
Justifies Sonnet or Opus:
- Plan-mode planning for non-trivial changes — Opus 4.7.
- Cross-file refactors holding 30+ files in coherent context — Opus.
- Normal coding loops (read, edit, test, fix) — Sonnet 4.6.
- Anything where Haiku has failed in A/B against Sonnet. Evidence-based, not vibes-based.
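One way to make these defaults explicit is a small routing table that lives in the repo. The sketch below is an assumption about how you might encode the two lists above: the task labels, `DEFAULT_ROUTES`, and `pick_tier` are hypothetical, and falling back to Sonnet for unknown task kinds is a design choice, not a rule from any tool.

```python
from enum import Enum


class Tier(str, Enum):
    HAIKU = "haiku"
    SONNET = "sonnet"
    OPUS = "opus"


# Hypothetical task labels mapped to the tiers suggested in the lists above.
DEFAULT_ROUTES = {
    "code_review": Tier.HAIKU,
    "codemod": Tier.HAIKU,
    "triage": Tier.HAIKU,
    "summarisation": Tier.HAIKU,
    "coding_loop": Tier.SONNET,
    "planning": Tier.OPUS,
    "cross_file_refactor": Tier.OPUS,
}


def pick_tier(task_kind, overrides=None):
    """Return the model tier for a task kind.

    overrides holds evidence-based exceptions, e.g. a task where Haiku lost
    an A/B comparison against Sonnet. Unknown kinds fall back to Sonnet.
    """
    if overrides and task_kind in overrides:
        return overrides[task_kind]
    return DEFAULT_ROUTES.get(task_kind, Tier.SONNET)
```

For example, `pick_tier("triage", {"triage": Tier.SONNET})` records a deliberate promotion for one task kind while leaving every other default untouched.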
Step-by-step: building cost visibility
- Open the Anthropic Console and look at last month. console.anthropic.com → Usage. Filter by model. Note how much went to Opus vs Sonnet vs Haiku, which day was the most expensive, and how that compares to your plan budget. First pass is purely diagnostic: don’t change anything yet.
- Install ccusage locally. npm install -g ccusage (or pipx install ccusage). Run it after a typical session to see per-session cost. Once you trust the numbers, alias it into your shell prompt or statusline so you see live cost during work, not at month-end.
- Set a workspace spend hard cap in the Anthropic Console. Pick ~1.3× your typical month: high enough not to fire in normal use, low enough to catch a runaway loop overnight. Console → Settings → Workspaces → Spend limit.
- Set soft alerts at 50% and 80% of the cap. Same screen, “Usage alerts”: add email notifications. Verify by temporarily lowering the threshold until it fires once, then resetting it.
- Do the same for OpenAI if you use GPT-5.x as a fallback. platform.openai.com/usage → Limits. If you have multiple projects in one org, set limits per project, not just at org level.
- Identify your three most expensive loops from the dashboard. Over the last 30 days, pick the three model-loop combinations that ate the most tokens. For each: did this need Opus, or would Sonnet/Haiku do the same job?
- Re-route the top loop to a cheaper tier. Subagent on Opus → Haiku frontmatter change. Bulk codemod on Sonnet → run it via Haiku. Commit the change, don’t just remember to do it.
- Add a per-PR cost line to your workflow. Either a Langfuse / Helicone proxy tagging by branch, or ccusage --since <branch-start> at the end of each PR. Make cost legible at the unit of work you ship.
- Make it a weekly habit. Every Friday: open the dashboard, look at the per-model split, ask “did anything surprise me?” Surprises are where the next routing decision lives.
- Re-check after every model release. When a new tier ships, re-rank your top three loops against the new price card. The whole point of visibility is that re-routing becomes a one-line config change.
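Two of the steps above are simple arithmetic: sizing the cap and ranking loops by cost. A minimal sketch follows, assuming you have exported monthly totals and per-loop costs by hand. `suggested_cap` and `top_loops` are hypothetical helpers, and reading “typical month” as the median is an assumption.

```python
from statistics import median


def suggested_cap(monthly_spend_history, multiplier=1.3):
    """Hard cap as a multiple of the typical (median) month, in whole dollars."""
    if not monthly_spend_history:
        raise ValueError("need at least one month of history")
    return round(median(monthly_spend_history) * multiplier)


def top_loops(cost_by_loop, n=3):
    """Return the n most expensive (loop, cost) pairs, highest cost first."""
    return sorted(cost_by_loop.items(), key=lambda item: item[1], reverse=True)[:n]
```

The median keeps one unusually expensive month from inflating the cap, which matters because the cap exists to catch exactly that kind of month.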
Common pitfalls
- No per-PR or per-task cost. You know your monthly bill but can’t say what shipping last Tuesday’s feature cost. Without per-unit cost you can’t tell which features are expensive to build, and can’t decide whether an agent is worth running on that class of task. Fix: tag traces by PR or branch (Langfuse / Helicone) or read ccusage --since at the end of each branch.
- Runaway loops with no kill-switch. An agent stuck in error → retry → re-prompt can burn $50 in an hour. The hard cap is your last line of defence, but the better fix is a per-session ceiling: cc-budget, a max-token wrapper, or a Langfuse alert on session cost that auto-cancels.
- Ignoring outliers in the monthly total. If 70% of last month’s spend came from three sessions, that’s where the next optimisation lives, not in shaving 5% off every other session. Sort by cost descending and read the top three.
- Optimising the wrong loop. Spending an afternoon migrating subagents to Haiku to save $4/month while the real Opus debug loop quietly costs $200/month. Cost-rank loops before optimising; the dashboard tells you which fish are big.
- Hard cap with no alerting. The cap fires, your agent dies mid-PR, and you triage the wrong problem. Always pair a hard cap with a soft alert at 80%.
- Trusting the model’s self-reported usage instead of the dashboard. Claude Code’s /usage is fine for the current session; for monthly accounting trust the Console. Local estimators (including ccusage) can drift when the model card or tokenizer changes.
- Treating Opus 4.7 like Opus 4.6. Opus 4.7’s tokenizer can produce up to 35% more tokens for the same input text. Same prompt, same rate card, higher bill. Check your bill, not just the published rate.
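The per-session ceiling mentioned under runaway loops can be sketched as a small in-process guard. This is an illustration of the idea only, not how cc-budget or any observability product works: `SessionBudget`, `SessionBudgetExceeded`, and `run_with_ceiling` are hypothetical names, and the per-call costs are assumed to come from your own estimate.

```python
class SessionBudgetExceeded(RuntimeError):
    """Raised when a session's accumulated cost passes its ceiling."""


class SessionBudget:
    """Track spend for one agent session and stop it at a fixed ceiling."""

    def __init__(self, ceiling_usd):
        if ceiling_usd <= 0:
            raise ValueError("ceiling_usd must be positive")
        self.ceiling_usd = ceiling_usd
        self.spent_usd = 0.0

    def record(self, cost_usd):
        """Add the cost of one model call; raise once the ceiling is crossed."""
        self.spent_usd += cost_usd
        if self.spent_usd > self.ceiling_usd:
            raise SessionBudgetExceeded(
                f"session spent ${self.spent_usd:.2f}, ceiling is ${self.ceiling_usd:.2f}"
            )


def run_with_ceiling(call_costs, ceiling_usd):
    """Replay per-call costs through a budget; return (calls_completed, stopped_early)."""
    budget = SessionBudget(ceiling_usd)
    completed = 0
    for cost in call_costs:
        try:
            budget.record(cost)
        except SessionBudgetExceeded:
            # The call that crossed the line has already been paid for;
            # the guard only prevents the calls that would have followed.
            return completed, True
        completed += 1
    return completed, False
```

A retry loop costing $2 per attempt under a $5 ceiling stops on the third attempt instead of running all night, which is the behaviour a provider-level monthly cap cannot give you.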
How to verify you’re there
- You can open the Anthropic Console usage view in under 10 seconds and you’ve looked at it in the last 7 days.
- You have a workspace spend limit configured, and you know what number it’s set to without checking.
- You have soft alerts at 50% and 80% of the cap, and at least one of them has fired in the last 90 days (proving it works).
- You have ccusage or an equivalent installed locally, and you can read your last session’s cost in one command.
- You can name two specific loops in the last two weeks where you consciously chose Haiku or Sonnet over Opus to save cost.
- At least one subagent in your project runs on Haiku 4.5 specifically because you costed it out, not just because of a default.
- You know your top three most expensive sessions last month and what drove the cost.
- You have at least one per-PR or per-task cost number, even if it’s a manual ccusage --since reading.
- If you use a fallback vendor (Q3), you’ve set hard caps on that vendor too, not just on Anthropic.
- You re-check your cost dashboard after every model release.