Skip to content

Enterprise Cost Control

Your team’s Claude Code API bill tripled this month, finance wants a hard cap by Friday, and nobody can tell you which projects or people are driving the spend. The instinct is to build a homegrown gateway that meters every request — but Claude Code already ships the visibility and guardrails you need. This guide shows the real cost surfaces and how to wire them into budgets without killing developer velocity.

  • A per-session and per-team way to see token spend using /cost, /stats, /context, and the status line — no custom telemetry code required
  • An OpenTelemetry export that feeds claude_code.cost.usage and claude_code.token.usage into a dashboard, broken down by team and model
  • Hard budget guardrails for automation: --max-budget-usd, --max-turns, and per-subagent model: haiku
  • A model-routing policy that reserves Opus 4.8 for the work that needs it and pushes everyday tasks to Sonnet 4.6 and Haiku 4.5
  • Copy-paste prompts to audit a bloated session, stand up an OTel exporter, and add a context-budget hook

Claude Code costs scale with context size and conversation length. Every message re-sends the system prompt, your CLAUDE.md, MCP tool definitions, and the accumulated conversation. Anthropic’s published baseline is around $6 per developer per day, with 90% of users under $12/day — but automation, long-running sessions, and oversized context blow past that fast.

The three biggest, most-controllable drivers:

  • Stale context — leftover files and conversation from a previous task you never cleared
  • MCP tool definitions — every connected server adds tool schemas to context even when idle
  • Model choice — running Opus 4.8 for formatting and lint fixes that Haiku 4.5 handles for a fraction of the cost

Start here before you restrict anything. You cannot manage what you cannot see, and Claude Code exposes usage at four levels.

  1. Check the current session. Run /cost in the REPL to see token usage and API cost for this session (API users), or /stats for usage patterns on Max/Pro:

    /cost
    Total cost: $0.55
    Total duration (API): 6m 19.7s
    Total code changes: 0 lines added, 0 lines removed
  2. See what’s eating context. Run /context to break down exactly what is consuming your window — system prompt, CLAUDE.md, MCP tool definitions, and conversation history. This is how you find the MCP server you forgot to disable.

  3. Keep usage visible continuously. Configure the status line to display context window usage so you see token pressure on every message instead of discovering it at invoice time.

  4. Roll up across the team. For API workspaces, set workspace spend limits and read cost and usage reporting in the Anthropic Console. The “Claude Code” workspace is created automatically on first authentication and centralizes org-wide tracking.

/cost is per-session. For a finance-grade view across every developer, export Claude Code’s OpenTelemetry metrics to your existing observability stack. The two metrics that matter for cost are claude_code.cost.usage (USD) and claude_code.token.usage (tokens, with an input/output type attribute).

Enable it with environment variables — no code:

Terminal window
# Turn on telemetry and pick your exporters
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp # otlp, prometheus, or console
export OTEL_LOGS_EXPORTER=otlp
# Point at your OTLP collector
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://collector.company.com:4317
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer ${OTEL_TOKEN}"

To attribute spend to teams, tag every session with a resource attribute:

Terminal window
export OTEL_RESOURCE_ATTRIBUTES="department=payments,team=checkout"

Now claude_code.cost.usage and claude_code.token.usage arrive in your backend sliced by department, team, model, and user.account_uuid, so a single Grafana panel answers “which team spent what, on which model.”

For org-wide rollout, push these settings through the managed settings file instead of asking every developer to set env vars:

{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp",
"OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.company.com:4317"
}
}

Interactive sessions self-correct — a developer notices a runaway loop and hits Escape. Headless and CI runs do not. For any non-interactive claude -p invocation, cap the blast radius directly with print-mode flags:

Terminal window
# Stop spending past $5 on this run, and never exceed 8 agentic turns
claude -p "Triage failing tests in src/ and propose fixes" \
--max-budget-usd 5.00 \
--max-turns 8 \
--model sonnet

--max-budget-usd stops the run once API spend crosses the limit; --max-turns exits with an error after N agentic turns so a misbehaving loop can’t burn your budget unattended. Both are print-mode (-p) only.

The single highest-leverage cost lever is not metering — it is model choice. Sonnet 4.6 handles the bulk of coding tasks at roughly three-fifths of Opus 4.8’s price; Haiku 4.5 is cheaper still — about a fifth of Opus — for mechanical work.

ModelRough price (input / output per Mtok)Reach for it when
Claude Fable 5~$10 / ~$50Plan-mode planning and final verification on the hardest tasks (2x Opus)
Claude Opus 4.8~$5 / ~$25Architecture, multi-step reasoning, gnarly debugging
Claude Sonnet 4.6~$3 / ~$15Everyday coding, refactors, test writing (your default)
Claude Haiku 4.5~$1 / ~$5Formatting, lint fixes, comments, high-volume mechanical edits

The top row is new: Claude Fable 5 (released June 9, 2026) sits a tier above Opus 4.8 at exactly twice its price. On a budget, treat it as a bracket rather than a default — escalate to Fable 5 (/model fable) for Plan-mode planning up front and the final verification pass at the end, and run the implementation in between on Sonnet 4.6 or Opus 4.8. Teams where budget matters less than velocity and quality can instead set Fable 5 as the default, since subagents still auto-run on Opus, Sonnet, and Haiku — the bulk token volume stays on cheaper tiers while the main loop gets maximum intelligence.

Switch mid-session with /model, or set a default in /config. The biggest win is in subagents: delegate verbose work to a cheaper model so the expensive context stays small. Specify the model per subagent in its configuration:

---
name: test-runner
description: Runs the test suite and returns only failures
tools: [Bash, Read]
model: haiku
---
You run the project's test command, then summarize only failing tests and their
error messages. Never paste full passing output.

Two more high-impact levers from the same playbook:

  • Tune extended thinking. It is on by default with a generous budget. For simple tasks, lower the effort level in /model on a model that supports it (the effort slider appears for Opus), disable thinking in /config, or cap it with MAX_THINKING_TOKENS=8000 — thinking tokens bill as output.
  • Cut MCP overhead. Run /context, then /mcp to disable servers you are not using. Prefer CLI tools (gh, aws, gcloud, sentry-cli) over MCP servers where possible — they add no persistent tool definitions to context.

Token cost is a direct function of context size. Claude Code auto-compacts near the limit and caches the system prompt, but the cheap wins are habits:

  • /clear between unrelated tasks. Stale context is re-billed on every subsequent message. Use /rename before clearing so you can /resume later.
  • Guide compaction. /compact Focus on the API changes and failing tests tells Claude what to keep when it summarizes.
  • Offload to hooks. A PreToolUse hook can grep a 10,000-line log down to the matching errors before Claude ever reads it, turning tens of thousands of tokens into hundreds.
  • Move workflow instructions out of CLAUDE.md into skills. CLAUDE.md loads at session start and is billed even on unrelated work; skills load on demand. Keep CLAUDE.md under ~500 lines.
  • /cost shows pennies but the bill is huge. You are on a subscription where /cost is not your invoice, or spend is coming from CI keys that never run /cost interactively. Reconcile against Console workspace reporting and per-key usage, not session output.
  • OTel dashboard is empty on Bedrock/Vertex/Foundry. Claude Code does not emit cost metrics through cloud providers. Track via the provider’s billing or a LiteLLM gateway with per-key spend; see LLM Gateway.
  • --max-budget-usd didn’t stop a runaway job. It only applies to print mode and only enforces against API-key billing. For subscriptions or interactive sessions there is no dollar meter — rely on workspace spend limits and --max-turns.
  • Telemetry lags or floods your backend. Metrics export every 60s and logs every 5s by default. Tune OTEL_METRIC_EXPORT_INTERVAL, and use the OTEL_METRICS_INCLUDE_* cardinality controls to keep storage costs down — a per-session-ID metric explosion can cost more than the Claude usage you are tracking.
  • A model downgrade tanks quality. Cost cuts that produce broken code are not savings. Keep Opus 4.8 available for the hard 10% and measure rework, not just token spend.
  • Leaving Fable 5 as your default after June 22, 2026. From June 9 through June 22, 2026 it is included on Pro, Max, Team, and seat-based Enterprise plans at no extra cost — past that window it bills usage credits at 2x Opus rates. Scope it back to planning and final verification unless the extra spend is deliberate.

Effective cost control is about balance: maximize value, eliminate waste. Get visibility first with /cost, /stats, and OTel; then apply model routing and guardrails. Most teams cut 30-50% without touching velocity — just by clearing context, right-sizing models, and turning off MCP servers they forgot were running.