Context is the cost driver
Every file, prompt, and prior turn is re-sent on each message. A bloated context taxes every single turn, not just the first one.
Your Claude Code bill crept up and you can’t say why. One session scanned the whole repo, another kept a stale 80-file context alive across a dozen unrelated questions, and a quick “fix the lint” turned into a full-tree read. The fix isn’t to use Claude less — it’s to feed it less. Smaller, sharper context is cheaper and produces better answers, because the model isn’t diluting its attention across files that don’t matter.
/context, /clear, /compact, --add-dir, and CLI-over-MCPCost scales with context size: the more Claude reads on each turn, the more you pay. Two things are working in your favor automatically, and one common assumption is no longer true.
Context is the cost driver
Every file, prompt, and prior turn is re-sent on each message. A bloated context taxes every single turn, not just the first one.
Caching is automatic
Claude Code caches stable content (system prompt, repeated context) and bills it at a steep discount. You don’t enable it — but you can bust it.
Compaction is automatic
When you approach the context limit, Claude Code summarizes older history for you. You can steer it with /compact.
1M context isn't free
Fable 5, Opus 4.8, and Sonnet 4.6 all carry 1M-token windows (Haiku 4.5 is 200K). A bigger window removes the hard ceiling, but you still pay per token — discipline still matters.
You can’t optimize what you can’t see. The first move in any session that feels expensive is to look at what’s actually loaded.
/context inside the claude REPL to visualize what’s consuming the window — files, MCP tool definitions, conversation history./cost to see token usage and spend for the current session.If /context shows MCP servers eating a big slice before you’ve done anything, that’s a quick win — see the model and MCP section below.
The biggest savings come from not loading what you don’t need. Claude Code takes your prompt positionally; it does not take file or directory paths as positional arguments. To widen its reach, use --add-dir; to keep it narrow, just say which files matter in the prompt and let agentic search do the rest.
# Wrong: paths are not positional args# claude "refactor auth" auth/ middleware/ utils/
# Right: name the entry point in the prompt; Claude explores from thereclaude "Analyze the auth flow starting from src/auth/core.ts and tell me wheresession validation happens."
# Right: widen access deliberately for a monorepoclaude --add-dir ../apps ../lib "Update the shared logger and every app that uses it."Then manage context across the session:
Clear between unrelated tasks. /clear drops stale context so you’re not paying to re-send last task’s files on every new message.
Compact with focus when you must keep going. /compact Focus on the files we changed and the API contract summarizes history while preserving what matters.
Re-check with /context after big phases to confirm the window is actually lean.
Automatic prompt caching only helps when the cached prefix stays stable. The fastest way to throw the discount away is to change early, stable context — reordering files, swapping the system prompt, or editing a file that sits near the front of the context — which invalidates everything cached after it.
Practical rules:
/clear) rather than reshuffling the current one. A clean slate caches cleanly.CLAUDE.md so it’s part of the cached system context instead of pasted ad hoc each time.Vague prompts produce verbose, exploratory responses; specific prompts get specific answers and read fewer files. Be opinionated about scope and output format.
Refactor the auth middleware from the old-jwt library to new-jwt. Keep the public APIidentical. Change only src/middleware/auth.ts and its test. Reply with the diff, no prose.Bug: intermittent "Invalid token format" on mobile login.Likely site: JWT validation in src/mobile-auth.ts around line 42.Read only that file and its direct imports. Show the corrected function and a one-lineexplanation. Don't refactor anything else.Implement Redis caching for getThing():- check cache, on miss fetch + store with a 5-minute TTL- TypeScript, with error handling for a Redis outage (fail open to the source)Write the implementation and one test. Code first, brief notes after.Match the model to the task. Sonnet 4.6 handles most coding work at lower cost; reserve Opus 4.8 for genuinely hard reasoning; reach for Fable 5 when velocity and quality matter more than cost — complex multi-file refactors, building from scratch, or long-running tasks that demand peak intelligence; use Haiku 4.5 for trivial, high-volume edits. Switch mid-session with /model, or set it up front.
The environment variable is ANTHROPIC_MODEL (not CLAUDE_MODEL), and you can use either an alias or a full model ID:
# Aliases — simplestclaude --model sonnet "refactor the authentication system"claude --model haiku "fix the typo in README.md"
# Full IDs via env var, e.g. for the deepest reviewANTHROPIC_MODEL=claude-opus-4-8 claude "Security-audit the auth module for token handling bugs."
# Fable 5 for the hardest tasks (2× Opus cost; subagents still auto-run on Opus/Sonnet/Haiku)claude --model fable "Rebuild the payment flow — design, implement, and test end-to-end."The valid aliases are haiku, sonnet, opus, and fable; the matching full IDs are claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-8, and claude-fable-5. For subagents, set model: haiku in the subagent config so cheap helpers don’t run on an expensive model. See model comparison for the full capability and pricing breakdown.
MCP servers add tool definitions to your context on every turn, whether or not you use them. If /context shows them eating space, this is free money:
gh, aws, gcloud, and sentry-cli cost no persistent context — Claude just runs them. An MCP server for the same job sits in context idle./mcp to list configured servers and turn off the ones you’re not using this session.ERROR before Claude sees it can cut tens of thousands of tokens to hundreds. A “codebase-overview” skill hands Claude your architecture directly instead of making it read a dozen files to infer it.For interactive work, /cost and the status line are enough. For scripted or CI runs, get structured usage out of print mode instead of grepping interactive output (which doesn’t emit per-token lines):
# Structured usage + result, machine-readableclaude -p --output-format json "summarize today's changes" | jq '.usage, .total_cost_usd'
# Hard budget ceiling for an automated run — stops before it overspendsclaude -p --max-budget-usd 2.00 "fix all type errors in src/"--max-budget-usd and --output-format json are print-mode (-p) flags. For ongoing visibility across a team, Claude Code’s usage analytics and OpenTelemetry metrics report token spend over time — wire those into your dashboards rather than parsing logs by hand.