Performance Optimization for Claude Code

Your Claude Code bill crept up and you can’t say why. One session scanned the whole repo, another kept a stale 80-file context alive across a dozen unrelated questions, and a quick “fix the lint” turned into a full-tree read. The fix isn’t to use Claude less — it’s to feed it less. Smaller, sharper context is cheaper and produces better answers, because the model isn’t diluting its attention across files that don’t matter.

  • A mental model for what actually drives token cost (and why 1M-context models don’t make it free)
  • The real levers for shrinking context: /context, /clear, /compact, --add-dir, and CLI-over-MCP
  • How automatic prompt caching saves you money — and how to stop accidentally busting it
  • A model-selection rule of thumb with the correct env var and aliases
  • A copy-paste set of token-efficient prompts and a real way to track spend

Cost scales with context size: the more Claude reads on each turn, the more you pay. Two things work in your favor automatically, caching and compaction, and one common assumption, that a 1M-token window makes context free, does not hold.

Context is the cost driver

Every file, prompt, and prior turn is re-sent on each message. A bloated context taxes every single turn, not just the first one.

Caching is automatic

Claude Code caches stable content (system prompt, repeated context) and bills it at a steep discount. You don’t enable it — but you can bust it.

Compaction is automatic

When you approach the context limit, Claude Code summarizes older history for you. You can steer it with /compact.

1M context isn't free

Fable 5, Opus 4.8, and Sonnet 4.6 all carry 1M-token windows (Haiku 4.5 is 200K). A bigger window removes the hard ceiling, but you still pay per token — discipline still matters.
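
To put rough numbers on the "re-sent on each message" point, here is a back-of-the-envelope sketch. The function name and every token count are hypothetical, and it deliberately ignores caching and compaction, so treat it as an illustration of how the cost scales rather than a billing model.

```shell
# Hypothetical estimate: total input tokens sent across a session in which every
# turn re-sends the loaded context plus all earlier turns. Ignores caching and
# compaction; the numbers are illustrative, not measurements.
session_input_tokens() {
  base=$1       # tokens of files and system context loaded up front
  per_turn=$2   # tokens each turn adds to the history
  turns=$3      # number of messages in the session
  total=0
  i=1
  while [ "$i" -le "$turns" ]; do
    total=$(( total + base + per_turn * i ))
    i=$(( i + 1 ))
  done
  echo "$total"
}

session_input_tokens 80000 500 12   # bloated context: prints 999000
session_input_tokens 8000 500 12    # lean context: prints 135000
```

With these made-up inputs, the bloated session sends roughly seven times as many input tokens as the lean one for the same twelve questions.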

You can’t optimize what you can’t see. The first move in any session that feels expensive is to look at what’s actually loaded.

  • Run /context inside the claude REPL to visualize what’s consuming the window — files, MCP tool definitions, conversation history.
  • Run /cost to see token usage and spend for the current session.
  • Configure your status line to show context usage continuously, so you notice bloat before it compounds.

If /context shows MCP servers eating a big slice before you’ve done anything, that’s a quick win — see the model and MCP section below.

The biggest savings come from not loading what you don’t need. Claude Code takes your prompt positionally; it does not take file or directory paths as positional arguments. To widen its reach, use --add-dir; to keep it narrow, just say which files matter in the prompt and let agentic search do the rest.

# Wrong: paths are not positional args
# claude "refactor auth" auth/ middleware/ utils/
# Right: name the entry point in the prompt; Claude explores from there
claude "Analyze the auth flow starting from src/auth/core.ts and tell me where session validation happens."
# Right: widen access deliberately for a monorepo
claude --add-dir ../apps ../lib "Update the shared logger and every app that uses it."

Then manage context across the session:

  1. Clear between unrelated tasks. /clear drops stale context so you’re not paying to re-send last task’s files on every new message.

  2. Compact with focus when you must keep going. Running /compact with an instruction, for example “Focus on the files we changed and the API contract”, summarizes history while preserving what matters.

  3. Re-check with /context after big phases to confirm the window is actually lean.

Automatic prompt caching only helps when the cached prefix stays stable. The fastest way to throw the discount away is to change early, stable context — reordering files, swapping the system prompt, or editing a file that sits near the front of the context — which invalidates everything cached after it.

Practical rules:

  • Do unrelated work in a new session (/clear) rather than reshuffling the current one. A clean slate caches cleanly.
  • Keep stable reference material (architecture notes, conventions) in CLAUDE.md so it’s part of the cached system context instead of pasted ad hoc each time.
  • Batch related edits together so the cache is warm for the whole run, rather than interleaving them with unrelated questions that change the context shape.
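
A rough way to see what a busted cache costs is to compare one turn's billed input with and without cache hits. This is a sketch under stated assumptions: the function name is hypothetical, the cached-token price is passed in as a percentage of the normal input price, and the value 10 used below is a placeholder, not a quoted rate. It also ignores any premium charged for writing to the cache.

```shell
# Hypothetical estimate of billed "input-token equivalents" for one turn.
# cached_price_pct is the price of a cached token as a percent of the normal
# input price; the 10 used below is an assumed placeholder, not a quoted rate.
effective_input_tokens() {
  context=$1           # tokens in the prompt prefix
  hit_pct=$2           # percent of the prefix served from cache (0-100)
  cached_price_pct=$3  # cached-token price as a percent of full price
  echo $(( context * (hit_pct * cached_price_pct + (100 - hit_pct) * 100) / 10000 ))
}

effective_input_tokens 50000 90 10   # warm cache: prints 9500
effective_input_tokens 50000 0 10    # cache busted by an early change: prints 50000
```

Under these assumptions, a single change near the front of the context turns a 9,500-token-equivalent turn back into a 50,000-token one.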

Vague prompts produce verbose, exploratory responses; specific prompts get specific answers and read fewer files. Be opinionated about scope and output format.

Refactor the auth middleware from the old-jwt library to new-jwt. Keep the public API
identical. Change only src/middleware/auth.ts and its test. Reply with the diff, no prose.

Match the model to the task:

  • Sonnet 4.6 handles most coding work at lower cost.
  • Opus 4.8 is for genuinely hard reasoning.
  • Fable 5 is for when velocity and quality matter more than cost: complex multi-file refactors, building from scratch, or long-running tasks that demand peak intelligence.
  • Haiku 4.5 is for trivial, high-volume edits.

Switch mid-session with /model, or set it up front.

The environment variable is ANTHROPIC_MODEL (not CLAUDE_MODEL), and you can use either an alias or a full model ID:

# Aliases — simplest
claude --model sonnet "refactor the authentication system"
claude --model haiku "fix the typo in README.md"
# Full IDs via env var, e.g. for the deepest review
ANTHROPIC_MODEL=claude-opus-4-8 claude "Security-audit the auth module for token handling bugs."
# Fable 5 for the hardest tasks (2× Opus cost; subagents still auto-run on Opus/Sonnet/Haiku)
claude --model fable "Rebuild the payment flow — design, implement, and test end-to-end."

The valid aliases are haiku, sonnet, opus, and fable; the matching full IDs are claude-haiku-4-5, claude-sonnet-4-6, claude-opus-4-8, and claude-fable-5. For subagents, set model: haiku in the subagent config so cheap helpers don’t run on an expensive model. See model comparison for the full capability and pricing breakdown.
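
If you script your Claude Code invocations, the rule of thumb above can be encoded as a small helper. This is only a sketch: the function name and the task categories (trivial, hard, peak) are invented for illustration, while the aliases it returns are the ones listed above.

```shell
# Hypothetical helper: map a rough task category to a model alias.
# The category names are invented for this sketch; the aliases come from the text.
pick_model() {
  case "$1" in
    trivial) echo haiku  ;;   # typos, renames, high-volume edits
    hard)    echo opus   ;;   # genuinely hard reasoning
    peak)    echo fable  ;;   # from-scratch or long-running work where cost is secondary
    *)       echo sonnet ;;   # default for most coding work
  esac
}

# Intended use (not executed here):
#   claude --model "$(pick_model trivial)" "fix the typo in README.md"
pick_model trivial    # prints haiku
pick_model refactor   # prints sonnet (falls through to the default)
```

Defaulting to sonnet means an unrecognized category never lands on the most expensive model by accident.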

MCP servers add tool definitions to your context on every turn, whether or not you use them. If /context shows them eating space, this is free money:

  • Prefer CLI tools over MCP servers when both exist. gh, aws, gcloud, and sentry-cli cost no persistent context — Claude just runs them. An MCP server for the same job sits in context idle.
  • Disable unused servers. Run /mcp to list configured servers and turn off the ones you’re not using this session.
  • Offload preprocessing to hooks and skills. A PreToolUse hook that greps a 10,000-line log for ERROR before Claude sees it can cut tens of thousands of tokens to hundreds. A “codebase-overview” skill hands Claude your architecture directly instead of making it read a dozen files to infer it.
  • Install code-intelligence plugins for typed languages so “go to definition” replaces grep-then-read-five-candidates.
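
The log-filtering idea can be sketched as a plain shell filter that a hook script might call. The function names and the synthetic log are hypothetical, and how the script is wired into a PreToolUse hook depends on your hook configuration, which is not shown here.

```shell
# Hypothetical pre-filter a hook script could call: keep only ERROR lines from
# a log on stdin, capped at a maximum count so the output stays small.
filter_errors() {
  max_lines=${1:-200}
  grep 'ERROR' | head -n "$max_lines"
}

# Demo input: a synthetic 10,000-line log containing three errors.
make_log() {
  i=1
  while [ "$i" -le 10000 ]; do
    case "$i" in
      1200|5600|9100) echo "line $i ERROR something failed" ;;
      *)              echo "line $i INFO ok" ;;
    esac
    i=$(( i + 1 ))
  done
}

# Count what survives the filter: three lines instead of ten thousand.
make_log | filter_errors 50 | wc -l
```

The cap matters as much as the grep: a log with thousands of ERROR lines would otherwise pass through nearly unfiltered.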

For interactive work, /cost and the status line are enough. For scripted or CI runs, get structured usage out of print mode instead of grepping interactive output (which doesn’t emit per-token lines):

# Structured usage + result, machine-readable
claude -p --output-format json "summarize today's changes" | jq '.usage, .total_cost_usd'
# Hard budget ceiling for an automated run — stops before it overspends
claude -p --max-budget-usd 2.00 "fix all type errors in src/"

--max-budget-usd and --output-format json are print-mode (-p) flags. For ongoing visibility across a team, Claude Code’s usage analytics and OpenTelemetry metrics report token spend over time — wire those into your dashboards rather than parsing logs by hand.
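
For a CI job, the structured output can also feed a simple pass/fail gate. The sketch below is an assumption-laden example, not an official pattern: both function names are hypothetical, and the wrapper relies on the .total_cost_usd field shown in the jq example above.

```shell
# Hypothetical CI gate: succeed only if a reported cost is within a limit.
# awk does the comparison because POSIX shell arithmetic is integer-only.
within_budget() {
  cost=$1
  limit=$2
  awk -v c="$cost" -v l="$limit" 'BEGIN { exit !(c + 0 <= l + 0) }'
}

# Sketch of a wrapper (defined but not called here; needs claude and jq on PATH).
# $1 is the prompt, $2 is the limit in USD.
run_with_gate() {
  cost=$(claude -p --output-format json "$1" | jq -r '.total_cost_usd')
  within_budget "$cost" "$2" || { echo "run cost \$$cost exceeded \$$2" >&2; return 1; }
}

within_budget 0.42 2.00 && echo "ok"   # prints ok
```

Unlike --max-budget-usd, which stops a run before it overspends, a gate like this only reports after the fact, so it fits best as a per-job alert alongside the hard ceiling.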