Performance Optimization for Claude Code

Your Claude Code bill crept up and you can’t say why. One session scanned the whole repo, another kept a stale 80-file context alive across a dozen unrelated questions, and a quick “fix the lint” turned into a full-tree read. The fix isn’t to use Claude less — it’s to feed it less. Smaller, sharper context is cheaper and produces better answers, because the model isn’t diluting its attention across files that don’t matter.

What You’ll Walk Away With

A mental model for what actually drives token cost (and why 1M-context models don’t make it free)
The real levers for shrinking context: /context, /clear, /compact, --add-dir, and CLI-over-MCP
How automatic prompt caching saves you money — and how to stop accidentally busting it
A model-selection rule of thumb with the correct env var and aliases
A copy-paste set of token-efficient prompts and a real way to track spend

What Actually Costs Tokens

Cost scales with context size: the more Claude reads on each turn, the more you pay. Two things are working in your favor automatically, and one common assumption is no longer true.

Context is the cost driver

Every file, prompt, and prior turn is re-sent on each message. A bloated context taxes every single turn, not just the first one.

Caching is automatic

Claude Code caches stable content (system prompt, repeated context) and bills it at a steep discount. You don’t enable it — but you can bust it.

Compaction is automatic

When you approach the context limit, Claude Code summarizes older history for you. You can steer it with /compact.

1M context isn't free

Fable 5, Opus 5, and Sonnet 5 carry 1M-token windows on the Anthropic API (Haiku 4.5 is 200K). Claude Code plan and provider availability can differ: Opus 1M is included on Max, Team, and Enterprise but needs usage credits on Pro, while gateways can budget Sonnet 5 at 200K unless you select its 1M variant. A bigger window removes the hard ceiling, but you still pay per token.

See What’s In Your Context

You can’t optimize what you can’t see. The first move in any session that feels expensive is to look at what’s actually loaded.

Run /context inside the claude REPL to visualize what’s consuming the window — files, MCP tool definitions, conversation history.
Run /cost to see token usage and spend for the current session.
Configure your status line to show context usage continuously, so you notice bloat before it compounds.

If /context shows MCP servers eating a big slice before you’ve done anything, that’s a quick win — see the model and MCP section below.

Scope Context Tightly

The biggest savings come from not loading what you don’t need. Claude Code takes your prompt positionally; it does not take file or directory paths as positional arguments. To widen its reach, use --add-dir; to keep it narrow, just say which files matter in the prompt and let agentic search do the rest.

# Wrong: paths are not positional args
# claude "refactor auth" auth/ middleware/ utils/

# Right: name the entry point in the prompt; Claude explores from there
claude "Analyze the auth flow starting from src/auth/core.ts and tell me where
session validation happens."

# Right: widen access deliberately for a monorepo
claude --add-dir ../apps ../lib "Update the shared logger and every app that uses it."

Then manage context across the session:

Clear between unrelated tasks. /clear drops stale context so you’re not paying to re-send last task’s files on every new message.
Compact with focus when you must keep going. /compact Focus on the files we changed and the API contract summarizes history while preserving what matters.
Re-check with /context after big phases to confirm the window is actually lean.

Copy-paste prompt to make Claude help you stay lean:

Before you start, list the minimal set of files you need to read to do this task and
why. If a file isn't needed, don't open it. Then proceed.

Don’t Bust the Cache

Automatic prompt caching only helps when the cached prefix stays stable. The fastest way to throw the discount away is to change early, stable context — reordering files, swapping the system prompt, or editing a file that sits near the front of the context — which invalidates everything cached after it.

Practical rules:

Do unrelated work in a new session (/clear) rather than reshuffling the current one. A clean slate caches cleanly.
Keep stable reference material (architecture notes, conventions) in CLAUDE.md so it’s part of the cached system context instead of pasted ad hoc each time.
Batch related edits together so the cache is warm for the whole run, rather than interleaving them with unrelated questions that change the context shape.

Write Token-Efficient Prompts

Vague prompts produce verbose, exploratory responses; specific prompts get specific answers and read fewer files. Be opinionated about scope and output format.

Refactor the auth middleware from the old-jwt library to new-jwt. Keep the public API
identical. Change only src/middleware/auth.ts and its test. Reply with the diff, no prose.

Bug: intermittent "Invalid token format" on mobile login.
Likely site: JWT validation in src/mobile-auth.ts around line 42.
Read only that file and its direct imports. Show the corrected function and a one-line
explanation. Don't refactor anything else.

Implement Redis caching for getThing():
- check cache, on miss fetch + store with a 5-minute TTL
- TypeScript, with error handling for a Redis outage (fail open to the source)
Write the implementation and one test. Code first, brief notes after.

Copy-paste prompt for a batched multi-file change (one focused pass beats many tiny ones):

Add email verification to the user system. Plan it as: 1) add emailVerified to the user
model + migration, 2) a verification service, 3) request/confirm endpoints, 4) tests.
Implement it in that order in one pass. Touch only the user model, user service, the auth
controller, and their tests.

Pick the Right Model

Match the model to the task. Sonnet 5 handles most coding work at lower cost; reserve Opus 5 for genuinely hard reasoning; reach for Fable 5 when velocity and quality matter more than cost — complex multi-file refactors, building from scratch, or long-running tasks that demand peak intelligence; use Haiku 4.5 for trivial, high-volume edits. Switch mid-session with /model, or set it up front.

The environment variable is ANTHROPIC_MODEL (not CLAUDE_MODEL), and you can use either an alias or a full model ID:

# Aliases — simplest
claude --model sonnet "refactor the authentication system"
claude --model haiku  "fix the typo in README.md"

# Full IDs via env var, e.g. for the deepest review
ANTHROPIC_MODEL=claude-opus-5 claude "Security-audit the auth module for token handling bugs."

# Fable 5 for the hardest tasks (2× Opus cost; set cheaper subagent models explicitly)
claude --model fable "Rebuild the payment flow — design, implement, and test end-to-end."

The current family aliases are haiku, sonnet, opus, and fable; on the Anthropic API they resolve to Haiku 4.5, Sonnet 5, Opus 5, and Fable 5. Use claude-haiku-4-5-20251001, claude-sonnet-5, claude-opus-5, or claude-fable-5 when you need pinned first-party IDs. Third-party aliases can lag. For subagents, set model: haiku explicitly so cheap helpers do not inherit an expensive model.

Trim Model and MCP Overhead

MCP servers add tool definitions to your context on every turn, whether or not you use them. If /context shows them eating space, this is free money:

Prefer CLI tools over MCP servers when both exist. gh, aws, gcloud, and sentry-cli cost no persistent context — Claude just runs them. An MCP server for the same job sits in context idle.
Disable unused servers. Run /mcp to list configured servers and turn off the ones you’re not using this session.
Offload preprocessing to hooks and skills. A PreToolUse hook that greps a 10,000-line log for ERROR before Claude sees it can cut tens of thousands of tokens to hundreds. A “codebase-overview” skill hands Claude your architecture directly instead of making it read a dozen files to infer it.
Install code-intelligence plugins for typed languages so “go to definition” replaces grep-then-read-five-candidates.

Track What You’re Spending

For interactive work, /cost and the status line are enough. For scripted or CI runs, get structured usage out of print mode instead of grepping interactive output (which doesn’t emit per-token lines):

# Structured usage + result, machine-readable
claude -p --output-format json "summarize today's changes" | jq '.usage, .total_cost_usd'

# Hard budget ceiling for an automated run — stops before it overspends
claude -p --max-budget-usd 2.00 "fix all type errors in src/"

--max-budget-usd and --output-format json are print-mode (-p) flags. For ongoing visibility across a team, Claude Code’s usage analytics and OpenTelemetry metrics report token spend over time — wire those into your dashboards rather than parsing logs by hand.

When This Breaks

Costs spike for no obvious reason. Run /context — you’re almost certainly carrying a stale, oversized context. /clear and re-anchor.
The cache discount vanished. You changed early/stable context (reordered files, swapped the system prompt, edited a front-of-context file). Prefer a fresh session over reshuffling.
Compaction dropped something you needed. Auto-compaction summarizes aggressively. Use /compact with explicit focus instructions, or /clear and reload only the essentials.
--model haiku gives weak results on a hard task. Haiku is for trivial, high-volume work; step back up to sonnet, opus, or fable when the reasoning matters more than the cents.
A script “tracks tokens” but logs nothing. Interactive output doesn’t print per-token counts. Switch to claude -p --output-format json and read the usage field.

What’s Next

Cost Control — settings, limits, and team-wide budgeting
Monitoring Costs — OpenTelemetry and usage analytics in depth
Multi-File Workflows — keep large changes accurate without ballooning context
Efficiency Hacks — more speed-and-cost shortcuts