Cost Optimization
Reduce spend while keeping the performance gains from this guide
You kick off a large refactor in a session that already has 40 files loaded and a half-hour of conversation behind it. Responses crawl, every turn re-reads the whole window, and your spend climbs while the actual edits stall. The model is not slow — your context is bloated, your model choice is wrong for the task, and nothing is scoped. Performance tuning in Claude Code is mostly about feeding the model exactly what a task needs and nothing more.
Every turn you send re-processes the entire active context window. A session carrying dozens of files, long tool output, and a sprawling conversation pays that cost on every single response. The fix is not a faster model — it is a smaller, sharper window.
Current Claude models in Claude Code carry large windows (200K tokens on Haiku 4.5, up to 1M on Sonnet 4.6 and Opus 4.8 with the [1m] suffix, and a 1M window on Claude Fable 5), but a big window is a budget, not a target. The more you fill, the slower and pricier each turn becomes, and the more likely the model is to fixate on irrelevant material.
Structure your CLAUDE.md files so each one carries only what the model needs at that level. A bloated root memory file is loaded into every session whether it is relevant or not.
# Root CLAUDE.md (keep it short)## Critical Project Info Only- Architecture: Microservices with Node.js- Key commands: npm run dev, npm test- Coding standards: ESLint + Prettier
# Frontend CLAUDE.md## Frontend Specific- Framework: React 18 with TypeScript- State: Zustand stores in /src/stores- Components: /src/components follows atomic designScope subtree-specific guidance to a CLAUDE.md inside that subtree. Claude Code loads memory hierarchically, so backend rules do not need to live in the root file the frontend also pays for.
Scope the working set with --add-dir instead of letting a session pull in the whole tree.
# Pulls broad context, then searches everythingclaude> Analyze the entire codebase and find all TODO comments# Restrict the working set to what mattersclaude --add-dir src/auth src/middleware> Explain the authentication flow, then list TODO comments in src/auth/ about JWT expirationTwo commands control window size mid-session:
/clear wipes the conversation and loaded context entirely. Use it when switching to unrelated work./compact summarizes the conversation to reclaim space while keeping a distilled memory. Use it during a long session that is still on-topic./compact is lossy by design, so guide what it keeps.
Run /context to see usage. It renders a colored grid showing what is filling the window — system prompt, memory files, tools, conversation, and loaded files — so you can tell at a glance whether a few large files or a long conversation is the culprit.
claude> /contextIf the grid shows files dominating, /clear and reload only what you need. If conversation dominates, /compact with guidance.
Match the model to the task. Over-spending on Opus for a typo wastes money and latency; under-spending on Haiku for an architecture decision wastes your time on a weaker answer.
| Task | Recommended model | Why |
|---|---|---|
| Codebase-wide migrations, hardest debugging, long-running tasks | fable (Fable 5) | New tier above Opus; 1M context, 128K output |
| Typos, renames, formatting, mechanical edits | haiku (Haiku 4.5) | Fast and cheap; no deep reasoning needed |
| Feature implementation, routine bug fixes | sonnet (Sonnet 4.6) | Strong everyday default, 1M context |
| Architecture, large refactors, gnarly debugging | opus (Opus 4.8) | Excellent agentic reasoning and top SWE-Bench scores |
You can set or switch the model four ways. There is no auto-switching config — model selection is one of these explicit mechanisms:
/model sonnet (or fable, opus, haiku, opusplan, or a full name like claude-sonnet-4-6)claude --model opusexport ANTHROPIC_MODEL=haikumodel field in .claude/settings.json{ "model": "sonnet"}A practical pattern: start a session in Sonnet, and when you hit a genuinely hard problem, switch up.
# Mechanical cleanup — drop to the cheap model/model haikuRename every occurrence of `getUserData` to `fetchUserProfile` across src/, including imports.
# Hard architectural call — switch up/model opusEvaluate whether to split the monolithic OrderService into separate Order, Payment, andFulfillment services. Lay out the trade-offs and a migration sequence before any code.For supported models you can influence reasoning depth. The extended-thinking budget is enabled at its maximum of 31,999 tokens by default. Use MAX_THINKING_TOKENS to lower it (cheaper, faster) or set it to 0 to disable thinking entirely:
# Trim the thinking budget for a batch of simple editsexport MAX_THINKING_TOKENS=8000
# Disable extended thinking outrightexport MAX_THINKING_TOKENS=0Group repetitive work into one pass instead of paying context overhead per item.
Enumerate the targets first
claude> List every React component under src/components that is missing prop types.Run the batch with explicit guidance
claude> For each component you listed, add prop types inferred from actual usage in the file.Process them in groups of five and report which files you changed.Git is your undo for AI-driven refactors. Branch and commit before letting the model loose, so a bad run is one git reset away.
git checkout -b ai-refactor-authgit commit -am "Checkpoint before auth refactor"
# let Claude work, then if it goes sideways:git reset --hard HEAD~1Have the model write findings to a file once, then reference that file instead of re-deriving the analysis (and re-loading the source) every session.
claude> Analyze all API endpoints and write the results to API_ANALYSIS.md.# later, in a fresh session:claude> Using API_ANALYSIS.md, list every endpoint missing authentication.Do not guess at token usage — measure it. Claude Code exposes real surfaces for this:
/usage — token usage and spend patterns inside the session (consolidates the older /cost and /stats)/context — what is filling the window right nowFor team-wide or longitudinal tracking, enable OpenTelemetry metrics export. This is the only sanctioned programmatic metrics surface — there is no per-operation tokens/duration API to script against.
# Emit Claude Code metrics (tokens, cost, session counts) over OTLPexport CLAUDE_CODE_ENABLE_TELEMETRY=1export OTEL_METRICS_EXPORTER=otlpOr pin it in settings so every session reports:
{ "env": { "CLAUDE_CODE_ENABLE_TELEMETRY": "1", "OTEL_METRICS_EXPORTER": "otlp" }}Point the OTLP exporter at your metrics backend and you get cost-per-session and token-throughput dashboards from real data rather than invented benchmark tables.
Symptom: long delays before each reply.
Recover:
/context to see what is filling the window./clear if loaded files dominate; reload only the directories you need with --add-dir./compact with a keep/drop list if the conversation is the bulk./model haiku or sonnet) for mechanical work.Symptom: answers get vaguer as the session ages.
Recover:
/compact or start fresh.opus for the genuinely hard sub-problem.Symptom: the window is full and turns fail or truncate.
Recover:
/compact aggressively with explicit keep/drop guidance./clear between unrelated tasks instead of carrying everything forward.CLAUDE.md files that load into every session.CLAUDE.md files kept short and scoped to their subtree--add-dir instead of loading the whole repo/clear between unrelated tasks, /compact (with guidance) within a long one/usage, /context, and OTel metrics exportCost Optimization
Reduce spend while keeping the performance gains from this guide
CI/CD Integration
Apply these patterns in automated, headless pipelines
Team Scaling
Performance patterns for large teams sharing a codebase