Skip to content

Performance Tuning & Optimization

You kick off a large refactor in a session that already has 40 files loaded and a half-hour of conversation behind it. Responses crawl, every turn re-reads the whole window, and your spend climbs while the actual edits stall. The model is not slow — your context is bloated, your model choice is wrong for the task, and nothing is scoped. Performance tuning in Claude Code is mostly about feeding the model exactly what a task needs and nothing more.

  • A repeatable way to keep the context window lean so responses stay fast and cheap
  • Concrete rules for picking between Haiku, Sonnet, and Opus per task — and how to switch mid-session
  • Copy-paste prompts for focused analysis, batch refactors, and guided compaction
  • Real telemetry: how to actually measure tokens, cost, and duration instead of guessing
  • A recovery playbook for when a session bogs down or hits the context limit

Every turn you send re-processes the entire active context window. A session carrying dozens of files, long tool output, and a sprawling conversation pays that cost on every single response. The fix is not a faster model — it is a smaller, sharper window.

Current Claude models in Claude Code carry large windows (200K tokens on Haiku 4.5, up to 1M on Sonnet 4.6 and Opus 4.8 with the [1m] suffix, and a 1M window on Claude Fable 5), but a big window is a budget, not a target. The more you fill, the slower and pricier each turn becomes, and the more likely the model is to fixate on irrelevant material.

Structure your CLAUDE.md files so each one carries only what the model needs at that level. A bloated root memory file is loaded into every session whether it is relevant or not.

# Root CLAUDE.md (keep it short)
## Critical Project Info Only
- Architecture: Microservices with Node.js
- Key commands: npm run dev, npm test
- Coding standards: ESLint + Prettier
# Frontend CLAUDE.md
## Frontend Specific
- Framework: React 18 with TypeScript
- State: Zustand stores in /src/stores
- Components: /src/components follows atomic design

Scope subtree-specific guidance to a CLAUDE.md inside that subtree. Claude Code loads memory hierarchically, so backend rules do not need to live in the root file the frontend also pays for.

Scope the working set with --add-dir instead of letting a session pull in the whole tree.

Terminal window
# Pulls broad context, then searches everything
claude
> Analyze the entire codebase and find all TODO comments

Two commands control window size mid-session:

  • /clear wipes the conversation and loaded context entirely. Use it when switching to unrelated work.
  • /compact summarizes the conversation to reclaim space while keeping a distilled memory. Use it during a long session that is still on-topic.

/compact is lossy by design, so guide what it keeps.

Run /context to see usage. It renders a colored grid showing what is filling the window — system prompt, memory files, tools, conversation, and loaded files — so you can tell at a glance whether a few large files or a long conversation is the culprit.

claude> /context

If the grid shows files dominating, /clear and reload only what you need. If conversation dominates, /compact with guidance.

Match the model to the task. Over-spending on Opus for a typo wastes money and latency; under-spending on Haiku for an architecture decision wastes your time on a weaker answer.

TaskRecommended modelWhy
Codebase-wide migrations, hardest debugging, long-running tasksfable (Fable 5)New tier above Opus; 1M context, 128K output
Typos, renames, formatting, mechanical editshaiku (Haiku 4.5)Fast and cheap; no deep reasoning needed
Feature implementation, routine bug fixessonnet (Sonnet 4.6)Strong everyday default, 1M context
Architecture, large refactors, gnarly debuggingopus (Opus 4.8)Excellent agentic reasoning and top SWE-Bench scores

You can set or switch the model four ways. There is no auto-switching config — model selection is one of these explicit mechanisms:

  1. Mid-session/model sonnet (or fable, opus, haiku, opusplan, or a full name like claude-sonnet-4-6)
  2. At launchclaude --model opus
  3. Environment variableexport ANTHROPIC_MODEL=haiku
  4. Settings — the model field in .claude/settings.json
.claude/settings.json
{
"model": "sonnet"
}

A practical pattern: start a session in Sonnet, and when you hit a genuinely hard problem, switch up.

# Mechanical cleanup — drop to the cheap model
/model haiku
Rename every occurrence of `getUserData` to `fetchUserProfile` across src/, including imports.
# Hard architectural call — switch up
/model opus
Evaluate whether to split the monolithic OrderService into separate Order, Payment, and
Fulfillment services. Lay out the trade-offs and a migration sequence before any code.

For supported models you can influence reasoning depth. The extended-thinking budget is enabled at its maximum of 31,999 tokens by default. Use MAX_THINKING_TOKENS to lower it (cheaper, faster) or set it to 0 to disable thinking entirely:

Terminal window
# Trim the thinking budget for a batch of simple edits
export MAX_THINKING_TOKENS=8000
# Disable extended thinking outright
export MAX_THINKING_TOKENS=0

Group repetitive work into one pass instead of paying context overhead per item.

  1. Enumerate the targets first

    claude> List every React component under src/components that is missing prop types.
  2. Run the batch with explicit guidance

    claude> For each component you listed, add prop types inferred from actual usage in the file.
    Process them in groups of five and report which files you changed.

Git is your undo for AI-driven refactors. Branch and commit before letting the model loose, so a bad run is one git reset away.

Terminal window
git checkout -b ai-refactor-auth
git commit -am "Checkpoint before auth refactor"
# let Claude work, then if it goes sideways:
git reset --hard HEAD~1

Have the model write findings to a file once, then reference that file instead of re-deriving the analysis (and re-loading the source) every session.

claude> Analyze all API endpoints and write the results to API_ANALYSIS.md.
# later, in a fresh session:
claude> Using API_ANALYSIS.md, list every endpoint missing authentication.

Do not guess at token usage — measure it. Claude Code exposes real surfaces for this:

  • /usage — token usage and spend patterns inside the session (consolidates the older /cost and /stats)
  • /context — what is filling the window right now

For team-wide or longitudinal tracking, enable OpenTelemetry metrics export. This is the only sanctioned programmatic metrics surface — there is no per-operation tokens/duration API to script against.

Terminal window
# Emit Claude Code metrics (tokens, cost, session counts) over OTLP
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp

Or pin it in settings so every session reports:

.claude/settings.json
{
"env": {
"CLAUDE_CODE_ENABLE_TELEMETRY": "1",
"OTEL_METRICS_EXPORTER": "otlp"
}
}

Point the OTLP exporter at your metrics backend and you get cost-per-session and token-throughput dashboards from real data rather than invented benchmark tables.

Symptom: long delays before each reply.

Recover:

  1. Run /context to see what is filling the window.
  2. /clear if loaded files dominate; reload only the directories you need with --add-dir.
  3. /compact with a keep/drop list if the conversation is the bulk.
  4. Drop to a faster model (/model haiku or sonnet) for mechanical work.
  • CLAUDE.md files kept short and scoped to their subtree
  • ☐ Working set limited with --add-dir instead of loading the whole repo
  • /clear between unrelated tasks, /compact (with guidance) within a long one
  • ☐ Model matched to task: Haiku for mechanical, Sonnet for routine, Opus for hard
  • ☐ Batch operations for repetitive edits, with a test gate between groups
  • ☐ Checkpoint commits before large AI-driven changes
  • ☐ Real measurement via /usage, /context, and OTel metrics export

Cost Optimization

Reduce spend while keeping the performance gains from this guide

CI/CD Integration

Apply these patterns in automated, headless pipelines

Team Scaling

Performance patterns for large teams sharing a codebase