
Reducing MCP Token Cost with Code Execution

You connected the GitHub server, a database server, a browser server, and a docs server. Useful — except the agent now drags every tool definition from all four into its context window at the start of every turn, whether it uses them or not. A single Chrome DevTools MCP loads dozens of tool schemas. Before you type a single word, tens of thousands of tokens are gone, responses are slower, and the model occasionally reaches for the wrong tool because it is staring at forty of them.

There is a better way to wire high-fan-out servers into an agent: let the model call them as code, on demand, instead of pre-loading every schema. Anthropic’s engineering team documented this “code execution with MCP” pattern in November 2025, and in one of their examples it cut a workflow from roughly 150,000 tokens of tool definitions and intermediate results down to about 2,000 — a 98.7% reduction.

  • Why every connected MCP server taxes your context window before the conversation starts
  • The code-execution pattern: presenting MCP tools as code the agent reads on demand
  • A practical recipe for wrapping any MCP server as a one-command CLI
  • A decision rule for when this pattern pays off and when it is overkill

Every MCP server you connect injects its full list of tool definitions — names, descriptions, and JSON schemas for every parameter — into the model’s context at the start of each request. A lean server adds 500–2,000 tokens. A rich one, like a browser or cloud-platform server with thirty-plus tools, can add far more. Connect a handful and you are spending real context on capability you may never touch this session.

Two costs compound:

  1. Upfront token tax. Tool descriptions sit in the system prompt every turn, competing with your files and conversation for space.
  2. Intermediate-result overflow. When a tool returns 20,000 characters of JSON, that result also lands in context — and if you chain calls (read from one server, transform, write to another), every intermediate payload flows through the model.
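
To make the upfront tax concrete, here is a rough back-of-the-envelope sketch. The per-server token counts and the 20-turn session length are illustrative assumptions, not measurements, and the estimate ignores any prompt caching your client may apply.

```shell
# Hypothetical schema sizes in tokens for four connected servers:
# three lean ones (within the 500-2,000 range above) and one rich one.
# These numbers are assumptions for illustration only.
per_turn=0
for server_tokens in 1500 2000 1800 12000; do
  per_turn=$((per_turn + server_tokens))
done

turns=20   # assumed session length
session_total=$((per_turn * turns))

echo "schema tokens per turn: ${per_turn}"
echo "schema tokens over ${turns} turns: ${session_total}"
```

Swap in the figures your own client reports to see how much of each request goes to tool definitions before any real work starts.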

The MCP best practices guide handles the first cost by keeping you to 3–5 active servers. Code execution attacks both costs at once, and it lets you keep a heavy server available without paying for it every turn.

The idea is simple: instead of presenting tools as schemas the model must hold in context, present them as code on a filesystem the model reads on demand. The agent writes a small script that calls only the tools it needs, runs it in a sandbox, and gets back only the final result — not every intermediate payload.

Anthropic frames the benefits as:

  • Progressive disclosure. The model loads a tool’s definition only when it decides to use it, the same way it reads a source file only when relevant — rather than reading all of them up front.
  • Context-efficient results. Data is filtered and transformed in the execution environment before anything returns to the model, so a 20,000-row query can come back as the three rows that mattered.
  • Composition. The agent can chain several tool calls in one script and persist intermediate results, instead of round-tripping each step through the context window.
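
As a minimal sketch of the second and third points, the script below filters a large result inside the execution environment and prints only a short summary. `fetch_rows` is a hypothetical stand-in that fabricates 20,000 rows locally; a real script would call an actual MCP tool at that step.

```shell
# Hypothetical stand-in for a tool call with a large result.
# Emits 20,000 rows of "id,status"; every 6,000th row is marked failed.
fetch_rows() {
  awk 'BEGIN {
    for (i = 1; i <= 20000; i++)
      print i "," ((i % 6000 == 0) ? "failed" : "ok")
  }'
}

workdir=$(mktemp -d)

# Persist the intermediate payload on disk instead of in the context window.
fetch_rows > "$workdir/rows.csv"

# Filter and count in the execution environment.
total=$(awk 'END { print NR }' "$workdir/rows.csv")
failed=$(grep ',failed$' "$workdir/rows.csv")

# Only this short summary would be returned to the model.
echo "scanned ${total} rows, failures:"
echo "$failed"

rm -rf "$workdir"
```

The full payload never leaves the sandbox; the model sees a few lines of output rather than every row.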

Practical Recipe: Wrap an MCP Server as a CLI


You do not need to build a sandbox to get most of the benefit. The most accessible version of this pattern is to compile an MCP server into a single command-line tool and hand the agent that command plus its --help. The model then learns the interface on demand — calling --help once — instead of carrying every tool schema all session.

The community tool mcporter (by Peter Steinberger) generates a self-contained CLI from any MCP server:

# Generate and compile a standalone CLI from any MCP server.
# The --command flag can be omitted when the server command is the first positional argument, as here.
npx mcporter generate-cli "npx -y chrome-devtools-mcp@latest" --compile

That produces a single binary wrapping every tool the server exposes. Now, instead of connecting the server and loading its schemas, the agent runs the CLI:

# The agent discovers the interface on demand — one --help call, not 30 schemas
./chrome-devtools-mcp --help
# → it then calls only the subcommands it needs, generated from the server's real
# tool names (navigate_page, performance_start_trace, performance_stop_trace, …)

Chrome DevTools MCP is a good candidate precisely because it is tool-heavy — it exposes performance tracing, console inspection, network monitoring, DOM input, and heap snapshots. Loaded as a normal MCP server, all of those schemas sit in context every turn. Wrapped as a CLI, the agent pulls in only the one subcommand it actually invokes.

All three tools drive a CLI the same way — through their shell/Bash tool — so this pattern is tool-agnostic:

Cursor’s agent runs terminal commands directly. Point it at the compiled CLI in a project rule or prompt: “Use ./chrome-devtools-mcp --help to discover the interface, then run a performance trace.” The tool schemas never enter the model’s context — only the command output does.

This pattern is an optimization, not a default. Use the simpler “connect the server normally” approach until you feel one of these pains:

| Reach for code execution when… | Stick with a normal MCP connection when… |
| --- | --- |
| A server exposes many tools (10+) you rarely all use | You have 1–3 lightweight servers |
| Tool responses are large (schemas, query dumps, page content) | Responses are small and you read them as-is |
| You call the server repeatedly in a loop or batch | You call it occasionally and interactively |
| You are chaining several servers and intermediate data is noise | A single round-trip is all you need |
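
The rule can also be written down as a small check. This is only a sketch: the `recommend` function and its yes/no arguments are hypothetical, and the one numeric threshold is the 10+ tools figure from the comparison above. Any single pain signal is enough to consider the pattern.

```shell
# Hypothetical helper, for illustration only.
# Usage: recommend TOOL_COUNT LARGE_RESPONSES LOOPED_CALLS CHAINED_SERVERS
# The last three arguments are "yes" or "no".
recommend() {
  tools=$1; large=$2; looped=$3; chained=$4
  if [ "$tools" -ge 10 ] || [ "$large" = yes ] ||
     [ "$looped" = yes ] || [ "$chained" = yes ]; then
    echo "code-execution"
  else
    echo "normal-mcp"
  fi
}

recommend 30 yes no no   # tool-heavy server with large responses
recommend 3 no no no     # a few lightweight tools, used interactively
```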

The compiled CLI can’t find the server. A generated CLI may reference the MCP server by a relative path; run it from the directory where it was generated, or regenerate it using an absolute path in the server command. Re-run mcporter generate-cli if the underlying server package has been updated.

The agent keeps trying to “connect” the CLI as an MCP server. It is conditioned to treat MCP things as servers. State explicitly in your prompt (and ideally a project rule) that the tool is a plain CLI to be run, not registered.

Token usage didn’t drop. If the agent dumps full CLI output into its summary, you saved on schemas but not on results. Instruct it to filter in the script/command and return only what you need — that is where the bulk of the savings live.
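
One way to enforce that, sketched under assumptions: a small wrapper that keeps the full output on disk and prints only matching lines, capped at a limit. The `filtered` helper is hypothetical, and the `printf` call stands in for a real CLI invocation.

```shell
# Hypothetical wrapper: run a command, keep its full output in a temp file,
# and print at most MAX lines matching PATTERN (case-insensitive).
# Usage: filtered PATTERN MAX COMMAND [ARGS...]
filtered() {
  pattern=$1; max=$2; shift 2
  log=$(mktemp)
  "$@" > "$log" 2>&1
  grep -i -- "$pattern" "$log" | head -n "$max"
  echo "(full output saved to $log)"
}

# Demo with printf standing in for a noisy CLI call.
filtered warn 2 printf 'ok\nWARN slow\nok\nwarn retry\nwarn again\n'
```

The agent can still open the saved file if the filtered lines are not enough, but by default only the short excerpt reaches the context window.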

The pattern is slower for trivial tasks. Generating and invoking a CLI adds overhead. For one or two light servers, a normal MCP connection is faster and simpler — don’t over-optimize.