robin · Senior Software Engineer - Applied AI

Reducing MCP Token Usage By 85%

3 Apr 2026 · 6 min read · 1,022 words

MCPs are not dead... I built tldr to cut 86% of MCP schema tokens and 73% of response size with a local gateway that sits between coding agents and upstream MCP servers.

41 GitHub tools cost 24k tokens in context before the model does anything. A 200-item API response is 313 KB dumped straight into context. Both problems compound every session.

The Cost Shows Up in Two Different Places

Most people talk about MCP token usage as one problem. It is two problems.

The first problem is schema injection. If your harness connects directly to every upstream MCP server, the model sees every tool definition up front. That cost lands before the first useful call.

The second problem is response volume. Once the model does call a tool, it often gets a large JSON payload back. That payload goes straight into context unless something stands in the way.

tldr is a local MCP gateway that sits between the harness and the upstream servers. The harness connects to one server, tldr serve, instead of talking to the upstream servers directly. That wrapper exposes 5 tools: search_tools, execute_plan, call_raw, inspect_tool, and get_result.


That is the whole move. Shrink the control plane. Keep the data plane local until the model asks for a specific fragment.
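Concretely, the harness rewiring is a one-entry config change. The exact format depends on your client; this sketch uses the common `mcpServers` JSON shape and assumes the `tldr` binary is on your PATH:

```json
{
  "mcpServers": {
    "tldr": {
      "command": "tldr",
      "args": ["serve"]
    }
  }
}
```

The upstream servers no longer appear in the harness config at all; tldr owns those connections.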

Tool Compression Works Because Discovery Is Deferred

The wrapper does not forward raw MCP tool schemas to the model. It compiles each upstream tool into a smaller Capability record with 7 fields: serverName, toolName, summary, tags, riskLevel, inputShape, and outputShape.

That matters because the model usually does not need the full JSON schema for 41 tools at the top of the conversation. It needs a short list of candidates that match the current task. search_tools handles that first pass. inspect_tool is the narrow follow-up when one tool needs more detail.

The current GitHub MCP server is a clean example. I measured it using the same path tldr wrap uses in the CLI. The raw tool list contains 41 tools. The serialized raw schemas come out to about 24,473 tokens. The compiled capability index comes out to about 3,482 tokens. That is an 85.77% reduction before the model has made a single tool call.

$ tldr wrap github
Wrapped github: 41 tools -> ~24473 schema tokens -> ~3482 wrapped tokens (86% reduction)

The token math is intentionally rough. tldr estimates schema tokens as one token per 4 bytes of serialized JSON. It uses the same approximation for the wrapped capability index. I am not claiming those are universal tokenizer counts. I am claiming the comparison is internally consistent, because both sides use the same estimator.

That consistency is enough for engineering work. If one representation is 24,473 units and the other is 3,482 under the same meter, you do not need a philosophical debate about tokenization to know which one is cheaper.
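The meter is easy to reproduce. This is my reconstruction of the 1-token-per-4-bytes heuristic, not tldr's code; the round-up is my own choice so tiny payloads cost at least one token:

```go
package main

import "fmt"

// estimateTokens approximates token count as one token per 4 bytes
// of serialized JSON, matching the article's stated heuristic.
func estimateTokens(serialized []byte) int {
	return (len(serialized) + 3) / 4 // round up
}

// reduction returns the percentage saved by the wrapped representation
// when both sides are measured with the same estimator.
func reduction(rawTokens, wrappedTokens int) float64 {
	return 100 * (1 - float64(wrappedTokens)/float64(rawTokens))
}

func main() {
	// The GitHub numbers from above.
	fmt.Printf("%.2f%%\n", reduction(24473, 3482)) // prints "85.77%"
}
```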

Response Shielding Fixes the Second Half of the Bill

Shrinking the tool surface is only half the job. If the model still gets a 300 KB API response dumped into context after every call, the savings disappear.

tldr stores the raw result locally, then applies shielding before it returns anything to the harness. The default policy has 3 hard limits. Output is targeted to stay within 64 KB. Arrays are capped at 50 elements. Strings are capped at 8,192 characters.

Those limits are not three competing ideas. They are a sequence.

If the raw result is already small, tldr returns it as-is. If the result is valid JSON, it tries to summarize structurally. Top-level arrays get trimmed to 50 items first, because that usually preserves useful shape at much lower cost. Long strings get clipped to 8,192 characters. If the summarized JSON is still too large, tldr falls back to byte truncation and returns metadata that says exactly what was cut.

The full raw payload does not disappear. It goes into the local result store under a ref such as p1:s1 or raw:3. That ref is what makes shielding practical instead of destructive.

I measured a representative 200-item GitHub-style issues payload through that same shielding path. The raw JSON was 312,877 bytes. The shielded result visible to the model was 85,620 bytes. That is a 72.63% reduction. The main savings came from trimming the top-level array from 200 items to 50 summarized items.


This is why I think people underestimate the second part of the problem. Tool compression saves prompt budget before execution. Response shielding saves prompt budget after execution. You need both.

The Result Store Changes the Shape of the Interaction

The reason shielding works is that the model is not trapped with the first response. It can come back for exactly what it needs.

get_result supports plain pagination with offset and limit. It supports field projection with fields. It supports nested navigation with path. It also supports ripgrep-backed search against stored results with pattern, before, after, and max_matches.

That last part matters more than it looks.

I first implemented in-process regex search in Go because it was simple and it kept everything in memory. I switched to the real rg binary because repeated searches over large stored payloads are a different workload. Models already know how to think in ripgrep terms. rg also gives me line-oriented search semantics that match what people expect when they say "grep this response".

The store materializes searchable text into a cache file the first time a result is searched, then reuses that file on the next search. That avoids re-rendering the same large JSON payload on every request. Search results expire with the underlying stored result, and the cache files are deleted on eviction.


The result store also has hard retention limits. Results live for 1 minute by default. Total in-memory storage is capped at 128 MB. If disk-backed storage is enabled, expired entries are removed from disk during normal access instead of waiting forever. That keeps the wrapper honest. A local cache that grows without a bound is not a cache. It is deferred operational debt.

Install and Try It

If you want to test the wrapper yourself, the install path is one command:

curl -sSfL https://raw.githubusercontent.com/robinojw/tldr/main/install.sh | sh

Then wrap a real MCP server, point your harness at tldr serve, and watch what disappears from the prompt. That is the fastest way to see whether the savings are real in your setup.