Claude API Context Window Management for Custom Tool Design

Context windows fill faster than the documentation suggests. Most of that overhead isn’t conversation, it’s the tool definitions themselves, sitting in context unused, turn after turn. We’ve rebuilt enough first-generation Claude integrations to say this plainly: the demo works, production breaks at turn 20, and the cause is almost always the schema architecture, not the model.

How Claude’s Context Window Actually Works

Tokens in, tokens out, what fills the window

Claude’s context window holds everything: the system prompt, tool definitions, conversation history, tool call results, and model responses. Every turn adds to that running total. Token counts compound across a session. A tool call that returns a 500-token JSON payload does so every time it fires, and that history stays in context unless you explicitly remove it.

The practical implication: an agent designed for a single-turn demo degrades over multi-turn sessions. The window fills. Claude starts dropping context, hallucinating prior steps, or returning truncated answers. This is the failure pattern Designodin sees most often in production audits of first-generation SMB Claude builds.

The 200K standard window and the 1M beta

Claude’s standard context window is 200,000 tokens, long by most model standards, but not infinite. A 1M token beta window exists for usage tier 4 organizations, accessed via the context-1m-2025-08-07 beta header. Most SMBs building Claude-powered tools are not on tier 4 and cannot assume 1M is available. Design for 200K. Treat anything beyond that as a future option, not a fallback.

Why Tool Definitions Are the Biggest Hidden Context Cost

What a bloated tool schema looks like

A tool schema includes the tool name, description, and input property definitions with types and descriptions. Written carelessly, a single tool definition runs 300–600 tokens. Write 20 tools that way and you’ve spent up to 12,000 tokens before the conversation starts. At 1,000 API calls per day, that overhead compounds into real cost fast.

Here’s what bloat looks like in practice. A “search_crm_contacts” tool that describes every field in your CRM schema, lists all possible filter combinations, and includes usage examples in the description is doing work the model doesn’t need. Claude needs enough context to call the tool correctly, not a documentation page embedded in the schema.

The action parameter pattern: consolidating tools

One effective way to reduce schema overhead is consolidating related tools behind a single action parameter. Instead of three separate tools, create_ticket, update_ticket, close_ticket, you define one ticket_manager tool with an action enum: create, update, close. Three schema definitions become one. The token saving is immediate. This pattern works when the actions share similar input shapes. It breaks down when inputs diverge significantly, in that case, separate tools are cleaner.

Four Strategies for Managing Context in Claude Tool Use

Tool search, definitions out of context until needed

Tool search keeps tool definitions entirely out of the context window until Claude explicitly requests them. You register tools in a tool registry. Claude retrieves only the definitions it needs for a given step. This is the highest-use strategy for agents with large tool sets, ten, twenty, fifty tools. The upfront token cost drops to near zero. The tradeoff is added latency on the retrieval step and additional infrastructure to maintain. For Claude tools that access thousands of integrations, tool search is not optional, it’s the only viable architecture.

Prompt caching, amortizing stable tool definitions

Prompt caching lets you mark a portion of your prompt as cacheable. Anthropic stores that cache for up to 5 minutes (extendable with refreshes). On subsequent requests that hit the cache, you pay a fraction of the normal input token cost, roughly 10% on cache hits vs. 100% on misses. Tool definitions that don’t change between calls are ideal cache targets. System prompts with stable business logic, tool schemas for fixed integrations, and reference data injected into context all benefit from caching. The implementation requires structuring your prompt so the cacheable portion appears at the end of the static content, before the dynamic conversation turns.

Context editing, pruning stale tool results

Context editing means programmatically removing tool call results from conversation history once they’ve served their purpose. A tool result that fetched order data for step 3 of a 12-step workflow is dead weight by step 8. Claude has already used it to reason forward. Keeping it wastes tokens. Context editing requires you to track which results are still load-bearing for the agent’s current reasoning state. This is more work than server-side compaction, but it gives you precise control over what stays in context. For high-volume or long-running agents, the token savings justify the engineering investment.

Server-side compaction, when to use it and when not to

Claude’s server-side compaction automatically summarizes older portions of conversation history to free up context space. It requires no implementation work on your side. The downside: compaction is a lossy operation. It summarizes rather than preserves, which means specific data, exact field values, prior decisions, intermediate tool results, can be abstracted away. Use server-side compaction for general conversational agents where fidelity of every prior turn is not critical. Avoid it for agents where the model needs to reference exact tool outputs from earlier in the session. For those, context editing is the right tool.

Designing Tool Responses That Don’t Bloat the Window

Return only high-signal fields

When your tool fetches data from an external API, don’t pass the raw response back to Claude. A CRM API response for a single contact might return 40 fields, billing address, internal IDs, audit timestamps, deprecated fields kept for legacy compatibility. Claude needs five of them. Strip the rest before the response enters context. This requires a thin transformation layer between your external API and the Claude tool result. The engineering effort is low. The token savings across a session with multiple tool calls are substantial.

Semantic identifiers over opaque references

When a tool result needs to reference an entity that Claude will interact with later, a document, a ticket, a user, use a short semantic identifier rather than a raw UUID or internal numeric ID. ticket_4829 is 12 tokens. t4829 is 4. At scale, across many tool results in a long session, the difference adds up. More importantly, semantic identifiers give Claude better context for reasoning. A model reasoning about ticket_status_open_4829 has more signal than one reasoning about 8f3a-9b12-0047.

Real-World Architecture: What This Looks Like in Practice

Consider a Claude-powered internal support agent built for a mid-sized e-commerce operation. Initial build: 22 tools, each with detailed descriptions, full response payloads returned to context, no caching. Demo looked fine. In production, agents handling complex queries with 8+ tool calls started failing, missing context from earlier in the session, returning incomplete answers on order history queries.

The rebuild took three days. Tool definitions were consolidated from 22 to 14 using the action parameter pattern. Descriptions were trimmed to functional minimums. Tool responses were filtered to return only the 4–6 fields Claude actually needed per tool. Prompt caching was implemented on the system prompt and stable tool schemas. Context editing was added to prune fulfilled tool results after two turns. The result: token consumption per session dropped by 60%. Agent failure rate on multi-step order history queries dropped from roughly 1-in-4 sessions to near zero on clean inputs. Cost per 1,000 sessions fell enough to matter at production volume. These gains depend on inputs being structured, agents handling freeform or ambiguous user queries still require additional guardrails.

That’s the kind of architecture work that goes into a production-grade Claude tool. If you want to understand what this looks like scoped to your operation, see designodin.com/ai.

Frequently Asked Questions

How many tokens do Claude tool definitions typically consume?

A single well-written tool definition runs 150–300 tokens. A verbose one, with lengthy descriptions, usage examples, and detailed property annotations, can hit 400–600 tokens. At 20 tools, that’s 3,000–12,000 tokens consumed before the first user message. The fix is trimming definitions to what Claude needs to call the tool correctly, not what a human developer needs to understand it.

What is Claude’s context window limit in 2026?

The standard Claude context window is 200,000 tokens across Claude 3 and Claude 3.5 model families. A 1M token beta window is available for usage tier 4 organizations via the context-1m-2025-08-07 beta header. Most SMBs building on the Claude API are not at tier 4, design for 200K and treat the 1M window as a future option, not a default assumption.

Can I use prompt caching with custom tool definitions?

Yes. Tool definitions are ideal candidates for prompt caching because they’re stable, they don’t change between requests. Structure your prompt so the system prompt and tool definitions appear before the cache breakpoint. Subsequent requests that hit the cache pay roughly 10% of the normal input token cost for that portion. For high-volume tools with large, stable schemas, prompt caching reduces input token costs on those portions by roughly 90% per cache hit, the most direct cost lever available before you start redesigning schemas.

What’s the difference between server-side compaction and context editing?

Server-side compaction is automatic, Anthropic’s infrastructure summarizes older conversation turns to free up space. It’s easy to implement (effectively zero effort) but lossy. Context editing is manual, you explicitly remove specific messages or tool results from the conversation history you send to the API. It requires more engineering work but preserves the data you choose to keep and removes exactly what you want gone. Use compaction for general agents; use context editing for precision workflows where specific tool outputs must remain accurate in context.

When should I use tool search instead of sending all tool definitions upfront?

Use tool search when your agent has more than 10–15 tools, when tool sets vary significantly by user or workflow, or when you’re building an integration-heavy agent that could grow to 50+ tools. Sending all definitions upfront is fine for small, stable tool sets (under 10 tools, definitions under 2,000 tokens total). Beyond that threshold, the upfront token cost of loading unused tool definitions outweighs the added latency of on-demand tool retrieval.

Should tool response filtering happen inside the tool function or in Claude’s prompt?

Filter inside the tool function, before the response enters context. Asking Claude to ignore irrelevant fields in its prompt doesn’t save tokens; those fields are already in the window. The transformation layer between your external API and the Claude tool result is the right place to strip noise. Keep that layer thin, deterministic, and fast. Claude’s job is to reason about the filtered output, not to parse raw API responses.

Building a Claude-powered tool that works in demos is straightforward. Building one that holds up at production volume, across long sessions, with real cost discipline, that’s the harder engineering problem. If you’re hitting context failures or carrying unexpectedly high API costs, the issue is almost always architectural. If you want to talk through what this looks like for your operation, start a conversation.