OpenClaw processes the entire context window on every single response. Not just what you just said. The whole thing. The default context size is set for general use and is often significantly larger than what your actual workload needs. Slow responses are almost never a model problem. They are a context size problem.
What the context window is and why it affects speed
The context window is the total amount of text your agent can hold at once: your system prompt, conversation history, files it has read, tool outputs, everything. Every time your agent responds, the model processes all of it to generate the next reply.
More context means more processing means longer wait. This is not a bug. It is how language models work. A 200,000 token context window on a long session will feel noticeably slower than a 32,000 token window doing the exact same task. The model must attend to every token in context on every turn, regardless of how much of that content is actually relevant to the current question.
The question is: how much context does your workload actually need? Focused sessions (writing, research, code review) rarely need more than 40,000 to 60,000 tokens. Long autonomous sessions with extensive tool use and file reading need more. The default is set for the high end. Most setups benefit from reducing it.
Step 1: Find out what you are currently set to
The first diagnostic step is also the easiest.
Read my openclaw.json. Tell me the value of agents.defaults.contextTokens. If it is not set there, tell me what the effective default context window is for my current model.
The number you get back is the ceiling. Your agent is not necessarily filling it on every turn, but the model still accounts for the full window size when processing. A larger ceiling means more overhead per response, even on short conversations early in a session. This overhead is why identical questions feel faster in a fresh session than mid-session: the context is smaller at the start.
Common defaults by model:
- Claude Sonnet / Opus: 200,000 tokens available; OpenClaw defaults to 100,000–200,000
- DeepSeek V3: 64,000 tokens available
- GPT-4o: 128,000 tokens available
- Local Ollama models: typically 8,000–32,000 depending on model and hardware
If your contextTokens is set to 200,000 and your sessions rarely exceed 40,000 tokens, you are carrying 160,000 tokens of overhead on every response.
Step 2: Find your actual peak usage
Look at my last several sessions. How many tokens did the context window actually reach before the session ended or compaction fired? What was the highest peak across recent sessions?
Take that peak number and add 20% headroom. That is your target ceiling. The 20% buffer exists because peak usage varies by session type and task. You want a ceiling that handles your heavy sessions without waste, not just your average sessions. If your peak is 35,000 tokens, set contextTokens to 42,000. If it is 80,000, set it to 96,000. If you run long autonomous overnight sessions with heavy tool use, stay closer to the model’s maximum. Most conversational setups can come down significantly from the default.
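The rule above is small enough to sketch as a helper. This is illustrative arithmetic, not part of OpenClaw:

```python
def target_ceiling(peak_tokens, headroom=0.20):
    """Observed peak usage plus a safety buffer, rounded to the nearest 1,000."""
    return round(peak_tokens * (1 + headroom) / 1000) * 1000

print(target_ceiling(35_000))  # 42000, matching the example above
print(target_ceiling(80_000))  # 96000
```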
Step 3: Apply the change
Update agents.defaults.contextTokens in openclaw.json to [your target value]. Show me the proposed change before writing it.
Alternatively, edit ~/.openclaw/openclaw.json directly. Find or add the agents.defaults.contextTokens key. Set it to your target value (as a number, not a string): "contextTokens": 60000. Save the file.
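Assuming the nesting implied by the agents.defaults.contextTokens key path, the edited fragment of openclaw.json would look like this (a minimal hypothetical excerpt; your file will contain other keys alongside it):

```json
{
  "agents": {
    "defaults": {
      "contextTokens": 60000
    }
  }
}
```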
Critical: Changes to contextTokens require a fresh session to take effect. This is one of the most common points of confusion in OpenClaw configuration. OpenClaw caches the context window value in the session entry on first response. If you update the config and keep using the same session, the old value stays in effect. Start a /new session after applying the change, then verify:
Check my current context window ceiling. What is agents.defaults.contextTokens set to right now? Does the active session reflect the new value?
If the session still shows the old value, it was started before the config change. Start another fresh session and check again.
Step 4: Enable streaming if it is off
Streaming does not reduce total response time. It changes when you start seeing output. With streaming on, text appears as it generates. Without it, you wait for the full response before anything displays. If streaming was disabled without your knowledge, enabling it makes your agent feel dramatically faster even though the underlying model performance is unchanged.
Check whether streaming is enabled in my openclaw.json. If it is off, tell me the exact config change to enable it and apply it.
In openclaw.json, look for a streaming key under agents.defaults or at the top level. If it is set to false, change it to true. If the key is absent, streaming is on by default in most OpenClaw versions.
Why sessions get slower over time
You notice this pattern: the first few turns in a session are fast. By turn 15, each response takes noticeably longer. This is the context accumulation effect.
Every turn adds to the context. Your question, the agent’s response, any tool calls and their outputs: all of it stays in context until compaction collapses it. By turn 15, the model is processing 10x or 15x more tokens than it processed on turn 1. The ceiling has not changed, but the actual context in use is growing toward it.
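The accumulation effect is plain arithmetic. A sketch with assumed per-turn sizes (the 4,000 and 4,500 figures are illustrative, not measurements):

```python
def context_at_turn(turn, system_prompt=4_000, tokens_per_turn=4_500):
    """Rough context in use: a fixed system prompt plus accumulated turns.

    tokens_per_turn bundles the question, the reply, and tool output.
    Both defaults are illustrative assumptions.
    """
    return system_prompt + turn * tokens_per_turn

print(context_at_turn(1))   # 8500 tokens at turn 1
print(context_at_turn(15))  # 71500 tokens at turn 15; the conversation
                            # portion alone has grown exactly 15x
```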
Two things slow this down:
Compaction fires late or not at all
Compaction is designed to collapse older conversation turns into summaries, keeping the active context manageable. If compaction is set to fire too late (high threshold), or if it is compressing too little when it does fire (high retain setting), context grows unchecked. The context window fills toward the ceiling and every subsequent response is slower. Check your compaction settings alongside your context window ceiling.
Large tool outputs staying in context
If your agent regularly reads large files, runs searches that return long results, or makes API calls with large responses, those outputs stay in context until compaction collapses them. A single large file read can add 5,000 to 20,000 tokens to the active context. Across a session with many tool calls, this accumulates fast. The fix is not to avoid tool use. It is to ensure compaction is tuned to handle it.
Read my openclaw.json. Show me the compaction settings: what threshold triggers compaction, how much context is retained after compaction, and how many turns are preserved. Are these settings appropriate for a session that uses frequent tool calls?
How to measure whether your change worked
Before changing the context window, note your current response time for a typical mid-session turn (turn 10 or later). After making the change and starting a fresh session, measure the same thing. The comparison should be direct.
A rough benchmark for measuring without tooling: ask your agent a question that requires a medium-length response (a paragraph or two), at turn 12 or later in a session. Time how long until the response completes. Do this before and after the context window change. A reduction from 200k to 60k on the same model and workload will produce a measurable difference in that second half of a session.
We are now at turn 10 or later in this session. What is the current token count of this context? How much of my contextTokens ceiling has been used? Based on that, am I likely to see a speed difference if I reduce the ceiling to [target value]?
What to do if it is still slow after both changes
Context size and streaming are the two most common causes of perceived slowness. If response time is still a problem after addressing both, the bottleneck is elsewhere.
The model is slow for the tier
Different models have different throughput at the same provider. If you are on a shared inference tier and the provider is under load, responses slow across all their users. This shows up as uniform slowness regardless of context size. The first turn of a fresh session is just as slow as turn 15. Check your provider’s status page. If it is not a provider issue, consider routing tasks to a faster or cheaper model.
Rate limiting causing queued responses
If you are hitting your provider’s rate limits, requests queue and the wait is not model processing time. It is queue time. This often shows up intermittently: some turns are fast, others have an unusual pause before the first token appears. Check whether you are on a tier with tight requests-per-minute limits. The same queuing pattern appears with a local Ollama model that is simultaneously handling extraction and embedding for the memory pipeline: those requests line up behind your conversation.
Multiple concurrent sessions competing for resources
If you have cron jobs running, subagents active, or multiple OpenClaw sessions open, they may be competing for the same API rate limit or the same local Ollama resources. A cron job that fires every 5 minutes uses some of the same capacity as your active session. Check what else is running when you notice slowness.
Are there any cron jobs or scheduled tasks currently active? Are there any subagent sessions running? When was the last time a background task fired? I want to understand if something else is competing for model capacity when I notice slowness.
Context window sizing by workload type
There is no universal right answer, but these ranges are reasonable defaults based on common workload patterns. Treat them as starting points, not rules:
- Focused Q&A and writing sessions (no heavy tool use): 32,000 to 50,000 tokens. These sessions accumulate context slowly and compaction rarely fires before the session ends naturally.
- Research sessions with web search and file reading: 60,000 to 80,000 tokens. Tool outputs add up quickly. More headroom before compaction is needed.
- Autonomous sessions with multi-step tool chains: 80,000 to 120,000 tokens. Long tool sequences and their outputs stay in context until compaction. More room prevents mid-task compaction from disrupting state.
- Overnight autonomous sessions with extensive file I/O: 120,000 to 200,000 tokens. These sessions need the full window. Do not restrict them.
One thing these ranges do not account for: your system prompt and memory injection together set a floor below which contextTokens cannot be reduced without breaking the session. If your system prompt is 10,000 tokens and autoRecall injects 2,000 tokens per turn, your effective conversation space at a 40,000 ceiling is only 28,000 tokens. That is fine for most conversational sessions. Just do not go below the sum of your system prompt plus expected memory injection volume.
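The floor check is one subtraction; the figures mirror the example above:

```python
def conversation_space(ceiling, system_prompt_tokens, memory_tokens_per_turn):
    """Tokens left for actual conversation after the fixed per-turn costs."""
    return ceiling - system_prompt_tokens - memory_tokens_per_turn

print(conversation_space(40_000, 10_000, 2_000))  # 28000, as in the example above
```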
Start at the lower end of the appropriate range. Watch for compaction firing at unexpected points in your workflow. If it interrupts something mid-task, raise the ceiling. If it never fires and sessions always end before approaching the ceiling, lower it further. The goal is a ceiling that is just large enough for your heaviest sessions, with compaction handling anything that goes longer.
What actually fills the context window fastest
Understanding what consumes context helps you make smarter decisions about sizing. The context window is not filled evenly. A few sources dominate.
Your system prompt
SOUL.md, AGENTS.md, USER.md, TOOLS.md, and any other files loaded at startup are all part of the context on every single turn. If your combined instruction files total 8,000 tokens, that 8,000 is present at turn 1 and turn 100. It does not grow, but it is a fixed cost that sets your baseline. A large system prompt means you start the session already using a significant portion of a small context window. Before reducing contextTokens aggressively, check how large your system prompt actually is.
How many tokens is my current system prompt? Include all loaded instruction files (SOUL.md, AGENTS.md, USER.md, TOOLS.md, and anything else loaded at startup). Give me the total and a breakdown by file.
If your system prompt is 12,000 tokens and you set contextTokens to 20,000, you have only 8,000 tokens left for the actual conversation. Compaction fires almost immediately and the session feels broken. The context window must be meaningfully larger than the system prompt alone.
Tool outputs
Every tool call result lands in context. A web search returning 5 results at 500 tokens each adds 2,500 tokens. A file read of a 3,000-line file can add 6,000 to 10,000 tokens. Multiple tool calls in a single turn stack. In a research session with 10 web searches, you have added 25,000 tokens of search results to the context before the conversation content itself is even counted. Tool outputs are the fastest context filler in most active sessions.
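The arithmetic in this section can be sketched directly. All per-call sizes are illustrative figures in the spirit of the paragraph above, not measurements:

```python
def tool_output_tokens(searches=10, results_per_search=5, tokens_per_result=500,
                       file_reads=3, tokens_per_read=8_000):
    """Rough tokens that tool calls alone add to context in one session."""
    return (searches * results_per_search * tokens_per_result
            + file_reads * tokens_per_read)

print(tool_output_tokens())  # 49000 tokens before any conversation is counted
```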
Memory injection
If autoRecall is enabled, recalled memories are injected into context on each turn. A healthy memory store with 500 to 1,000 stored memories will surface 5 to 15 memories per turn depending on the retrieval settings. At 200 tokens per memory, that is 1,000 to 3,000 tokens of injected context per turn that did not exist when your system prompt was written. If you have not accounted for memory injection when sizing the context window, your sessions will fill faster than you expect.
Compaction summaries
Compaction replaces older turns with summaries. The summaries are shorter than the original content but not zero. After compaction fires, you gain headroom, but the summary content remains in context. In a long session with multiple compaction cycles, you accumulate layers of summaries. Check the “retained” setting in your compaction config: it controls how much context is kept after each compaction run.
In the current session, what is using the most context tokens? Break it down: system prompt, conversation turns, tool outputs, and any memory injections. Show me which category is largest.
The relationship between context window and compaction threshold
These two settings are coupled. Changing one without considering the other produces unexpected behavior.
The compaction threshold is the percentage of the context ceiling at which compaction fires. If contextTokens is 100,000 and the compaction threshold is 75%, compaction fires at 75,000 tokens. If you reduce contextTokens to 50,000 with the same threshold, compaction now fires at 37,500 tokens.
In absolute terms, compaction fires twice as frequently. Whether this is a problem depends on your workload. For a conversational session that accumulates context slowly, compaction firing at 37,500 tokens is fine. It fires before the session ends rather than during it. For a tool-heavy autonomous session that regularly uses 30,000 to 40,000 tokens per task, compaction fires mid-task, which can disrupt state.
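The coupling is a single multiplication; a sketch:

```python
def compaction_fire_point(context_tokens, threshold_pct):
    """Absolute token count at which compaction fires."""
    return context_tokens * threshold_pct // 100

print(compaction_fire_point(100_000, 75))  # 75000
print(compaction_fire_point(50_000, 75))   # 37500
```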
Read my openclaw.json compaction config. What is the threshold percentage and the retain setting? Given a contextTokens of [your target], at what token count does compaction fire? Is that high enough for a session that does [describe your typical workload]?
The safe approach: reduce the context window, then run a few normal sessions and watch for compaction firing unexpectedly. “Unexpectedly” here means mid-task, when you are in the middle of a multi-step operation and compaction collapses state you still need. If compaction fires between tasks or at the end of a long session, it is working as intended. If it interrupts a task, either raise the window slightly or lower the threshold percentage to give more headroom before compaction triggers.
Per-agent context windows
If you run multiple agents (a main conversational agent, research subagents, autonomous cron workers), each has different context requirements. Setting a single global contextTokens value forces all of them to the same ceiling, which means either the cron workers are over-provisioned or the research agent is under-provisioned.
You can set contextTokens per agent in the agent configuration. This requires knowing which agents exist in your setup and what their typical usage looks like. But even a rough segmentation is better than one global value:
- Main conversational agent: set based on your typical session length and system prompt size. 40,000 to 80,000 is reasonable for most setups.
- Research subagents: higher ceiling, since they handle more tool output accumulation. 80,000 to 120,000.
- Cron workers for simple tasks (morning brief, queue check): small ceiling. These tasks complete in 2 to 5 turns with minimal tool use. 20,000 to 30,000 is often sufficient and reduces cost on every cron fire.
List all the agents configured in my openclaw.json. For each one, tell me what contextTokens is set to (or whether it inherits the default). Based on what I have told you about my workload, suggest per-agent values that would reduce overhead without causing context-related failures.
How to verify the change took effect in a new session
This is a step most people skip, and then wonder why performance has not changed. The config change and the session that applies it are separate events.
After changing contextTokens and starting a fresh session:
- Run /status in the new session. The context window display should show the new value (e.g., 0/60k if you set it to 60,000).
- If it still shows the old value, the session was started before the config was saved, or the config was not written correctly. Check the file and start another fresh session.
- Never verify by asking the agent what contextTokens is: it reads the value from the config file and reports it correctly even if the current session is running with the old cached value from before your change. The /status command shows the actual live session value.
What is the context window showing in the current session status? Does it match the contextTokens value in openclaw.json, or does it show a different number? If they do not match, what does that indicate?
Remember that context window changes always require a fresh session (/new) to take effect. OpenClaw caches contextTokens in the session entry on first response. No config patch or gateway restart updates a running session. This is by design. Always start a new session to verify context-related config changes.
Adjusting context window on Ollama models
Local Ollama models have context limits determined by two things: the model’s architecture and the hardware memory available. These limits are independent of what you set in openclaw.json.
Most 7B and 8B parameter models support 8,000 to 32,000 tokens depending on the version. The phi4 model supports up to 16,000 tokens by default. llama3.1:8b supports 128,000 tokens with sufficient memory. Setting contextTokens higher than the model can actually handle causes the model to use its hardware-limited maximum, not the value you specified.
To check a local model’s actual context limit:
ollama show llama3.1:8b
# Look for: context length or num_ctx in the output
You can also increase a local model’s context window by creating a custom Modelfile:
FROM llama3.1:8b
PARAMETER num_ctx 32768
Then build and use the extended model with a recognizable name: ollama create llama3.1-32k -f Modelfile. This requires enough hardware memory. Each additional 1,000 context tokens consumes extra memory for the attention key-value cache, on the order of 100MB for an 8B model at 16-bit precision. A 32,768 token context on an 8B model requires approximately 8GB of memory for the model weights plus 2-4GB for the extended context.
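The memory cost comes from the attention key-value cache, which scales linearly with context length. A sketch using the published llama3.1:8b shape (32 layers, 8 KV heads, head dimension 128; treat these as assumptions and check your model card):

```python
def kv_cache_bytes(num_ctx, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache size: keys and values, per layer, per KV head, at fp16."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * num_ctx

print(kv_cache_bytes(32_768) / 2**30)  # 4.0 GiB for a 32k context
```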
Platform-specific notes
macOS (local Ollama)
On Apple Silicon (M1/M2/M3/M4), Ollama uses the unified memory architecture, meaning CPU and GPU share the same pool. A model with a large context window competes with the OS and other applications for that pool. If you experience slowdowns when running a large context on a local model on macOS, reducing the context window is often more effective than it is on non-unified memory systems.
Linux VPS (no GPU)
On a CPU-only VPS running local Ollama models, every additional token in context adds to processing time proportionally. A 32,768 token context takes roughly twice as long to process per turn as a 16,384 token context on the same hardware. If you are running OpenClaw on a VPS with local models for cost reasons, keeping context windows tight is essential. Set contextTokens as low as your workload tolerates.
API models (Anthropic, OpenAI, DeepSeek)
For API models, context window size affects both cost and latency. Anthropic’s pricing charges per input token on every turn. A session with a 150,000 token context sends 150,000 tokens on every request, regardless of whether the model needs that much history to answer your question. The cost and latency are proportional to tokens sent. Reducing contextTokens and letting compaction manage history is both faster and cheaper for API models.
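The proportionality can be sketched with a hypothetical rate of $3 per million input tokens (an assumption for illustration; check your provider’s current price list):

```python
def per_turn_input_cost(context_tokens, usd_per_million_input):
    """Input cost of a single request that sends the full context."""
    return context_tokens / 1_000_000 * usd_per_million_input

print(per_turn_input_cost(150_000, 3.0))  # about $0.45 per turn
print(per_turn_input_cost(60_000, 3.0))   # about $0.18 per turn
```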
Common mistakes when resizing the context window
Setting it below the system prompt size
If your system prompt is 10,000 tokens and you set contextTokens to 12,000, you have only 2,000 tokens for the actual conversation and all tool outputs combined. Compaction fires almost immediately on every turn and the session becomes nearly unusable. Before reducing contextTokens, confirm your system prompt size and set a floor at least 3x larger than the system prompt.
Forgetting to start a new session after the change
The most common mistake. The config is updated, the session continues, and nothing changes. The old session is running with the cached value from session start. Always /new after a context window config change and verify with /status.
Reducing the window but not the compaction threshold
A smaller window with the same threshold percentage fires compaction earlier in absolute token terms. If you reduce from 100,000 to 40,000 tokens, compaction fires at 30,000 instead of 75,000. For workloads that fill context in large chunks (big file reads, long web search results), this causes frequent mid-task compaction. Raise the threshold percentage slightly (from 75% to 85%) when reducing the window significantly.
Not accounting for memory injection growth
As the memory store grows over weeks of use, autoRecall injects more relevant memories per turn. A context window sized appropriately for a new setup with an empty memory store may become too small after three months of accumulated memories. If you notice sessions filling faster than they used to after extended use, memory injection growth is a likely cause. Re-evaluate context sizing after each month of active use.
Reading the speed difference: before and after
One thing that helps is having a concrete baseline before you change anything. Ask your agent a fixed question at a known point in a session, note the response time, and then repeat that test after the context window change. The comparison makes the improvement tangible rather than approximate.
A good test query is a medium-complexity question that produces a 150 to 250 word response. Ask it at turn 12 of a session (enough turns to have accumulated meaningful context), time the response from submission to completion, and record it. Then change the context window, start a fresh session, run 12 equivalent turns, ask the same question, and time it again.
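A minimal stopwatch for the before/after comparison, with a stand-in where your real agent call would go:

```python
import time

def timed(call):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start

# Stand-in for the real request to your agent; substitute your own client call.
response, seconds = timed(lambda: "stand-in response")
print(f"{seconds:.3f}s")
```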
For most setups moving from a default 200k window to a 60k window on the same model and workload, the difference becomes noticeable at turns 8 and above. This is because turns 1 through 7 in a typical session have not yet accumulated enough content to approach the ceiling difference between 60k and 200k. Early in the session, the context is small enough that the ceiling does not matter. Later in the session, the effective context in use diverges significantly depending on which ceiling is in effect.
We are at turn 12 or later. Without doing anything else, write a 200-word summary of what we have discussed in this session. I want to use response time as a baseline before changing the context window.
Time that response. Then make the change, start fresh, and repeat the same test at the same turn count. If you see no difference, either the ceiling was not the bottleneck or the sessions are too short for the difference to matter at your current workload. If you see a clear difference, you have confirmed the window size was affecting performance and the change was worthwhile. Note the new baseline for future reference.
Complete fix
Brand New Claw
The complete production configuration guide. Every setting that matters, what it does, and what breaks if you leave it at default. Drop it into your agent and it audits your current config and fixes what needs fixing.
FAQ
Does reducing contextTokens affect what my agent can remember within a session?
Only if your sessions regularly use more context than the new ceiling allows. If your typical session never gets close to the new ceiling, the agent has access to the same amount of history it always did. If you set contextTokens to 40,000 and your sessions never exceed 30,000 tokens, nothing changes. If a session hits the ceiling, compaction fires to collapse older content. If compaction is not enabled, the session would hit the hard limit and no further input would be accepted. Keep compaction enabled whenever you reduce the context window.
Will changing contextTokens affect my Anthropic bill?
Not directly. contextTokens is the ceiling for what OpenClaw will send to the model. Your bill is based on the tokens actually sent, not the ceiling. If your sessions routinely send 150,000 tokens because you have a large system prompt and long history, reducing the ceiling to 60,000 will reduce input token costs, because compaction will fire earlier and collapse that history. The ceiling change and the cost change are connected through compaction behavior, not directly.
Is there a way to see the current context size without asking the agent?
In the OpenClaw dashboard (if you have it configured), there is a context usage indicator per session. In the CLI, running /status shows the current context usage. If neither is available, asking the agent directly is the simplest option: “What is the current token count of this context?”
My agent’s responses are slow even in the first turn of a fresh session. Is that a context window problem?
No. The first turn has the smallest context (just the system prompt and your first message). If the first turn is slow, the bottleneck is model throughput, provider load, or network latency. Check your provider’s status page. If the provider is healthy, try a faster or smaller model for the same task.
How does the context window setting interact with compaction?
Compaction fires when the context reaches a configured threshold (often 75 to 80 percent of the ceiling, configurable in your openclaw.json). Reducing the ceiling moves the absolute trigger point lower but keeps the percentage threshold the same. A 60,000 token ceiling with a 75% threshold fires compaction at 45,000 tokens. A 200,000 ceiling with the same threshold fires at 150,000 tokens. Both fire at the same percentage of fill. If you reduce the ceiling, compaction fires earlier in token terms, which is fine. That is the point.
Can I set different context windows for different agents?
Yes. The agents.defaults.contextTokens setting is the default for all agents. You can override it per agent in the agent configuration. An autonomous cron worker that reads large files benefits from a larger window. A focused conversational agent can use a smaller one. Setting them independently prevents the cron worker’s requirements from inflating the conversational agent’s context overhead.
What happens if I set contextTokens higher than the model’s actual maximum?
OpenClaw sends that value to the provider’s API. The provider ignores values above the model’s hard maximum and uses the model’s actual limit. So setting it too high does not break anything, but it also does not expand the window beyond what the model supports. The practical effect is the same as leaving it at the model’s maximum.
I reduced contextTokens and compaction is now firing mid-task and disrupting my workflow. What should I do?
Raise the ceiling slightly, or adjust the compaction threshold so it fires later in the fill cycle. A compaction threshold of 85% instead of 75% gives more room before compaction interrupts. Alternatively, look at what is filling context fastest. Large tool outputs are the most common culprit. Limiting the size of individual tool responses keeps the context from filling too quickly against a tighter ceiling.
Does contextTokens affect local Ollama models differently than API models?
Yes, in one important way. Local Ollama models are limited by the hardware they run on. A model configured with a 32,000 token context window on hardware with 8GB of VRAM cannot be expanded beyond that hardware limit. Setting contextTokens higher than the local model’s configured limit causes the model to either ignore the excess or return an error. For local models, check the model’s actual context limit with ollama show modelname before setting contextTokens.
Go deeper
The compaction settings that bite you later
Compaction fires too early, too late, or not at all. Here is how to tune it so it works with your workload instead of against it.
What actually fills your OpenClaw context window
System prompts, tool outputs, memory injections: understanding what takes up space helps you decide what to cut.
How to choose a model based on your actual workload
If the context window change did not fix the slowness, the bottleneck is the model. Here is how to pick one that matches what you actually do.
