OpenClaw Context Window Explained: How to Maximize Memory and Minimize Costs

Most OpenClaw operators have no idea what is eating their context window. They fire up an agent, load the same memory file they have been using for months, and watch their costs creep upward without understanding why. The answer is hiding in plain sight: every token that loads into every request costs real money, and most of those tokens are dead weight.

This article walks through every component that fills the context window, explains why it matters for both cost and model quality, and gives you specific, actionable strategies to keep memory high and costs low.

What Goes Into the Context Window: A Complete Breakdown

The context window is the total amount of text (measured in tokens) that a model can process in a single request. Think of it as the model’s working desk. Everything on that desk — your instructions, your background files, the conversation history, and the latest tool results — must fit within the context window. When the desk is full, older items get shoved off.

Here is what occupies the space in a typical OpenClaw request:

  • System prompt. The core instructions that define the agent. Usually 500 to 2,000 tokens.
  • SOUL.md. The agent’s identity and role definition. Typically 2,000 to 8,000 tokens depending on complexity.
  • AGENTS.md. Operational protocols for the agent. Usually 1,000 to 5,000 tokens.
  • MEMORY.md. The continuous memory store. This is the variable that grows without bound. A well-maintained MEMORY.md might be 2,000 tokens. A neglected one can reach 10,000, 30,000, or even 50,000 tokens.
  • Project context files. Any USER.md, IDENTITY.md, TOOLS.md, and other workspace files that get injected. Combined, these range from 500 to 4,000 tokens.
  • Skill definitions. Metadata about the agent’s available skills, including brief descriptions. Usually 1,000 to 3,000 tokens.
  • Conversation history. Previous turns in the current session. This accumulates until compaction kicks in. A long-running session can accumulate 20,000 to 80,000 tokens of history.
  • Tool results. The output from each tool call during a turn. A single web fetch can return 5,000 tokens. A file read can return 3,000 tokens. A code execution can return 10,000+ tokens.
  • Current user message. The latest input, typically 100 to 2,000 tokens.

Add those up and a single request in a moderately complex session can easily consume 40,000 to 100,000 tokens of context. On a model with a 200,000-token context window, the heavier end of that range leaves only about half the desk free for the model’s actual reasoning work.
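
As a rough illustration of how quickly that desk fills, here is a sketch in TypeScript. The component names and token counts are mid-range assumptions taken from the list above, not measured values.

const contextWindow = 200_000; // total window for a 200K-token model

// Illustrative mid-range token counts for each component listed above
const components = {
  systemPrompt: 1_200,
  soulMd: 5_000,
  agentsMd: 3_000,
  memoryMd: 30_000,        // a neglected MEMORY.md
  projectFiles: 2_000,
  skillDefinitions: 2_000,
  conversationHistory: 50_000,
  toolResults: 15_000,
  userMessage: 1_000,
};

const used = Object.values(components).reduce((sum, n) => sum + n, 0);
console.log(`Used: ${used} tokens; left for reasoning: ${contextWindow - used}`);
// Used: 109200 tokens; left for reasoning: 90800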

Why Context Window Size Affects Quality, Not Just Cost

The “lost in the middle” problem is well documented in LLM research. When information is buried deep in a long context, models are significantly less reliable at retrieving and reasoning about it. Information at the beginning and end of the context window gets the most attention. The middle becomes a blind spot.

This has real consequences for your agents. Your SOUL.md instructions at the start guide behavior well. The latest user message at the end is processed well. But a critical instruction you stored in MEMORY.md three months ago, sitting at token position 62,000? The model might skim right past it.

A 2024 study from Liu et al. showed that model performance on context-dependent tasks drops by as much as 40 percent when relevant information is placed in the middle of a long context versus the beginning or end. In practice, that means your agent might forget a key constraint, miss a user preference you explicitly stored, or fail to recall a revenue rule you set weeks ago.

Quality degradation from bloated context is insidious because it does not produce an error message. The agent simply performs worse, and you attribute it to model capability rather than context management.

The Hidden Cost: How MEMORY.md Silently Drains Your Budget

MEMORY.md is the single largest hidden cost center in any OpenClaw deployment. It loads on every single request, every turn, every time. Most operators set it once and never audit it.

Here is what that costs in real dollars.

Assume you run a Claude Sonnet 4.6 agent with a 30,000-token MEMORY.md. At $3 per million input tokens, each turn costs $0.09 just for the memory file. If your agent runs at medium frequency, say 20 turns per day, that is $1.80 per day, $54 per month, $657 per year in tokens that do nothing but sit in the background.

Now scale that to three agents. That is $1,971 per year in memory overhead alone.

If your MEMORY.md grows to 50,000 tokens — easy to do in six months of daily operation — the same calculation becomes $0.15 per turn, $3.00 per day, $90 per month, $1,095 per year per agent.
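
The same arithmetic generalizes to any memory size and turn rate. A minimal sketch, assuming only the $3-per-million-token input rate used above:

const INPUT_PRICE_PER_MTOK = 3.0; // USD, the Claude Sonnet 4.6 input rate used above

function memoryOverhead(memoryTokens: number, turnsPerDay: number) {
  const perTurn = (memoryTokens / 1_000_000) * INPUT_PRICE_PER_MTOK;
  return {
    perTurn,
    perDay: perTurn * turnsPerDay,
    perMonth: perTurn * turnsPerDay * 30,
    perYear: perTurn * turnsPerDay * 365,
  };
}

console.log(memoryOverhead(30_000, 20)); // ≈ { perTurn: 0.09, perDay: 1.8, perMonth: 54, perYear: 657 }
console.log(memoryOverhead(50_000, 20)); // ≈ { perTurn: 0.15, perDay: 3, perMonth: 90, perYear: 1095 }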

And that is before you account for the conversation history, tool results, and other overhead that compound alongside it.

The obvious fix: keep MEMORY.md under 5,000 tokens for any agent that runs more than 10 turns per day. Below 2,000 tokens is ideal. If you need more persistent memory, use semantic retrieval instead of loading everything into context.

The Cache Boundary Trick: 40-60% Cost Reduction on Anthropic Models

OpenClaw has a little-documented feature that can slash your input costs on Anthropic models: the cache boundary comment.

Place this exact comment in your MEMORY.md file:

<!-- OPENCLAW_CACHE_BOUNDARY -->

Everything above the comment is treated as static, cacheable content. Everything below it is dynamic. On Anthropic models that support prompt caching (Claude Sonnet 4.6, Claude Opus 4.x, and Claude Haiku 3.5+), the content above the boundary is cached by Anthropic’s infrastructure and served as a cache hit, billed at the reduced cache-read rate, on subsequent requests.

The practical effect: if your MEMORY.md is 30,000 tokens but the part that actually changes is only the last 2,000 tokens (recent session notes), you can mark the upper 28,000 tokens as cacheable. Anthropic charges about 10 percent of the input token rate for cache reads, turning that $0.09-per-turn memory cost into roughly $0.01 for the cached portion plus $0.006 for the dynamic portion — a total of $0.016 per turn instead of $0.09.

Over 600 turns per month (20 per day), that shifts from $54 to about $9.60. A reduction of roughly 82 percent on memory cost specifically, and 40 to 60 percent on total input cost depending on your conversation-to-memory ratio.

To implement it, find the dividing line between your stable memory content (identity, rules, long-term facts) and your volatile content (recent session logs, daily notes, transient observations), and place the comment at that point. Re-test to confirm your agent still performs correctly.
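
A sketch of what the resulting file layout might look like. The section headings and contents are placeholders; only the boundary comment itself is the exact string OpenClaw looks for.

## Identity and standing rules
(stable: role definition, hard constraints, user preferences)
## Long-term facts
(stable: accounts, recurring contacts, key decisions)
<!-- OPENCLAW_CACHE_BOUNDARY -->
## Recent session notes
(volatile: daily logs, transient observations, items awaiting follow-up)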

Compaction: Trading History for Cost Efficiency

Conversation history is the second biggest context consumer. Every turn adds the assistant’s response and any tool results to the accumulated history. After 50 turns in a complex session, you can easily have 80,000 tokens of history pushing up against the context limit.

OpenClaw handles this through compaction. When conversation history exceeds a configurable threshold, the system compresses the oldest history into a summary, discarding the raw turn-by-turn detail in favor of a condensed version.

You configure this in openclaw.json under the agent’s config:

{
  "agents": {
    "my-agent": {
      "config": {
        "historySize": 100,
        "compactAt": 50
      }
    }
  }
}

The tradeoff is straightforward: earlier compaction means lower cost per turn but less detailed access to past interactions. If your agent needs to reference a specific user instruction from 40 turns ago, a compacted summary may not retain that detail.

For high-frequency agents (personal assistant, alert triage, chat support), set compactAt aggressively — 20 to 30 turns. The cost savings are substantial and the loss of historical detail rarely matters for these use cases. For research agents or debugging sessions, set it higher — 80 to 100 turns — or disable it entirely if the session is short-lived.
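
For instance, the two profiles might sit side by side in openclaw.json like this. The agent names and historySize values are illustrative; only the keys shown above are assumed.

{
  "agents": {
    "assistant": {
      "config": {
        "historySize": 60,
        "compactAt": 20
      }
    },
    "research": {
      "config": {
        "historySize": 150,
        "compactAt": 100
      }
    }
  }
}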

LanceDB vs. MEMORY.md: When to Switch

OpenClaw supports LanceDB as a semantic vector store for memory. Instead of loading your entire memory file into every request, the agent queries LanceDB for relevant memory chunks using semantic similarity. This is the difference between carrying your entire filing cabinet everywhere versus looking up only the files you need.

When should you switch?

  • Use MEMORY.md when your agent has fewer than 200 discrete memory entries, or when every turn needs access to the full memory context (for example, an agent that enforces global rules on every interaction).
  • Use LanceDB when your agent accumulates more than 200 memory entries, when memory grows faster than you can prune it, or when you are running multiple agents that could share a common knowledge base.

The cost equation is favorable for LanceDB at scale. A LanceDB lookup might cost 500 tokens per query (the query plus a few retrieved chunks), versus 30,000 tokens to load a bloated MEMORY.md. Even with the additional latency of a vector search, the per-turn savings cross the break-even point rapidly.
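
A quick sketch of that comparison, using the illustrative token counts above and the Sonnet 4.6 input rate (it ignores the one-time cost of embedding memory into LanceDB):

const PRICE_PER_MTOK = 3.0;     // Sonnet 4.6 input rate
const memoryMdTokens = 30_000;  // bloated MEMORY.md loaded every turn
const lanceDbTokens = 500;      // query plus a few retrieved chunks

const savedPerTurn = ((memoryMdTokens - lanceDbTokens) / 1_000_000) * PRICE_PER_MTOK;
console.log(savedPerTurn.toFixed(4));             // ≈ 0.0885 USD saved per turn
console.log((savedPerTurn * 20 * 30).toFixed(2)); // ≈ 53.10 USD saved per month at 20 turns/day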

One caveat: semantic retrieval is not perfect. The vector search can miss a critical fact if the query embedding does not align well with how that fact was stored. For agents where reliable recall is paramount, keep critical rules in MEMORY.md and migrate voluminous but lower-priority data to LanceDB.

lightContext for Subagents: Cutting Bootstrap Overhead

When OpenClaw spawns a subagent (for parallel work, long-running tasks, or specialized functions), the subagent inherits the full context of the parent session by default. That includes the entire MEMORY.md, all workspace files, and potentially tens of thousands of tokens of conversation history that the subagent does not need.

The lightContext parameter in sessions_spawn lets you strip this overhead:

sessions_spawn({
  label: "quick-lookup",
  prompt: "Find the Discord invite link from last week's notes",
  lightContext: true
})

With lightContext: true, the subagent starts with only its system prompt and the spawning message. It does not load MEMORY.md, workspace files, or parent conversation history. For a quick lookup or a high-volume task, this can reduce subagent cost by 70 to 90 percent.

Use lightContext for any subagent that runs more than a few turns, does not need full memory access, or performs the same task repeatedly. Reserve full context only for subagents that genuinely need it, such as those that perform complex multi-step research dependent on the parent session’s history.
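
For instance, a sketch of the same call with and without the flag. The labels, prompts, and file name are placeholders.

// Cheap, repeatable task: needs no memory or parent history
sessions_spawn({
  label: "nightly-link-check",
  prompt: "Verify that every URL in links.txt still resolves",
  lightContext: true
})

// History-dependent task: omit lightContext so it inherits the full parent context
sessions_spawn({
  label: "evidence-synthesis",
  prompt: "Summarize the evidence gathered earlier in this session into a short brief"
})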

Model Context Window Comparison (April 2026)

| Model | Context Window | Input Cost per MTok | Cache Read Cost per MTok |
|---|---|---|---|
| Claude Sonnet 4.6 | 200K tokens | $3.00 | $0.30 |
| Claude Opus 4.5 | 200K tokens | $15.00 | $1.50 |
| Claude Haiku 3.6 | 200K tokens | $0.25 | $0.025 |
| GPT-4o | 128K tokens | $2.50 | $1.25 |
| GPT-4o Mini | 128K tokens | $0.15 | $0.075 |
| DeepSeek V3 | 128K tokens | $1.00 | N/A |
| Llama 3.3 70B | 128K tokens | $0.59 | N/A |
| Gemini 2.0 Flash | 1M tokens | $0.10 | $0.025 |

Pricing shown is approximate as of April 2026. Cache pricing varies by provider and may require minimum cache sizes or TTL commitments. Gemini’s 1M token context is not available on every deployment configuration and is limited to Google’s platform.

The Optimal Configuration for Different Agent Types

There is no single right answer for context window configuration. The optimal setup depends entirely on what your agent does.

High-frequency personal assistant (50+ turns/day). Keep MEMORY.md under 2,000 tokens. Enable the cache boundary aggressively — put everything except recent session notes above the line. Set compactAt to 20 turns. Use LanceDB for long-term memory. Use lightContext for all subagent spawns. This agent type is the most sensitive to per-turn overhead because it runs hundreds of times per week. Even small per-turn savings compound rapidly.

Research agent (5-15 turns/day, deep sessions). Keep MEMORY.md at 5,000 to 8,000 tokens with structured sections. Use the cache boundary to cache stable identity and methodology sections. Set compactAt to 80 turns or higher. Avoid LanceDB for this type — research agents need the full context to weigh evidence across sources. Consider using a larger model (Claude Opus, GPT-4o) for the reasoning quality and accept the higher per-token cost.

Cron job / scheduled agent (1-3 turns/day). Keep MEMORY.md up to 10,000 tokens if needed; the low turn count means per-turn savings matter less. The cache boundary still helps on providers that support prompt caching. Compaction is mostly irrelevant since sessions are short. Use the cheapest adequate model (Claude Haiku, GPT-4o Mini, DeepSeek V3): with so few runs, what matters is a low absolute cost per request, not shaving per-turn overhead.

Multi-agent system (10+ agents, automated workflows). This is where context costs compound fastest. Standardize on MEMORY.md hygiene across all agents — set a 3,000-token hard limit. Enable cache boundaries on every agent. Route all subagent spawns through lightContext by default. Consider centralizing shared knowledge in LanceDB so each agent does not carry its own heavyweight memory file. Audit total context cost weekly, not monthly. At scale, a 10-agent system burning $0.09 per turn on memory per agent at 20 turns per day equals $18 per day, $540 per month, solely on memory overhead.
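
A minimal sketch of that weekly audit, assuming you can export each agent’s current MEMORY.md token count. The agent names, counts, and rates are illustrative.

const PRICE_PER_MTOK = 3.0;   // input rate for the fleet's default model
const TURNS_PER_DAY = 20;
const HARD_LIMIT = 3_000;     // per-agent MEMORY.md token budget

// Illustrative per-agent MEMORY.md sizes, in tokens
const fleet: Record<string, number> = {
  assistant: 1_800,
  triage: 2_400,
  research: 7_500,
};

for (const [agent, tokens] of Object.entries(fleet)) {
  const monthly = (tokens / 1_000_000) * PRICE_PER_MTOK * TURNS_PER_DAY * 30;
  const flag = tokens > HARD_LIMIT ? " <-- over the hard limit, prune" : "";
  console.log(`${agent}: ${tokens} tokens, ~$${monthly.toFixed(2)}/month${flag}`);
}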

Sources

  • Liu, N. F., et al. “Lost in the Middle: How Language Models Use Long Contexts.” Transactions of the Association for Computational Linguistics, 2024 (arXiv:2307.03172).
  • Anthropic. “Prompt Caching.” Anthropic Documentation, accessed April 2026.
  • OpenClaw. “Memory and Context Configuration.” OpenClaw Agentic Framework Documentation, 2025-2026.
  • Anthropic. “Claude Sonnet 4.6 System Card & Pricing.” April 2026.
