You turned on prompt caching and your costs barely moved. Prompt caching does not automatically cache everything in your context. It only caches what is identical between calls, and only if that content is long enough to cross the minimum threshold. This article shows you how to check your current cache hit rate, why most setups get poor results by default, and the one structural change that fixes the majority of cache miss problems.
TL;DR
Poor cache hit rates almost always come from one of three causes: (1) dynamic content appears too early in your context and breaks the cache before the stable sections are reached; (2) your stable sections are too short to cross the minimum caching threshold; (3) your caching mode is set to full on a setup whose context changes frequently mid-session. The fix for the first is structural and takes about five minutes: move anything that changes between turns to the bottom of your context. The fix for the second is adding more stable content at the top. The fix for the third is switching to short mode.
Every indented block in this article is a command you can paste directly into your OpenClaw chat. Your agent will run it and report back. You do not need to open a terminal, edit any files manually, or navigate any filesystem.
What prompt caching actually does
Every time your OpenClaw agent responds, it sends your entire context to the AI model. The context includes your system prompt, any workspace files that load at session start, the conversation history so far, and the definitions for any tools the agent can use. On a busy session, this can be tens of thousands of tokens. The model processes all of it on every turn to generate each response.
Prompt caching is a provider feature that lets the model skip reprocessing parts of the context it has already seen. If a large section of the context is byte-for-byte identical to the previous call, the model uses the processed version it stored from the last call instead of processing it again. You pay a reduced per-token rate for the cached portion, typically 50-90% less than the standard rate depending on the provider.
The requirement that trips up most operators is the word “identical.” Not similar. Not semantically equivalent. Byte-for-byte identical. If anything in the cached section changes, even a single character, the cache breaks at that point. Everything after the changed character also misses. This is why context structure matters so much: you want all the content that changes to come after all the content that stays the same.
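To make the prefix rule concrete, here is a short Python sketch (purely illustrative, not anything OpenClaw or your provider actually runs) that measures how many leading bytes two consecutive contexts share, which is all a provider-side cache can reuse:

```python
def common_prefix_len(prev: str, curr: str) -> int:
    """Count the leading bytes two contexts share; only this prefix is reusable."""
    n = 0
    for a, b in zip(prev.encode(), curr.encode()):
        if a != b:
            break
        n += 1
    return n

prev_ctx = "A" * 5000 + "X" + "B" * 5000   # previous call's context
curr_ctx = "A" * 5000 + "Y" + "B" * 5000   # one character changed mid-context

# The 5,000 identical bytes after the change do not count: matching stops
# at the first differing byte.
print(common_prefix_len(prev_ctx, curr_ctx))  # → 5000
```

Nearly the whole context is identical in total, but only the part before the change is cacheable. That is why a single early timestamp can wipe out the entire benefit.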
Look at how my context is currently structured. What comes first in my system prompt and workspace files? Are there large sections that stay the same between every turn, or does the content shift around? Tell me specifically where the stable sections are and where the dynamic content appears relative to them.
How to check your current cache hit rate
Before changing anything, find out whether caching is actually working. OpenClaw tracks cache hits in the session metadata. Your agent can report this directly.
Check my current session statistics. What is the ratio of cached tokens to total input tokens in recent turns? Is prompt caching enabled, and what mode is it set to? Are we getting meaningful cache hits, or is the cache miss rate high?
What to look for in the output:
- Cache hit ratio above 60%: caching is working well. Your stable content is positioned correctly and long enough to meet the threshold.
- Cache hit ratio between 20% and 60%: caching is partially working. Something in the early part of your context is changing between turns and breaking the cache before it reaches your longer stable sections.
- Cache hit ratio below 20% or zero: caching is not working. Either it is disabled, the stable sections are too short to meet the minimum threshold, or dynamic content is breaking the cache very early.
Provider differences in caching
As of March 2026, prompt caching is supported by Anthropic (all Claude models), OpenAI (GPT-4o and newer), and some models via OpenRouter. DeepSeek’s API does not currently support prompt caching in the same way. If your primary model is DeepSeek, caching will not reduce your input token costs regardless of how your context is structured. Check which model you are using before spending time on caching optimization.
The three causes of poor cache hit rates
Cause 1: dynamic content appears too early
The most common cause. Your system prompt or one of the early workspace files includes something that changes between turns: the current date and time, today’s active task list, the latest memory recall results, a recent news summary. That changing content breaks the cache at the point where it appears. Everything after it in the context also misses because the cache only works on content that matches from the beginning up to the break point.
Here is a concrete example. Suppose your system prompt says “Current date and time: Tuesday March 24, 2026 4:00 AM” near the top. The next turn, that timestamp has changed. The cache breaks immediately. Your 8,000-token SOUL.md and AGENTS.md that come later in the context never get a chance to be cached, even though they are identical between turns. You pay full price for all of it.
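The effect of timestamp placement can be simulated in a few lines of Python (an illustration of the matching rule, not OpenClaw internals; the byte strings stand in for real file contents):

```python
def cacheable_fraction(prev: bytes, curr: bytes) -> float:
    """Share of the current context covered by the identical leading prefix."""
    n = 0
    for a, b in zip(prev, curr):
        if a != b:
            break
        n += 1
    return n / len(curr)

stable = b"SOUL.md and AGENTS.md instructions ... " * 200   # ~8 KB, never changes
t1 = b"Current date and time: Tuesday March 24, 2026 4:00 AM\n"
t2 = b"Current date and time: Tuesday March 24, 2026 4:05 AM\n"

timestamp_first = cacheable_fraction(t1 + stable, t2 + stable)
timestamp_last  = cacheable_fraction(stable + t1, stable + t2)
print(f"timestamp first: {timestamp_first:.1%} cacheable")
print(f"timestamp last:  {timestamp_last:.1%} cacheable")
```

Moving the timestamp from the front to the back takes the cacheable share from under 1% to over 99% without changing a single word of content.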
Read my system prompt and all workspace files that load at session start. Find any content that changes between sessions or between turns within a session: current date, current time, active tasks, recent memory recalls, today’s news, heartbeat results, any value that gets updated regularly. List every section that contains dynamic content and tell me where in the context order it currently appears.
Once you have that list, the fix is simple: move everything dynamic to the end. The structure you are aiming for is:
- All permanent instructions (SOUL.md, AGENTS.md, persona files)
- All stable workspace files (INFRASTRUCTURE.md, TOOLS.md, any reference files that rarely change)
- Tool definitions (these are injected by OpenClaw and stay stable)
- Dynamic content: current date, active tasks, memory recall results, anything that changes
- Conversation history (this is always at the end)
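As a sketch, that assembly order looks like this (hypothetical code — OpenClaw builds the context itself; the function and file names simply mirror the list above):

```python
from datetime import datetime

STABLE_FILES = ["SOUL.md", "AGENTS.md", "INFRASTRUCTURE.md", "TOOLS.md"]

def build_context(file_contents: dict[str, str], tool_defs: str,
                  dynamic: str, history: str) -> str:
    # Stable block first so the cacheable prefix is as long as possible;
    # anything that changes between turns goes after it.
    stable = "\n\n".join(file_contents[name] for name in STABLE_FILES)
    return "\n\n".join([stable, tool_defs, dynamic, history])

ctx = build_context(
    {name: f"contents of {name}" for name in STABLE_FILES},
    tool_defs="tool definitions (injected by OpenClaw, stable)",
    dynamic=f"Current date: {datetime.now():%A, %B %d, %Y}\nActive tasks: ...",
    history="User: good morning",
)
assert ctx.index("SOUL.md") < ctx.index("Current date")  # stable precedes dynamic
```

The point is the ordering, not the code: everything before the first byte that changes between turns is the only part the provider can ever cache.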
Based on what you found about my dynamic content, what changes would I need to make to move all dynamic sections to the end of my context? Show me the current order and the recommended order. If any of the dynamic content is injected automatically by OpenClaw rather than coming from my files, tell me that too so I know what I can and cannot control.
Cause 2: stable sections are too short
Prompt caching has a minimum token threshold before it activates. If your stable content does not meet this threshold, no caching happens regardless of how perfectly structured your context is. As of March 2026, Anthropic’s minimum cacheable block is 1,024 tokens for their standard models. OpenAI’s threshold varies by model.
If your system prompt is a few short paragraphs and your workspace files are minimal, you may simply not have enough stable content to trigger caching. The solution is not to pad your files with filler. It is to add genuinely useful stable content: detailed instructions that your agent needs to follow consistently, thorough documentation of your infrastructure, comprehensive tool guides, or reference files that your agent reads at every session.
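A rough way to sanity-check this yourself, assuming the common ~4 characters per token heuristic (an approximation, not a real tokenizer) and Anthropic's 1,024-token minimum:

```python
MIN_CACHEABLE_TOKENS = 1024  # Anthropic's minimum for standard models

def rough_token_count(text: str) -> int:
    # ~4 characters per token is a common heuristic, not an exact tokenizer
    return len(text) // 4

stable = "system prompt plus stable workspace files " * 80  # stand-in content
tokens = rough_token_count(stable)
if tokens >= MIN_CACHEABLE_TOKENS:
    print(f"~{tokens} tokens of stable content: large enough to cache")
else:
    print(f"~{tokens} tokens: below the minimum, caching will not engage")
```

The stand-in content here estimates out below the threshold, which is exactly the situation this section describes: perfectly ordered context, zero cache hits.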
How long is my current system prompt in tokens? How many tokens do my stable workspace files add at the start of each session? Is the total large enough to meet the caching minimum threshold for the models I am using? If not, what legitimate content could I add to reach the threshold?
Do not pad to hit the threshold
Adding empty space, repeated content, or meaningless instructions just to reach the caching threshold wastes tokens on every turn regardless of whether caching is working. It increases the size of every prompt and costs more than it saves. If your stable content is genuinely short, caching is not the right optimization for your setup. Model routing to cheaper or local models will have higher ROI for small-context setups.
Cause 3: wrong caching mode for your setup
OpenClaw offers two caching modes: short and full. Understanding the difference matters for choosing the right one.
short mode caches the system prompt and the beginning of the conversation context, targeting the highest-value stable block: the system prompt and the workspace files that load every session. This is the right choice for most setups because that block is almost always the largest stable section and the one most worth caching.
full mode attempts to cache more of the context, including longer portions of the conversation history. This sounds better but it is often worse in practice. Conversation history changes every turn by definition. If OpenClaw tries to cache a large portion of the conversation and most of it changes between turns, the cache miss rate is high and the overhead of attempting to cache content that does not match costs slightly more than not caching at all.
What is my current prompt caching mode set to? If it is set to full, check whether my conversation history typically contains long stable sections between turns, or whether it changes substantially each turn. Based on that, would short mode likely give better cache hit rates for my typical usage pattern?
For almost every standard OpenClaw setup, short mode is the correct choice. Switch to it and leave it unless you have an unusual setup where long sections of your conversation history genuinely stay identical across many turns.
Verifying the fix worked
After making structural changes to your context order or switching caching modes, run a few turns and then check the cache hit rate again. The improvement should be visible within the same session.
After the last few turns we have had in this session, what is the current cache hit ratio? Is it higher than before we made the structural changes? Which sections of the context are getting cache hits now, and is there anything still breaking the cache that we have not fixed yet?
If the cache hit rate improved but is still below 60%, there is likely one more dynamic section you have not moved yet. Ask your agent to trace exactly where in the context the cache breaks on the most recent turn, and move whatever is there.
What caching is and is not worth optimizing for
Prompt caching is a meaningful cost reduction if your system prompt and stable workspace files are large. An operator with a 10,000-token system prompt and 5,000-token workspace files who gets a 70% cache hit rate is saving a significant amount per turn compared to an operator with a 2,000-token system prompt who gets 80% cache hits. The absolute savings depends on the size of the stable block, not just the hit rate.
For operators with small context setups, caching optimization has diminishing returns. If your total stable content is under 2,000 tokens, the savings from caching are modest even with a perfect hit rate. In that case, model routing to local models or cheaper API models will have a much higher impact on your total costs than caching structure.
Compare my two biggest cost reduction opportunities right now: prompt caching optimization versus model routing to cheaper or local models. Based on my current context size, cache hit rate, and model usage, which one would save me more per day? Give me a rough estimate for each.
What your provider charges for cache hits versus misses
The cost structure for cached versus uncached tokens varies by provider. As of March 2026:
- Anthropic (Claude models): cache writes cost approximately 25% more than standard input token pricing; cache reads cost approximately 10% of it. Net result: the write premium is a one-time cost per session, and the cheap cache reads that recur every turn recoup it quickly.
- OpenAI (GPT-4o and newer): cached tokens are charged at 50% of the standard input token rate. There is no additional write cost. Simpler math, slightly lower savings than Anthropic’s model.
- DeepSeek: does not use the same caching mechanism as of March 2026. The per-token cost is lower than the other providers to begin with, so caching optimization is less relevant for DeepSeek-based setups.
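Those multipliers make the break-even math easy to sketch. The snippet below assumes an illustrative $3 per million input tokens and the Anthropic-style multipliers quoted above (1.25x writes, 0.10x reads); substitute your model's actual rates:

```python
WRITE_MULT, READ_MULT = 1.25, 0.10   # Anthropic-style multipliers (assumed)

def cached_cost(tokens: int, turns: int, rate_per_token: float) -> float:
    """One cache write on the first turn, a cache read on every later turn."""
    return tokens * rate_per_token * (WRITE_MULT + READ_MULT * (turns - 1))

def uncached_cost(tokens: int, turns: int, rate_per_token: float) -> float:
    return tokens * rate_per_token * turns

rate = 3.00 / 1_000_000   # illustrative $3 per million input tokens
for turns in (1, 2, 5, 30):
    c = cached_cost(15_000, turns, rate)
    u = uncached_cost(15_000, turns, rate)
    print(f"{turns:>2} turns: cached ${c:.4f} vs uncached ${u:.4f}")
```

Caching loses slightly on a one-turn session (the write premium with no reads to recoup it) and is already ahead by turn two.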
Check your provider’s current pricing
Cache pricing is one of the more frequently updated areas of AI provider pricing. The figures above are accurate as of March 2026 but are worth verifying directly on your provider’s pricing page before making optimization decisions based on specific numbers. Paste “What are the current cached versus uncached token prices for [provider name]?” into your agent and it will look it up for you.
Provider-specific caching nuances
Not all AI providers implement prompt caching the same way. The differences affect how you should structure your context and what you can expect in terms of savings.
Anthropic (Claude models)
Anthropic’s caching is the most sophisticated of the major providers as of March 2026. It uses a two-part pricing model: cache writes cost more than standard input tokens, cache reads cost significantly less. The cache is session-scoped and resets when the session ends. The minimum cacheable block is 1,024 tokens. Matching is byte-for-byte identical, not semantic.
Anthropic’s caching works best when your stable context is large and your sessions are long. The cache write cost is amortized across many cache reads. For a 15,000-token system prompt with a 30-turn session, the first turn pays the cache write premium, the next 29 turns pay the low cache read rate. The net savings are substantial.
OpenAI (GPT-4o and newer)
OpenAI’s caching is simpler: cached tokens are charged at 50% of the standard input token rate. There is no separate write cost. The cache is also session-scoped. The minimum cacheable block varies by model but is typically in the 1,000-2,000 token range.
OpenAI’s model is easier to reason about but offers lower absolute savings than Anthropic’s model for long sessions. The 50% discount applies to every cached token, not just reads after a write. This means the first turn also gets the discount if the content matches a previous session’s cache (which it rarely does because sessions are independent). In practice, OpenAI caching saves less per session than Anthropic caching for the same stable block size and turn count.
DeepSeek
As of March 2026, DeepSeek’s API does not support prompt caching in the same way as Anthropic and OpenAI. The per-token pricing is lower to begin with, which reduces the incentive for caching optimization. If your primary model is DeepSeek, focus on model routing to cheaper models within the DeepSeek family rather than caching structure.
OpenRouter
OpenRouter passes through caching support from the underlying provider. If you are using OpenRouter to access Claude models, you get Anthropic’s caching behavior. If you are using it to access GPT-4o, you get OpenAI’s caching behavior. The caching settings in OpenClaw apply to OpenRouter connections the same way they apply to direct provider connections.
Which AI providers am I currently configured to use in OpenClaw? For each provider, does it support prompt caching? If yes, what is the caching pricing model and minimum block size? If no, what is the alternative cost reduction strategy I should focus on instead?
Check your provider’s documentation
Caching implementation details change more frequently than standard pricing. The information above is accurate as of March 2026 but should be verified against your provider’s current documentation. Paste “What is the current prompt caching pricing and behavior for [provider name]?” into your agent and it will look up the latest information for you.
Common questions
How do I know if prompt caching is actually enabled on my setup?
Paste “Is prompt caching currently enabled in my OpenClaw configuration? What mode is it set to? How do I check whether it is actually producing cache hits in my current session?” into your agent. Your agent will read the config and check the session metadata. If caching is enabled but showing zero cache hits after several turns, something structural is preventing it from working, and the earlier sections of this article cover the likely causes.
Does prompt caching work with local Ollama models?
No. Prompt caching as described in this article is a feature of hosted API providers: Anthropic, OpenAI, and similar. Local Ollama models do not charge per token, so caching is not relevant to them from a cost perspective. Ollama does maintain a model context in memory between calls (controlled by the OLLAMA_KEEP_ALIVE setting), which means it does not reload the model for each request, but this is not the same as prompt caching and does not reduce the computation cost of processing a large context window.
My cache hit rate improved but my bill did not change. Why?
Cache hits reduce input token costs. If your costs are driven by output tokens rather than input tokens, better cache hit rates will not move your bill much. Output tokens are not cached because they are generated fresh each turn. If you have long responses and short prompts, output token optimization (through shorter, more targeted prompts) will have more impact than caching structure. Paste “What percentage of my recent API costs come from input tokens versus output tokens?” into your agent to find out which side is driving your spend.
Can I force OpenClaw to cache specific sections of my context?
Caching is handled by the provider, not by OpenClaw directly. As of March 2026, Anthropic allows explicit cache control hints that you can add to your messages to tell the model which sections to prioritize for caching. OpenClaw’s short and full modes configure how aggressively it tries to cache. Beyond those settings, the most effective thing you can do is structural: keep stable content early, keep dynamic content late, and ensure your stable block is large enough to meet the minimum threshold.
What happens if I switch from full to short caching mode? Will anything break?
Switching modes only affects cost and cache hit rates. Nothing in your agent’s behavior changes. The agent still has access to the same context. If you were relying on full mode to cache long conversation histories, switching to short mode means those sections are no longer cached, which means they cost the standard input token rate again. But as noted earlier, caching long conversation histories is often counterproductive because they change every turn. Most operators switching from full to short see their cache hit rate go up, not down.
Does the context checkpoint file affect caching?
Yes. If your agent writes a context checkpoint file and reads it at the start of each session, that file is part of your stable context as long as the content does not change between sessions. A checkpoint that gets updated every session means the section of the context that loads that file changes, which breaks the cache at that point. If caching is a priority, consider loading the checkpoint only when it changes significantly rather than loading it every session regardless of whether anything has changed.
How much can I realistically expect to save by optimizing caching?
The realistic savings range is wide because it depends on three variables: the size of your stable context, the cache hit rate you achieve, and which provider you use. For an operator with a 15,000-token stable system prompt who goes from 10% to 70% cache hits using Anthropic Claude, the per-session savings in input token costs can be substantial. For an operator with a 2,000-token system prompt and mostly short conversations, the savings from caching optimization are minor regardless of the hit rate. The honest answer is: run the comparison prompt from the “What caching is and is not worth optimizing for” section above and let your agent calculate the actual numbers for your specific usage.
I restructured my context and my cache hit rate went up, but I am now getting slightly worse responses. What happened?
Moving dynamic content to the end of the context changes what the model sees first when processing your prompt. If you moved something that gave the model important context for the current turn to a later position, the model may be responding with less relevant information because it is processing the stable historical content before the current-turn context. If this happens, check whether the dynamic content you moved includes anything the model genuinely needs early in the prompt to do its job. Time-sensitive instructions or current-task context may need to stay near the top even if it costs cache hits. The goal is cost efficiency without quality degradation, not cache hits at any cost.
Measuring your actual savings from caching
Knowing your cache hit rate is not the same as knowing how much money caching is actually saving you. Cache hits reduce input token costs. Output tokens are never cached. If your typical session is input-heavy, caching has a large impact. If it is output-heavy, the savings are smaller than the hit rate suggests.
For my recent sessions, what is the breakdown of input tokens versus output tokens? Of the input tokens, approximately what percentage were served from cache versus processed fresh? Based on the current pricing for my model, how much did caching save compared to what it would have cost without caching?
If your agent cannot retrieve precise token breakdowns from the session metadata, a reasonable estimate is possible from the session configuration. An operator with a 15,000-token system prompt and workspace context running Claude Sonnet at 30 turns per session can estimate cache savings by calculating the difference between (15,000 tokens x 30 turns x standard input rate) and (15,000 tokens x cache write rate + 15,000 tokens x 29 turns x cache read rate). The difference is significant once the stable block is large.
The practical takeaway: caching optimization pays off most when your stable context is large and your session turn count is high. A five-turn session with a small system prompt sees modest savings. A 40-turn session with a large system prompt can see 40-60% input token cost reduction from caching alone.
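That estimate can be written out as plain arithmetic (relative units, so the per-token rate cancels; the 1.25x/0.10x multipliers are the Anthropic-style figures quoted earlier and should be verified against current pricing):

```python
WRITE_MULT, READ_MULT = 1.25, 0.10   # relative to the standard input rate

stable_tokens, turns = 15_000, 30
without_cache = stable_tokens * turns                      # every turn full price
with_cache = (stable_tokens * WRITE_MULT                   # one write on turn 1
              + stable_tokens * (turns - 1) * READ_MULT)   # 29 cheap reads
savings = 1 - with_cache / without_cache
print(f"stable-block input cost is {savings:.0%} lower over {turns} turns")
```

The saving applies only to the stable block; conversation history and output tokens are still charged normally, which is why the whole-session figure lands in the 40-60% range rather than the raw stable-block percentage.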
How session restarts affect your cache
Prompt caching is session-scoped at the provider level. Every time a new session starts, the cache begins fresh. There is no persistent cache that carries over from yesterday’s session to today’s. This means the cache write cost happens once per session, and the cache read savings accumulate across the turns within that session.
Long sessions benefit more from caching than short ones. A session with 2 turns gets one cache write and one cache read on the stable block. A session with 30 turns gets one cache write and 29 cache reads. The per-session ROI is proportional to the number of turns after the first.
Cron jobs and caching
Isolated cron jobs (sessionTarget: "isolated") start a fresh session for each run. There is no multi-turn cache accumulation for isolated jobs. Each run pays the cache write cost and gets zero cache read benefit. Caching optimization for isolated cron jobs is therefore much less valuable than for long interactive sessions. If your primary cost driver is cron job API calls, model routing to local models (from the Spend Limits article) will have far higher impact than caching structure.
Look at my usage pattern. Do most of my API costs come from long interactive sessions, or from many short isolated cron job sessions? Based on that, is prompt caching optimization or model routing a higher priority for reducing my costs?
A practical walkthrough: fixing a low-cache-hit setup
The following is what the process looks like for a typical operator who enabled caching but is getting poor results. Walk through this sequence in order.
Step 1: establish your baseline
Tell me my current cache hit rate based on recent session data. What is my system prompt length in tokens? What workspace files load at session start and how long is each one? Is anything in those files dynamic, meaning it changes between sessions?
Write down the numbers: total stable tokens, current hit rate, rough session turn count. These are your before numbers.
Step 2: identify what is breaking the cache
Go through my system prompt and each workspace file that loads at session start. For each section, tell me: does this content stay the same between sessions, or does it change? If it changes, what specifically changes and how often?
What you are looking for: any file or section that includes timestamps, recent events, current task status, memory summaries, or anything else that gets updated regularly. Those are your cache-breaking sections.
Step 3: restructure
Based on what you found: which sections should load first for maximum cache hits, and which should load last? If any dynamic content is currently embedded inside a stable file, does it need to be separated into its own file so the stable portion can be cached independently?
Apply the changes your agent recommends. You may need to split a file that currently mixes stable and dynamic content into two files: one that is entirely stable and loads early, one that is entirely dynamic and loads last.
Step 4: verify and measure
Run a normal session for a few turns, then check the cache hit rate again. Compare to your baseline. If the rate improved significantly, the restructuring worked. If it improved only slightly, there is likely one more dynamic section you missed. Ask your agent to trace exactly where the cache breaks on the most recent turn.
Based on the last few turns in this session, where exactly is the cache breaking? Is there still any dynamic content appearing before the large stable sections? What would the estimated cost savings be per session if the cache hit rate stayed at its current level for a typical 20-turn session?
When to stop optimizing
A 60-70% cache hit rate is a realistic and good result for most setups. Chasing 90%+ often requires removing all dynamic context from early in the prompt in ways that reduce output quality. If your agent is producing worse responses after caching optimization, you have gone too far. Roll back until the quality is restored, accept the slightly lower cache hit rate, and treat the quality-adjusted hit rate as your practical ceiling.
My system prompt is split across multiple files. Does that affect caching?
Yes, but only if the files are loaded in a different order between sessions. Caching works on the concatenated context in the order it appears. If your system prompt loads SOUL.md then AGENTS.md then USER.md in that order every session, the combined block is stable and cacheable. If the order changes, the combined block changes and the cache breaks. The fix is to ensure your workspace file loading order is consistent. Paste “In what order are my workspace files loaded at session start? Is that order consistent across sessions?” to check.
Does caching work with streaming responses?
Yes. Caching affects input token processing, not output generation. Streaming responses still benefit from cached input tokens. The cache hit rate is calculated the same way regardless of whether the response streams or arrives as a single block.
Can I see which specific sections of my context are being cached?
OpenClaw does not currently expose a per-section cache breakdown. You can infer which sections are being cached by checking where the cache breaks. Ask your agent: “Based on the most recent turn, at what position in the context did the cache break? What content appears at that position?” That tells you what changed and broke the cache. Everything before that position is being cached successfully.
Does the model matter for caching effectiveness?
Yes. Larger models with higher per-token costs benefit more from caching in absolute dollar terms. Claude Opus at $15 per million input tokens sees larger absolute savings from a 70% cache hit rate than Claude Haiku at $0.25 per million. The percentage savings are similar, but the dollar impact is greater with expensive models. If you are using a flagship model, caching optimization is worth more of your time.
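A quick comparison under assumed conditions (a 20,000-token cached block, 70% hit rate, Anthropic-style ~0.10x read rate, write premium ignored for simplicity):

```python
def per_turn_savings(tokens: int, price_per_million: float,
                     hit_rate: float, read_mult: float = 0.10) -> float:
    """Input-token dollars saved per turn from the cached share of the context."""
    full = tokens * price_per_million / 1e6
    cached = full * (1 - hit_rate) + full * hit_rate * read_mult
    return full - cached

for name, price in [("Claude Opus", 15.00), ("Claude Haiku", 0.25)]:
    saved = per_turn_savings(20_000, price, 0.70)
    print(f"{name}: ${saved:.4f} saved per turn at a 70% hit rate")
```

Same hit rate, sixty times the dollar impact, purely because the per-token price is sixty times higher.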
What happens if I update my system prompt mid-session?
The cache breaks at the point where the updated content appears. If you change a single sentence in your 10,000-token system prompt, the cache breaks at that sentence. Everything before it remains cached, everything after it misses. This is why it is better to update system prompts between sessions rather than during them: you avoid paying the cache write cost twice in the same session.
The one-hour caching optimization checklist
If you have one hour to spend on caching optimization, follow this sequence. It covers the highest-impact changes in the right order.
- Check your current cache hit rate (5 minutes). Paste the command from the “How to check your current cache hit rate” section. Write down the number.
- Identify dynamic content (10 minutes). Paste the command from the “Cause 1: dynamic content appears too early” section. Make a list of everything that changes between turns.
- Move dynamic content to the end (15 minutes). Restructure your context so all dynamic content loads after all stable content. This is the single most impactful change for most setups.
- Verify caching mode (2 minutes). Check that caching is set to short mode, not full, unless you have a specific reason for full mode.
- Run a test session (10 minutes). Have a normal conversation for 5-6 turns, then check the cache hit rate again. Compare to your baseline.
- Calculate the savings (5 minutes). Use the command from the “Measuring your actual savings from caching” section to estimate the per-session dollar impact.
- Decide whether to continue (3 minutes). If your cache hit rate is now above 60% and the savings are meaningful for your usage, stop. If not, trace where the cache is still breaking and consider whether fixing it is worth the quality tradeoff.
That sequence takes under an hour and addresses the majority of caching problems. The remaining optimizations (splitting files, adjusting minimum block sizes, provider-specific tuning) have diminishing returns. For most operators, the checklist above is enough.
Based on everything we have covered in this article, what is the single most impactful caching optimization I could make right now? How long would it take, and what is the estimated per-session savings after making it?