I turned on prompt caching and my bill didn’t change at all

Prompt caching has no feedback loop. When it is working, nothing tells you. When it is not working, nothing tells you either. There is no log entry, no dashboard indicator, no error. You enable it and either your bill drops over time or it stays flat, and you have no way to know which is happening or why. Here is how to figure out which situation you are in, and what to do about it. The fix is usually one of three things.

TL;DR: Prompt caching only works on Anthropic direct API, only saves money in long sessions, and only fires reliably when your system prompt prefix is stable. If your bill has not moved after enabling it, one of those three conditions is not met. This article shows you how to diagnose which one and fix it.
Before you start: The blockquotes throughout this article are commands you paste directly into your OpenClaw chat. Your agent will run them and report back. You do not need to open a terminal or edit any files manually. Manual fallbacks are provided in blue boxes wherever a config file edit is needed.

What prompt caching actually does

Every time your agent responds, it re-reads its entire system prompt from scratch. That system prompt (your SOUL.md, your AGENTS.md, your rules and context) costs input tokens every single turn, even if nothing in it changed from the previous turn.

Prompt caching tells the AI provider: remember the top portion of this prompt. Next turn, instead of charging the full input rate to send those same tokens again, charge a small fee to read them from cache.

The economics on Anthropic direct API (the only major provider that reliably supports this as of March 2026):

  • Cache write: 25% more than standard input tokens. You pay a premium on the first call to store the prefix.
  • Cache read: 10% of standard input tokens. Subsequent calls that hit the cache pay only a tenth of the normal rate for those tokens.

If your system prompt is 5,000 tokens and your session runs 20 turns, you pay the write cost once (6,250 token-equivalents) and the read cost 19 times (500 token-equivalents per turn). The breakeven is turn 2 for a stable 5,000-token prompt: the first cache read already recovers the write premium. Beyond that, every turn saves money.
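That arithmetic fits in a few lines if you want to plug in your own numbers. A minimal sketch in token-equivalents, using Anthropic's published multipliers (1.25× for cache writes, 0.10× for cache reads):

```python
# Token-equivalent cost of a session's system prompt, with and without caching.
# Assumes a stable prefix: one cache write on turn 1, cache reads on every later turn.

def prompt_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    if not cached:
        return float(prefix_tokens * turns)
    write = prefix_tokens * 1.25                 # turn 1: write premium
    reads = prefix_tokens * 0.10 * (turns - 1)   # turns 2+: reads at 10%
    return write + reads

for turns in (1, 2, 3, 20):
    plain = prompt_cost(5_000, turns, cached=False)
    cached = prompt_cost(5_000, turns, cached=True)
    print(f"{turns:>2} turns: {plain:>9.0f} vs {cached:>8.0f} token-equivalents")
```

Running it shows the crossover: at 1 turn caching costs more, at 2 turns it is already ahead, and at 20 turns the gap is large.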

That math only holds when three conditions are met. If any one of them is missing, caching does nothing, or in some cases costs slightly more than not having it enabled.

The three conditions caching requires

Condition 1: You are on Anthropic direct API

As of March 2026, prompt caching is supported by Anthropic directly. OpenAI, DeepSeek, Mistral, and local models do not support it. OpenRouter passes Anthropic requests through to Anthropic’s infrastructure. OpenRouter’s documentation states that cache headers are passed through for Anthropic models, but some users report inconsistent behavior. The only way to verify is to inspect raw API responses or switch to Anthropic direct API and compare. If you need reliable caching, Anthropic direct API is the only path where it is confirmed to work consistently.

OpenRouter note: OpenRouter can route to Anthropic models. Whether it passes Anthropic’s cache control headers through correctly depends on the OpenRouter version and request format. If you are on OpenRouter and caching is not working, switching to Anthropic direct API is the fastest diagnostic step. If caching works on direct and not on OpenRouter, the issue is the passthrough, not your config.

Condition 2: Your sessions are long enough

Cache savings accumulate within a session, not across sessions. Every new session (every /new or gateway restart) resets the cache and pays the write cost again. A single-turn session pays the write premium and reads nothing back. Caching breaks even from turn 2 onward, but the savings on 2-3 turn sessions are marginal. For sessions of 10+ turns, the savings are meaningful.

Condition 3: Your system prompt prefix is stable

The cache key is the exact token sequence at the top of your system prompt. If anything there changes between turns (injected timestamps, memory recalls, session state, dynamic context), the cache key changes, the cache misses, and you pay the write cost again for every turn.

This is the condition that most commonly breaks caching without any indication that it is broken. You enable caching, your sessions run 20 turns, and you are paying the write cost on every single turn because the first 500 tokens of your system prompt include a timestamp injection that changes each time.

Step 1: Check your provider

Read my openclaw.json. Tell me which provider my default model routes through. Is it direct Anthropic API, OpenRouter, or something else? Also show me what value promptCaching is currently set to.

Manual fallback: Open ~/.openclaw/openclaw.json. Look for agents.defaults.model. If it starts with anthropic/, you are on Anthropic direct. If it starts with openrouter/, you are via OpenRouter. Also find promptCaching. The valid values are "none", "short", and "full".
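For orientation, the relevant slice of the config looks something like this. This is a hypothetical sketch: the model string is an example, and your file will contain many other keys around these.

```json
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4"
    }
  },
  "promptCaching": "short"
}
```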

If you are not on Anthropic direct, caching is either unavailable or unreliable. Your options: switch to Anthropic direct API for the tasks where caching would help most, or accept that caching is not a tool available in your current setup and focus on routing instead.

Step 2: Check your prompt prefix for dynamic content

Read my SOUL.md and any other files that load into my system prompt at startup. Tell me: what appears near the very top, and does any of it change between turns? Look specifically for: injected timestamps, memory recall blocks, session IDs, dynamic context, or anything that would produce different text from one turn to the next.

Common culprits:

  • Injected timestamps. Some setups inject the current date and time at the top of each prompt. This changes every turn and breaks the cache key.
  • Memory recall blocks. If your memory plugin injects recent memories at the top of the context, those blocks change as memories are added. Move the injection point lower.
  • Session state. Some protocols inject the current task, active session, or last action at the beginning of the prompt. This changes frequently and costs you cache hits.
  • Webhook or event data. If your setup injects incoming message content into the system prompt, that is different every turn.

The fix is structural: put stable content first, dynamic content last. Your persona, rules, workspace structure, and constraints should come before any injected context. The cache covers the stable prefix up to the first point of change. Everything after that point is not cached.
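The principle can be illustrated with a sketch. The helper names here are hypothetical; the only point is the ordering of stable versus dynamic content:

```python
# Assemble the system prompt stable-first. The provider caches the longest
# unchanged leading span, so everything above the first dynamic byte is cacheable.
from datetime import datetime, timezone

def build_prompt(persona: str, rules: str, memories: list[str]) -> str:
    stable = persona + "\n" + rules           # identical every turn -> cacheable
    dynamic = (
        "Recent memories:\n" + "\n".join(memories)
        + "\nCurrent time: " + datetime.now(timezone.utc).isoformat()
    )                                         # changes every turn -> not cacheable
    return stable + "\n" + dynamic            # stable prefix first, dynamic last

# Putting the timestamp or memory block first instead would shrink the
# cacheable prefix to zero, even though the total prompt is unchanged.
```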

Review the structure of my system prompt. Propose a reordered version where stable content (persona, rules, constraints) appears at the top and dynamic content (memory injections, timestamps, session state) appears at the bottom. Show me the proposed reorder before making any changes to my files.

Ask to see the changes before they are written. Moving content to a different position in the system prompt can affect how your agent prioritises rules. Review the proposed reorder before applying it.

Step 3: Check your session length

If your provider is Anthropic and your prompt prefix is stable, the remaining variable is session length. Check what your typical session looks like:

Review the last 5 sessions in my session history. On average, how many turns does each session run before I start a new session or use /new? Give me the range and median.

If your median session is only a turn or two, caching does little for you: a single-turn session pays the write premium with nothing read back, and a 2-3 turn session barely clears breakeven. For long-running sessions (10+ turns), caching makes a meaningful difference.

One exception: cron jobs and pipelines. If you have cron jobs that run 20-30 turns before completing, those are excellent caching candidates even if your interactive sessions are short. The key is whether the individual session is long enough, not whether all sessions are.

Step 4: The right setting for most setups

promptCaching: "short" caches your system prompt prefix with no additional configuration. This is the right starting point for most operators. It covers the highest-value use case (expensive system prompts that repeat across many turns) without requiring any cache control setup.

promptCaching: "full" caches more of the conversation context, including tool results and intermediate reasoning steps. It requires explicit cache control breakpoints to work correctly. If misconfigured, it does nothing useful. Do not use "full" until you have confirmed "short" is working and you have a specific reason to go further.

Set promptCaching to “short” in my openclaw.json. Restart the gateway. Then read the config back and confirm the setting is applied.

Manual fallback: Open ~/.openclaw/openclaw.json. Find "promptCaching" and set it to "short". If the field is missing, add it at the top level of the config object. Save the file, then restart OpenClaw: sudo systemctl restart openclaw on Linux (system service), systemctl --user restart openclaw on Linux (user service), or stop and restart the process on macOS.
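If you prefer scripting the edit, the change is a single top-level key. A minimal sketch, assuming the default config location and standard JSON structure:

```python
# Set promptCaching in an openclaw.json file, preserving all other keys.
import json
from pathlib import Path

def set_prompt_caching(config_path: Path, value: str = "short") -> dict:
    assert value in ("none", "short", "full")   # the only valid settings
    config = json.loads(config_path.read_text())
    config["promptCaching"] = value             # added at the top level if missing
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Usage: set_prompt_caching(Path.home() / ".openclaw" / "openclaw.json", "short")
# Then restart the gateway so the setting takes effect.
```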

How to verify caching is actually firing

After enabling caching and restructuring your prompt prefix, verify it is working. Anthropic’s API response includes cache read and write token counts in the usage object. OpenClaw does not surface these in the chat UI, but your agent can retrieve them from the raw API response.

After your next response, tell me: what did the Anthropic API return for cache_creation_input_tokens and cache_read_input_tokens in the usage object? If these fields are not accessible, tell me that.

If your agent cannot retrieve cache token counts, run a 10-turn test session with caching enabled and compare the estimated token usage to a similar session without caching. The difference should be visible in your provider dashboard’s usage breakdown.

What the numbers mean:

  • cache_creation_input_tokens > 0, cache_read_input_tokens = 0: Caching is writing but nothing is being read back yet. Normal for the first turn of a session.
  • cache_read_input_tokens > 0 on turns 2+: Caching is working. Those tokens were served from cache at 10% of normal input cost.
  • Both are 0 after several turns: Caching is not firing. Either you are not on Anthropic direct, your prompt prefix is unstable, or the cache setting did not apply.

If both values are 0 and you are on Anthropic direct: the most common cause is that the cache setting did not take effect because the session started before the config change. Start a fresh session with /new after the gateway restart, then check again.
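The decision table above reduces to a few branches. A sketch, where the field names come from Anthropic's usage object and the wrapper function is hypothetical:

```python
# Classify one turn's cache behavior from Anthropic's usage counters.
def cache_status(usage: dict, turn: int) -> str:
    wrote = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    if read > 0:
        return "cache hit: prefix served at the 10% read rate"
    if wrote > 0 and turn == 1:
        return "cache write: normal for the first turn of a session"
    if wrote > 0:
        return "writing every turn: prefix is unstable, cache never hits"
    return "cache not firing: check provider, prefix stability, and config"
```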

When to turn caching off

Three situations where caching is costing you rather than saving you:

Short interactive sessions

If your sessions are overwhelmingly single-turn (one message, one response, then /new), every call pays the write premium and nothing is ever read back. That is the one session profile where caching is a pure loss; 2-3 turn sessions technically clear breakeven but save very little. Turn it off: set promptCaching to "none".

Dynamic prompt prefix

If you cannot restructure your system prompt to put stable content first because your setup depends on dynamic injections near the top, caching will pay the write premium on every turn and return nothing. Turn it off until you can fix the structure.

Non-Anthropic models

If your setup routes primarily to DeepSeek, OpenAI, or local models, caching has no effect on those calls. The setting is harmless but pointless. Turn it off to avoid confusion when auditing your config.

Set promptCaching to “none” in my openclaw.json and restart the gateway.

How caching interacts with model routing

Caching and routing are complementary, not substitutes. Routing reduces the number of expensive model calls. Caching reduces the cost per expensive model call. If you are using Anthropic models for your complex tasks (which is appropriate), applying caching to those sessions compounds the savings from routing.

Practical ordering: fix routing first, then add caching on top. Routing has broader impact: it applies to all providers and all task types. Caching only helps on Anthropic calls in long sessions. Get the routing right first, then layer caching onto the Anthropic calls that remain.

The math: when does caching actually save money?

Before configuring caching, it is worth understanding the exact breakeven point for your setup. The numbers are simple and the calculation takes less than a minute.

Using Anthropic Sonnet 4 pricing as of March 2026 ($3.00 per million input tokens):

  • Standard input rate: $0.000003 per token
  • Cache write rate: $0.000003 × 1.25 = $0.00000375 per token (25% premium)
  • Cache read rate: $0.000003 × 0.10 = $0.0000003 per token (90% discount)

For a 5,000-token system prompt prefix:

  • Without caching, per turn: 5,000 × $0.000003 = $0.015
  • With caching, turn 1 (write): 5,000 × $0.00000375 = $0.01875
  • With caching, turn 2+ (read): 5,000 × $0.0000003 = $0.0015

The breakeven is at turn 2. After just two turns, caching has paid for its write premium and starts saving money on every subsequent turn. A 20-turn session with a 5,000-token prefix saves:

  • Without caching: 20 turns × $0.015 = $0.30 in system prompt input tokens
  • With caching: $0.01875 (write) + 19 × $0.0015 (reads) = $0.01875 + $0.0285 = $0.04725
  • Savings: $0.2528 on one 20-turn session, on system prompt tokens alone

These numbers are per session, and they compound. Scale that to 10 long sessions per day: $2.53/day, or $75.83/month, just from prompt caching on a 5,000-token system prompt. For operators with 20,000-30,000 token system prompts, the savings scale linearly.

The reverse calculation also matters. For short sessions, the write premium is proportionally larger relative to the few read savings you get. For a 3-turn session with the same 5,000-token prompt:

  • Without caching: 3 × $0.015 = $0.045
  • With caching: $0.01875 + 2 × $0.0015 = $0.0218
  • Savings: $0.023 per session

Even at 3 turns, caching saves money. The breakeven is at turn 2, not 5 as commonly stated. The 5-turn rule of thumb is conservative. For any session that runs more than 2 turns with a stable prefix on Anthropic direct API, caching pays off. The longer and more frequent your sessions, the larger the cumulative benefit. Daily active operators see the most consistent payoff.
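The worked numbers above fit in a short calculator if you want to plug in your own prompt size and session length. This uses the March 2026 Sonnet 4 input rate quoted above; swap in current pricing before budgeting:

```python
# Dollar cost of the system prompt prefix for one session, cached vs not.
INPUT_RATE = 3.00 / 1_000_000   # $ per input token (Sonnet 4, March 2026)

def session_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    base = prefix_tokens * INPUT_RATE
    if not cached:
        return base * turns
    # Write once at 1.25x, then read the prefix at 0.10x on every later turn.
    return base * 1.25 + base * 0.10 * (turns - 1)

plain = session_cost(5_000, 20, cached=False)   # $0.30
cached = session_cost(5_000, 20, cached=True)   # $0.04725
print(f"saved per 20-turn session: ${plain - cached:.5f}")
```

The printed saving matches the article's figure of roughly $0.25 per 20-turn session on system prompt tokens alone.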

Note on these numbers: Pricing changes over time. The calculation above uses March 2026 Anthropic Sonnet 4 pricing. The ratios (25% premium on write, 90% discount on read) are stable features of Anthropic’s caching design, but the base price per token varies by model. Check platform.anthropic.com/pricing for current rates before budgeting.

What breaks the cache key: a complete list

Any of these in the stable prefix portion of your system prompt will invalidate the cache key and cause a cache miss on every turn:

Dynamic content injected at the top of context

Current time or date. If your system prompt starts with “Current time: Tuesday March 22 2026 9:47 PM UTC” (as many OpenClaw setups do), this changes every turn and breaks the cache key. Move date/time injections to the bottom of the context, or accept that your caching will not fire on that session layer. The date/time injection is often added by OpenClaw itself as system metadata.

Memory recall blocks. If your memory plugin injects “Relevant memories: [list of memories]” near the top of each turn, the memory list changes as new memories are added. This is one of the most common caching killers. Move memory injections below your static config sections.

Session or conversation ID. Some setups inject the current session or conversation identifier. This changes every session and prevents caching from working across any turns in the session.

Inbound message metadata. If OpenClaw injects metadata about the current inbound message (sender, channel, timestamp) into the system prompt, this changes with every message and breaks the cache key for every turn.

Structural changes between versions

Every time you edit SOUL.md, AGENTS.md, or any file that loads into the system prompt, the cache key changes. You pay the write cost again on the first turn after the edit. This is expected behavior, not a bug. If you edit these files frequently, the cache write overhead is higher but you still benefit from the read savings within each session run.

OpenClaw version updates

When OpenClaw updates, it restarts the gateway. All active sessions lose their cache state. The first turn after a restart pays the write cost. This is unavoidable but typically low-frequency.

A diagnostic checklist: caching is enabled but bill has not changed

Run through this in order. The first failing check is your fix. Each check takes under a minute.

  1. Are you on Anthropic direct API? If your model path starts with openrouter/, deepseek/, ollama/, or anything other than anthropic/, caching is not available. Switch to Anthropic direct or accept this limitation.
  2. Did you restart the gateway after enabling caching? The setting takes effect on restart. If you edited openclaw.json but did not restart, caching is not active. Restart and start a fresh session.
  3. Did you start a fresh session after the restart? Sessions that started before the config change continue with the old settings. Use /new to start a session that picks up the new config.
  4. Is your prompt prefix stable? Ask your agent to show you the first 500 tokens of the current system prompt. Does anything there change between turns?
  5. Are your sessions long enough? Run a test session for 10+ turns and check the cache token counts in the API response.
  6. Is promptCaching set to "short" or "full"? If it is "none" or missing, caching is disabled.

Run the caching diagnostic on my current setup. Check: (1) which provider I am on, (2) what promptCaching is set to, (3) what appears in the first 500 tokens of the current system prompt and whether anything there changes between turns, (4) what cache token counts the last API response returned. Report findings for each point.

Caching and compaction

OpenClaw’s compaction system and Anthropic’s prompt caching are two separate mechanisms that happen to affect the same resource: the context window. Understanding how they interact avoids surprises in your billing.

Compaction fires when your context window fills up. It summarises the conversation and replaces older turns with a compressed version. This creates a new effective “beginning” for the session context, and the cache key for the compacted prefix differs from the pre-compaction key.

In practice: caching and compaction interact imperfectly. When compaction fires, you pay a write cost for the new compacted prefix on the first post-compaction turn. If compaction fires frequently (every 10-15 turns), the write cost cadence is higher than in a session without compaction, but the cache still delivers net savings on the turns between compaction events.

For operators using LCM (Lossless Context Management) or similar compaction plugins: caching still helps within compaction windows. The turns between one compaction and the next benefit from cache reads on the stable prefix. Just be aware that each compaction event resets the cache for the compacted portion.

Check my current compaction settings. How many tokens are retained after compaction, and approximately how often does compaction fire in my typical sessions? This affects how much benefit I can expect from prompt caching.

How to structure your system prompt for reliable caching

The goal is to put every stable element at the top and every dynamic element at the bottom. Here is the ordering that produces the most consistent cache hits in practice.

Layer 1: Identity and persona (top)

Your name, your persona description, your core behavioral rules. This is the most stable content in your setup and belongs first. Example: the contents of SOUL.md, excluding any session-specific additions you made. This layer changes only when you deliberately edit the persona file.

Layer 2: Workspace structure and file paths

Information about your workspace layout, where files live, what tools are available. This is stable as long as your infrastructure does not change. It belongs in layer 2, after identity. Example: AGENTS.md, TOOLS.md, the workspace path, and any static infrastructure notes.

Layer 3: Constraints and rules

Your operating constraints, safety rules, communication protocols. Stable content that rarely changes. Example: CONSTRAINTS.md, ESCALATION.md, security rules.

Layer 4: Tool definitions (stable)

The tool schemas injected by OpenClaw from installed plugins. You do not control this directly, but it tends to be stable between gateway restarts. The more tools you have installed, the more you benefit from caching this layer.

Layer 5: Dynamic injections (bottom)

Everything that changes turn to turn: current time/date, memory recall blocks, inbound message metadata, session state, active task context. This layer is below the cache boundary and is not cached. Keeping it at the bottom means the cache covers layers 1-4 regardless of what happens here.

This ordering is not arbitrary. Anthropic’s caching mechanism reads the system prompt from the top and caches the longest unchanged prefix it finds. Putting dynamic content in layer 5 means everything above it gets cached every time. Putting a timestamp in layer 1 means nothing gets cached.

The key insight: the cache key is a prefix, not a fingerprint. Anthropic caches the first N tokens of the system prompt. Anything after the stable prefix is not cached. You want the stable-to-dynamic boundary to fall as deep in the prompt as possible.
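A sketch of the prefix idea, simplified to characters (real caching operates on tokens and provider-side breakpoints, but the matching behavior is the same):

```python
# The cache key is a prefix: the cache covers the longest leading span that
# matches the previous request, and nothing after the first differing position.
from os.path import commonprefix

def cached_span(previous: str, current: str) -> str:
    return commonprefix([previous, current])

stable = "PERSONA\nRULES\nTOOLS\n"
turn_1 = stable + "Current time: 09:47 UTC"
turn_2 = stable + "Current time: 09:52 UTC"
shared = cached_span(turn_1, turn_2)
print(len(shared), "of", len(turn_2), "chars shared")   # the full stable block
```

Flip the ordering (timestamp first, stable content after) and the shared span collapses to almost nothing, which is exactly why a timestamp in layer 1 kills the cache.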

Show me the current structure of my system prompt. List the sections in order and estimate which ones are stable (same every turn) vs dynamic (change between turns). Based on that, tell me the estimated stable prefix length in tokens, and whether there is any reordering that would push the stable-to-dynamic boundary deeper.

A practical test: after restructuring, ask your agent to show you the first 200 tokens of the system prompt. If you see persona and rules, the structure is correct. If you see a timestamp, a memory block, or metadata, there is still dynamic content above the stable layer. Repeat the reorder until the first 200 tokens are entirely stable text.

One thing to check: OpenClaw injects its own system metadata (current time, model info, session info) as part of the system prompt, typically near the top. This is generated by OpenClaw itself, not your files. If this metadata appears before your SOUL.md content, it may be breaking your cache key. Check whether you can configure where OpenClaw injects its metadata in your version.

What to do if you cannot move dynamic content lower

Some setups have dynamic content near the top for functional reasons. A memory plugin that injects the last 10 memories at the start of every turn. A standing order that requires current context. An integration that embeds session state in a specific position. If you cannot restructure, you have two options.

Option 1: Accept a shorter cache window

If the dynamic content is small (under 200 tokens) and appears after a large stable block, the stable block before the dynamic content is still cached. The cache boundary is the first change, not the total prompt size. A 10,000-token stable block followed by 100 tokens of dynamic content followed by another 5,000-token stable block: only the first 10,000-token block is cached. You still get significant savings on that first block.
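That boundary rule can be sketched directly: walk the prompt layout in order and stop at the first dynamic segment. The layout representation here is hypothetical, just a way to make the arithmetic concrete:

```python
# Only the stable span before the first dynamic segment is cacheable.
def cacheable_prefix(segments: list[tuple[int, bool]]) -> int:
    """segments: (token_count, is_stable) pairs in prompt order."""
    cached = 0
    for tokens, stable in segments:
        if not stable:
            break                 # the first change ends the cacheable prefix
        cached += tokens
    return cached

# 10k stable, 100 dynamic, 5k stable: the trailing stable block is wasted.
layout = [(10_000, True), (100, False), (5_000, True)]
print(cacheable_prefix(layout))   # 10000
```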

Option 2: Switch to DeepSeek V3 for tasks where caching is not available

If your setup makes Anthropic caching impractical, the most cost-effective alternative is to switch the default model to DeepSeek V3. DeepSeek V3 costs approximately 10x less than Sonnet 4 per token with no caching at all. For most operators, switching the default model to a cheaper provider saves more money than enabling caching on an expensive model. The two approaches are not mutually exclusive: route complex tasks to Anthropic with caching, route simple and routine tasks to DeepSeek without.

Complete fix

Cheap Claw

The complete cost reduction playbook. Every lever, ranked by impact. Routing, caching, local models, and spend caps. Drop it into your agent and it reads the guide and makes the changes.

Get it for $17 →

FAQ

Does prompt caching work with OpenRouter?

Sometimes. OpenRouter routes to Anthropic models, but whether it passes Anthropic’s cache control headers through correctly is not guaranteed and varies by request type and OpenRouter version. If caching is not saving money on OpenRouter, the most reliable diagnostic is to temporarily switch to Anthropic direct API and check whether caching fires there. If it does, the issue is the OpenRouter passthrough. Use Anthropic direct for any sessions where caching matters.

How do I know if a cache hit actually happened?

Anthropic’s API response includes cache_creation_input_tokens and cache_read_input_tokens in the usage object. Ask your agent to report these values after any turn. If cache_read_input_tokens is greater than zero on turns 2 and later, caching is working. If both are consistently zero despite being on Anthropic direct with a stable prompt, the cache setting did not take effect. Start a fresh session and check again.

Will caching help if I use a lot of tools?

Yes, potentially significantly. Tool definitions are part of the system prompt and can be cached if they are in the stable prefix. If you have many plugins enabled, the tool schema overhead is substantial and caching those definitions reduces cost on every subsequent turn. The requirement is the same: tool definitions need to appear in the stable upper portion of the prompt, before any dynamic content.

I moved dynamic content lower. How long before I see the savings?

Cache savings accumulate within a session. After making the structural change, start a fresh session with /new and run it for at least 10 turns. The first turn pays the write cost. Turns 2 onward pay the read cost on the cached prefix. A 3-turn test will not show meaningful savings. The session needs to run long enough for the read savings to compound.

Does caching work for cron jobs?

Yes. If your cron jobs run on Anthropic models and their sessions run multiple turns, caching applies. Each cron job run starts a new session (new cache), so the write cost is paid fresh each run. For a cron job that completes in 2-3 turns, caching probably does not help. For a weekly research task that runs 20+ turns, it does. The same conditions apply as for interactive sessions.

Can I enable caching for some tasks and disable it for others?

promptCaching is a global config setting that applies to all sessions in that gateway instance. There is no per-task or per-session override. If you want caching for some tasks and not others, the only option is to run separate gateway instances with different config (which is complex and not recommended for personal setups) or to leave it enabled globally and accept that short sessions pay a small write cost penalty.

My system prompt is 30,000 tokens. Does caching help more with a larger prompt?

Yes, proportionally more. The cost savings scale with the size of the cached prefix. A 30,000-token stable prefix that caches successfully saves 27,000 token-equivalents per turn on reads versus the standard input rate (a 90% reduction on those tokens). The breakeven turn count does not change with prompt size, because it is a ratio, but the dollars saved on every cached turn scale linearly with the prefix length. Operators with large system prompts get the most benefit from caching.

Does caching affect the quality of responses?

No. Caching is purely a billing mechanism. The model receives the same tokens whether they come from cache or from a fresh input. Response quality is unaffected. The only observable difference is in your API bill.

Does caching interact with OpenClaw’s LCM (Lossless Context Management)?

Yes. LCM compacts the conversation when the context window fills. Each compaction event replaces older conversation turns with a compressed summary. This changes the effective beginning of the context, which resets the cache key for the compacted portion. Caching still helps between compaction events. If your sessions compact every 20 turns, you get cache reads on turns 2-20, pay a write cost at the compaction event, then get cache reads on turns 22+ until the next compaction. The net benefit is still positive for long sessions, just slightly less than for sessions that never compact.

My system prompt is mostly loaded from files. Does that help or hurt caching?

It helps. File-based prompts are stable by default: the file contents only change when you edit the file. The cache key stays consistent across turns as long as the file content does not change. If you edit SOUL.md or AGENTS.md during a session, the cache key changes for the next turn, but you pay the write premium once and then get reads for the rest of that session. Operators who load large instruction files benefit most from caching, because the stable file content is exactly what the cache is designed for. If your setup loads SOUL.md, AGENTS.md, TOOLS.md, and other reference files, and those files total 15,000+ tokens, you are in the best position to see meaningful savings from caching. One caveat: if some of those files load conditionally (e.g., only when a certain plugin is active), the stable prefix length may vary between sessions, which reduces cache consistency.

Is there a way to see cache token counts without asking my agent?

Anthropic’s platform dashboard (platform.anthropic.com) shows token usage in your usage logs, but does not currently break out cache read vs cache write tokens in the standard usage view. The cache token counts are available in the raw API response body. Your agent has access to its last response metadata and can report these counts on request. This is the most accessible way to verify caching without using external tooling.

Does caching apply to tool call results?

With promptCaching: "short", only the system prompt prefix is cached. Tool call results in the conversation history are not cached. With promptCaching: "full", you can cache conversation history including tool results, but this requires explicit cache control breakpoints in the conversation structure. For most operators, "short" is sufficient and "full" adds complexity without proportional benefit.

Go deeper

Structuring your system prompt for maximum cache hits

Where to put static vs dynamic content so the cache fires consistently. The ordering that makes caching work.

Read →

OpenClaw uses your most expensive model for everything, even simple tasks

Caching reduces cost on the calls you keep. Routing reduces the number of expensive calls in the first place.

Read →

I woke up to a $300 OpenClaw bill and had no idea what caused it

Find the source before changing settings. Which model, which tasks, and how to see what is actually running.

Read →