OpenClaw does not break out costs by task. Your provider dashboard shows spend by model, not by what your agent was doing when it ran. If your bill spiked and you have no idea why, that is not an oversight on your part. The visibility just is not there by default. This is how you build it.
Why OpenClaw spend is hard to track by default
When your bill arrives, your provider shows you two things: which model was called, and how many tokens it processed. That is it. There is no breakdown by task, no tag for which cron job fired at 3am, no flag for which plugin made an unexpected call. You get a number and a model name.
Every LLM provider works this way. The problem is that an AI agent running autonomously can generate dozens of separate API calls in a single day, across multiple tasks, some of them scheduled to fire while you are asleep. Without a task log on your side, you are reverse-engineering your own bill from the outside.
The four things that cause unexpected OpenClaw spend, in order of how often they show up in practice:
- Heartbeat tasks running on a flagship model. OpenClaw pings your agent on a schedule to check if anything needs doing. If that ping hits Claude Sonnet or GPT-4, you are paying flagship prices for a status check every few minutes. Overnight, that adds up.
- The wrong default model. If your default model is set to your most capable (and most expensive) model, every task routes through it: file reads, formatting, simple yes/no decisions, everything.
- Cron jobs you forgot about. Cron jobs are scheduled tasks that fire automatically at set intervals (hourly, daily, weekly) without you doing anything. They fire whether or not you are paying attention. A cron job that runs a research task hourly on a flagship model is invisible until the bill arrives.
- Subagents or pipeline tasks that ran longer than expected. Multi-step tasks that spawn subagents multiply your token usage. If a pipeline task loops or retries, the cost multiplies further.
Before you change anything: find the source. Switching models without knowing what caused the spike is just guessing. On our own setup, the culprit turned out to be a heartbeat task hitting a flagship model every five minutes overnight. Without building a task log, we would never have found it.
What normal OpenClaw spend looks like
Before you can identify a spike, you need a sense of what normal looks like. Spend varies significantly depending on how you use OpenClaw, but these rough benchmarks help frame expectations. All figures are based on Claude Sonnet 4 pricing as of March 2026 ($3 per million input / $15 per million output) unless noted.
Light personal use (2-3 conversations per day, no cron jobs, no plugins): $0.20 to $0.80 per day. At this usage level, heartbeat and compaction are the hidden costs to watch.
Active personal use (10+ conversations, 2-3 cron jobs, memory enabled): $1.50 to $4.00 per day on a flagship model. Switching to DeepSeek V3 as default brings this to $0.15 to $0.50 per day for the same workload.
Operator deployment (multiple channels, pipeline tasks, high cron frequency): $5 to $20 per day on flagship models. Mixed routing (flagship for complex tasks, local/cheap for everything else) can reduce this to $0.80 to $3.00 per day without meaningful quality loss on routine tasks.
If your spend is significantly above these ranges for your usage pattern, something is misfiring. If it is within these ranges but higher than you expected, the issue is probably routing rather than a bug.
The single most useful benchmark question to ask yourself: Does my spend scale linearly with my activity? If you used OpenClaw three times as much on Tuesday as Monday, did spend roughly triple? If yes, routing is working as expected and you just need to reduce the per-task cost on tasks that do not need expensive models. If spend went up 10x while activity went up 3x, something fired in the background that does not scale with your usage. That points to a cron job, heartbeat, or plugin issue rather than a routing issue.
Look at my spend-log.md if it exists. Does my estimated token usage scale roughly with my activity level, or are there days where spend seems disproportionate to how much I actually used you? Flag any anomalies you can identify.
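The linearity heuristic above can be expressed as a quick calculation. Nothing here is part of OpenClaw; the numbers are hypothetical day-over-day figures of the kind you would pull from your own log:

```python
# Sketch of the linearity heuristic: did spend grow faster than activity?

def scaling_ratio(activity_a, spend_a, activity_b, spend_b):
    """Return how much faster spend grew than activity between two days."""
    activity_growth = activity_b / activity_a
    spend_growth = spend_b / spend_a
    return spend_growth / activity_growth

# Activity tripled and spend tripled: ratio ~1, routing is working.
print(round(scaling_ratio(10, 1.50, 30, 4.50), 2))

# Spend went up 10x while activity went up 3x: ratio well above 1,
# something in the background does not scale with your usage.
print(round(scaling_ratio(10, 1.50, 30, 15.00), 2))
```

A ratio near 1 points at per-task cost (routing); a ratio well above 1 points at background work such as a cron job, heartbeat, or plugin.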
Step 1: Check your provider dashboard first
Your provider dashboard is the fastest starting point. It will not tell you which task caused the spend, but it will tell you which model is responsible and whether the spike was one day or ongoing. That narrows the search considerably.
Go to your provider usage page now:
- Anthropic: console.anthropic.com/settings/usage
- OpenAI: platform.openai.com/usage
- DeepSeek: platform.deepseek.com/usage
- OpenRouter: openrouter.ai/activity
Look for two things: which model is doing the most damage, and whether the spike was one event or ongoing. These tell you very different things.
If one model is responsible for most of the spend and it is your default: the fix is routing. You are routing tasks to an expensive model that a cheaper model handles just as well. This is the most common cause and the easiest fix.
If spend is spread evenly across models: the problem is volume. Something ran more than you expected. You need the task log in Step 3 to find what.
If the spike was one specific day: look for what changed. Did you run a long research task? Did a pipeline task loop? Did you install a new plugin that made unexpected calls? One-day spikes are a specific event, not a systemic problem.
If spend has been creeping up slowly over weeks: something is running continuously that should not be, or a new recurring task was added without you noticing the cost. Heartbeat and cron job issues present this way.
Show me my current model configuration in openclaw.json. What is set as the default model, the fallback chain, and the heartbeat model? I want to understand which models are active right now.
Your agent will pull the current config and show you exactly what is set. If the heartbeat model is blank or set to your default, that is your first culprit.
Step 2: Understand what OpenClaw charges for
Before you can audit your spend, you need to know all the places OpenClaw makes API calls. There are more of them than most operators expect.
Here is a rough estimate of what a typical bill breakdown looks like for an operator running OpenClaw actively without spend tuning. These percentages vary significantly by setup and are intended as a diagnostic framing, not precise benchmarks:
- Direct conversation (what you think you are paying for): 15-25% of total spend. The interactive messages you send and receive.
- Heartbeat and idle polling: 20-35% of total spend, depending on heartbeat frequency and system prompt size. This is the number that surprises most operators.
- Cron jobs: 15-30% of total spend, depending on how many jobs are active and how complex each task is.
- Memory extraction and compaction: 10-20% of total spend, depending on plugin configuration and how often context fills up.
This means the conversations you are actually having can represent less than a quarter of your total bill. The rest is background infrastructure. Fixing the visible part (model choice for conversations) without addressing the background parts leaves most of the spend untouched.
Interactive messages. Every time you send a message to your agent and it replies, that is an API call. This is the one most people think of, but for active deployments it is not the biggest line item.
Heartbeat polls. OpenClaw pings your agent on a schedule to check if anything in HEARTBEAT.md needs attention. The frequency depends on your config. If the heartbeat model is not set to a local or cheap model, every ping costs money. At the default Sonnet rate as of March 2026 ($3 per million input tokens), a system prompt of 8,000 tokens costs roughly $0.024 per call. At a heartbeat frequency of every 5 minutes, that is 288 calls per day, or approximately $6.91 per day just for idle status checks. Even at every 30 minutes, that is $1.15 per day for status polls that return nothing actionable.
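The arithmetic above generalizes to any prompt size and polling interval. A small sketch, using this section's example figures rather than live pricing:

```python
# Daily cost of an idle heartbeat that resends the system prompt
# on every poll. Prices are the article's March 2026 examples.

def daily_heartbeat_cost(prompt_tokens, price_per_million_input, interval_minutes):
    """Cost per day of heartbeat polls at a fixed interval."""
    cost_per_call = prompt_tokens * price_per_million_input / 1_000_000
    calls_per_day = 24 * 60 // interval_minutes
    return cost_per_call * calls_per_day

# 8,000-token system prompt at $3/M input, firing every 5 minutes:
print(f"${daily_heartbeat_cost(8_000, 3.00, 5):.2f}/day")   # $6.91/day
# Same prompt, every 30 minutes:
print(f"${daily_heartbeat_cost(8_000, 3.00, 30):.2f}/day")  # $1.15/day
```

Note that this counts input tokens only; the heartbeat's short replies add a little on top at the output rate.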
Cron jobs. Every scheduled task that fires triggers an API call. The model it uses depends on how the cron job is configured. If it is not specified, it falls back to your agent default.
Subagents and pipeline tasks. A subagent is a separate agent instance your main agent can spawn to handle a specific task in parallel. When your agent spawns a subagent, that subagent makes its own API calls. These show up in your provider dashboard under whatever model the subagent uses, not attributed to the parent task.
Memory extraction. If you have autoCapture enabled in your memory plugin (autoCapture automatically stores relevant information from conversations as long-term memories), the plugin makes additional LLM calls to extract and store memories from conversations. This depends on your plugin configuration and which model is used for extraction.
Compaction. When your context window fills up, OpenClaw runs a compaction pass using an LLM call. If compaction happens frequently (because context is filling fast) and uses an expensive model, this becomes a recurring cost.
Show me my current cron jobs and heartbeat configuration. How many cron jobs are active, how often do they fire, and what model are they set to use? Also show me whether memory autoCapture is enabled and what LLM it uses for extraction.
These settings live in ~/.openclaw/openclaw.json. Look for agents.defaults.heartbeatModel to see the heartbeat model setting. For cron jobs, find the cron.jobs array. Each job entry has an optional model field that overrides the default; if the field is missing, the job uses the agent default.
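Pulled together, a minimal config with these fields might look like the sketch below. The agents.defaults.heartbeatModel key and the per-job model field are the settings named above; the job names, schedules, and model choices are illustrative, so check your own file rather than copying this:

```json
{
  "agents": {
    "defaults": {
      "model": "anthropic/claude-sonnet-4-6",
      "heartbeatModel": "ollama/llama3.1:8b"
    }
  },
  "cron": {
    "jobs": [
      {
        "name": "morning-summary",
        "schedule": "0 8 * * *",
        "model": "ollama/phi4:latest"
      },
      {
        "name": "hourly-queue-check",
        "schedule": "0 * * * *"
      }
    ]
  }
}
```

The second job has no model field, so it falls back to the agent default, which is exactly how an expensive flagship ends up running an hourly queue check.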
Step 3: Build a task log so you can see what is actually running
Provider dashboards show you spend by model. They do not show you which tasks caused it. For that you need your own log. This is the single most useful thing you can do to get long-term visibility into your spend.
Add this to your task completion protocol: after every significant task, append one line to /workspace/spend-log.md in this format: [date] | [task name] | [model used] | [tokens in] | [tokens out] | [task category]. If you do not know the exact token counts, estimate based on the response size. The goal is to identify patterns, not perfect accounting.
After three to five days, check the log. The expensive tasks will show themselves. The log will grow over time. Archive or truncate it monthly to keep it manageable. You do not need historical granularity beyond a few weeks for cost diagnosis. One or two recurring tasks almost always account for a disproportionate share of total spend.
What you are looking for in the log:
- High-frequency low-value tasks. If something appears dozens of times per day and each entry has a non-trivial token count, that task is a routing problem. It should be on a local or cheap model.
- Unexpectedly large single tasks. A research task or pipeline run that consumed 50,000+ tokens is a one-time event, not a systemic problem. But if it recurs, it needs either caching or a cheaper model.
- Tasks that do not appear in the log at all. If your provider dashboard shows spend that the log does not account for, something is making API calls that your logging does not cover. Memory extraction and compaction are the most common culprits.
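If you want to run this analysis yourself rather than asking the agent, a minimal parser for the log format from this step might look like the following. The sample rows are invented:

```python
# Summarize a spend-log.md in the format from Step 3:
# [date] | [task name] | [model used] | [tokens in] | [tokens out] | [category]
from collections import Counter, defaultdict

def summarize_spend_log(lines, large_task_tokens=50_000):
    freq = Counter()                 # how often each task appears
    tokens_by_task = defaultdict(int)  # total tokens per task
    large_tasks = []                 # single runs over the threshold
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 6:
            continue  # skip headers or malformed rows
        date, task, model, tok_in, tok_out, category = parts
        total = int(tok_in) + int(tok_out)
        freq[task] += 1
        tokens_by_task[task] += total
        if total >= large_task_tokens:
            large_tasks.append((date, task, total))
    return freq, dict(tokens_by_task), large_tasks

log = [
    "2026-03-02 | heartbeat | claude-sonnet-4-6 | 8000 | 50 | background",
    "2026-03-02 | heartbeat | claude-sonnet-4-6 | 8000 | 50 | background",
    "2026-03-02 | research-run | claude-sonnet-4-6 | 42000 | 9000 | research",
]
freq, totals, large = summarize_spend_log(log)
print(freq.most_common(1))  # the high-frequency task
print(large)                # runs that crossed the 50k-token threshold
```

The three return values map directly to the three patterns above: freq surfaces high-frequency tasks, totals shows where the tokens went, and large flags unexpectedly big single runs.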
Read /workspace/spend-log.md if it exists. Summarize total entries by task category and by model. Identify the top 3 tasks by estimated token usage. If the file does not exist yet, tell me.
Step 4: The four root causes, and how to diagnose each one
Once you have your provider dashboard data and a few days of task log, you can match the spend pattern to one of four root causes. Each one has a different fix.
Root cause 1: Wrong default model
Symptoms: spend is evenly distributed across all types of tasks. No single task category dominates. The bill tracks directly with how active you were that day.
Diagnosis: your default model is your most expensive model, and it is handling everything, including tasks that do not need it.
Fix: switch your default to DeepSeek V3 or equivalent. As of March 2026, DeepSeek V3 costs approximately $0.27 per million input tokens versus $3 per million for Claude Sonnet 4. For tasks that do not require complex reasoning or long context, the output quality is comparable.
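The price gap in concrete terms, using the article's example rates (input tokens only; real bills also include output tokens at different per-token prices):

```python
# Per-input-token comparison at the article's March 2026 example rates.
sonnet_per_million = 3.00
deepseek_per_million = 0.27

reduction = 1 - deepseek_per_million / sonnet_per_million
print(f"{reduction:.0%} cheaper per input token")  # 91% cheaper

# A routine 5,000-token task, repeated 20 times a day:
daily_tokens = 5_000 * 20
print(f"Sonnet:   ${daily_tokens * sonnet_per_million / 1e6:.2f}/day")
print(f"DeepSeek: ${daily_tokens * deepseek_per_million / 1e6:.2f}/day")
```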
Check my current default model setting. If it is set to an Anthropic or OpenAI flagship model, change it to deepseek/deepseek-chat and restart. Keep the flagship model as a fallback. Tell me what the config looks like before and after.
Root cause 2: Heartbeat running on a paid model
Symptoms: steady background spend even on days when you barely used OpenClaw. The spend does not drop to zero overnight.
Diagnosis: your heartbeat is firing on a schedule and hitting a paid API model instead of a local one.
Fix: set your heartbeat model to a local Ollama model. Heartbeat checks only need to read HEARTBEAT.md and decide if anything needs doing. A local llama3.1:8b handles this at zero cost.
What model is currently set for heartbeat tasks? If it is not set to a local Ollama model, change it to ollama/llama3.1:8b. Show me the config change.
Root cause 3: Cron jobs on expensive models
Symptoms: spend spikes at predictable times (every hour, every morning, every Sunday) even when you have not done anything. The provider dashboard shows activity at times you were not using OpenClaw.
Diagnosis: one or more cron jobs are firing on a schedule, and they are using your expensive default model instead of a cheaper one.
Fix: audit all active cron jobs and set an explicit model for each one. Simple cron tasks (queue checks, summaries, status reports) do not need flagship models.
List all my active cron jobs. For each one, show me the schedule, what it does, and what model it uses. Flag any that are using a flagship paid model for tasks that do not require it.
Root cause 4: Memory extraction or compaction overhead
Symptoms: spend does not correlate directly with your activity. You send a few messages, but the provider dashboard shows significantly more tokens than those messages would account for. Or spend creeps up over weeks without a clear trigger.
Diagnosis: background processes (memory extraction, compaction) are making additional LLM calls that you are not counting.
Fix: check which model your memory plugin uses for extraction. If it is a flagship model, switch it to a local model or cheaper API model. For compaction, check compaction.model in your config.
Check my memory plugin configuration. What model is set for extraction? Also check compaction.model in my config. If either is set to a flagship paid model, what would the change look like to move them to a cheaper option?
Step 5: Fix the most common causes right now
Based on the diagnosis above, here are the three fixes that reduce spend most in the shortest time. Apply them in this order.
Fix 1: Route heartbeat to a local model. This is the fastest win. If your heartbeat is hitting a paid model, switching it to Ollama costs you nothing and reduces background spend immediately. The change takes effect on the next heartbeat cycle. If you do not have Ollama installed, set the heartbeat model to DeepSeek V3 as a minimum. Any routing away from a flagship model for heartbeat checks is a net win.
After making this change, watch your provider dashboard for 24 hours. If you had steady background spend even on idle days, it should drop noticeably. Heartbeat spend is almost always underestimated because it is invisible in the moment and only shows up as a diffuse daily cost.
Fix 2: Change your default model. Switch from a flagship model to DeepSeek V3 or equivalent as your default. Your agent will use the expensive model when you explicitly request it (for complex tasks or when the cheaper model falls short), but everything else routes through the cheaper option automatically.
This is the highest-impact single change for most operators. The spend reduction is immediate and does not require any change to how you use OpenClaw. You just start new sessions. One important note: if you have tasks in your instructions or SOUL.md that specify a model by name, those override the default. Make sure your protocol files do not hard-code an expensive model for routine tasks.
Fix 3: Set explicit models on your cron jobs. Go through every cron job and add an explicit model override for each one. A morning summary does not need Claude Sonnet. A queue processor checking for pending tasks does not need GPT-4. A weekly memory cleanup does not need a flagship model. Set the model to match the task complexity, not to a vague “best available.”
A useful rule of thumb for cron job model selection: if a human intern could do the task in under 2 minutes with access to the relevant files, a local model handles it fine. If the task requires nuanced judgment, synthesis across long context, or complex tool use, use a capable paid model. Most cron jobs fall into the first category. Queue processors, summary tasks, reminder jobs, and status checks are almost always in the first category.
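Applied to the cron.jobs array, the rule of thumb might produce something like this sketch. The per-job model field is the mechanism described in Step 2; the job names, schedules, and model choices are hypothetical:

```json
{
  "cron": {
    "jobs": [
      {
        "name": "queue-check",
        "schedule": "*/15 * * * *",
        "model": "ollama/phi4:latest"
      },
      {
        "name": "morning-summary",
        "schedule": "0 8 * * *",
        "model": "deepseek/deepseek-chat"
      },
      {
        "name": "weekly-research-digest",
        "schedule": "0 9 * * 0",
        "model": "anthropic/claude-sonnet-4-6"
      }
    ]
  }
}
```

Only the weekly research digest, which synthesizes across long context, earns the flagship model; the other two fall under the two-minute-intern rule.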
I want to make three changes: (1) set heartbeat model to ollama/llama3.1:8b, (2) set default model to deepseek/deepseek-chat, (3) review all active cron jobs and add explicit model overrides of ollama/phi4:latest for any that are simple tasks. Show me the proposed config changes before applying them.
Step 6: Consolidate to OpenRouter for a unified spend view
If you are running Anthropic, OpenAI, and DeepSeek separately, you are checking three dashboards and adding them up manually. OpenRouter routes all of them through one interface with a single spend view, a unified activity log, and per-model breakdowns in one place.
This is not required, but if you are managing multiple providers, the visibility improvement is significant. You can see all your model spend in one place, set spending limits across providers, and get a single daily or weekly summary.
Check my current model configuration. Which providers am I currently using directly (Anthropic, OpenAI, DeepSeek)? If I wanted to route all of them through OpenRouter instead, what model strings would need to change and what would the new config look like?
When routing through OpenRouter, model strings gain a provider prefix: openrouter/anthropic/claude-sonnet-4-6 instead of anthropic/claude-sonnet-4-6. If you switch providers without updating model strings, routing breaks: nothing warns you at config time, and your agent only reports an error on the next task. Always verify model strings after a provider change.
Your agent will read your actual config and tell you exactly what needs to change for your specific setup. Do not update model strings manually without reviewing the full list. It is easy to miss one entry in a fallback chain.
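For illustration only (your agent should generate the real diff from your actual config), a consolidated default-plus-fallback entry might look like this. The fallbacks key name is an assumption; the article only refers to a fallback chain, so verify the actual key in your file:

```json
{
  "agents": {
    "defaults": {
      "model": "openrouter/deepseek/deepseek-chat",
      "fallbacks": ["openrouter/anthropic/claude-sonnet-4-6"]
    }
  }
}
```

The point to notice is that every model string in the chain carries the openrouter/ prefix, including the fallback, which is the entry most commonly missed.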
How to verify the fixes worked
After applying the three fixes in Step 5, you need to confirm they took effect. Config changes do not always behave as expected, and there are several common reasons a change does not reduce spend as much as anticipated.
Check 1: Confirm the new default model is active. Start a fresh session (use /new), then ask your agent: “What model are you currently using?” It should report DeepSeek V3 or whatever cheap model you configured. If it still reports your flagship model, the config change did not take effect or the new session is not picking it up.
What model are you currently running on? What is set as the default model in my config right now?
Check 2: Confirm heartbeat model changed. Wait for a heartbeat cycle to fire (or trigger one manually), then check your provider dashboard. If the heartbeat is now routing to Ollama, you should see no new API calls during idle periods. If you still see calls, the heartbeat model setting did not apply, or the heartbeat is not configured to use the override.
Check 3: Compare spend after 24 hours. Wait one full day after making the changes, then compare your provider dashboard spend to the same day the previous week. The reduction should be visible in the per-model breakdown. Heartbeat and idle spend will appear as a drop in Anthropic or OpenAI calls. Task spend will shift to the cheaper model’s line item.
Step 7: Set up a spend alert so this does not happen again
Every provider listed above has some form of spend alert or usage cap. These exist for exactly this situation. Set them now, before the next bill.
Anthropic: Go to console.anthropic.com/settings/limits. You can set a monthly spend cap and email alerts at thresholds you choose.
OpenAI: Go to platform.openai.com/account/limits. Set a monthly limit and configure usage alerts.
OpenRouter: Go to openrouter.ai/settings/limits. Set a credit limit and alerts per model or overall.
DeepSeek: As of March 2026, DeepSeek’s API console at platform.deepseek.com does not have a self-serve spend cap. Monitor usage manually or set a calendar reminder to check weekly.
Set up a weekly cron job using ollama/llama3.1:8b as the model. Fire every Sunday at 9am. Task: remind me to check my API spend for the week. The reminder should include: check provider dashboard for weekly spend, check spend-log.md for any anomalies this week, and compare to previous week’s spend.
Beyond provider alerts, you can also build a soft cap inside OpenClaw itself. This does not stop API calls that are already in flight, but it prevents your agent from starting new tasks once a threshold is reached.
Add a rule to your operating protocol: if today’s estimated spend in the task log exceeds $5, pause all non-urgent tasks and notify me via Telegram before starting anything new. The threshold resets at midnight.
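The soft-cap check your agent would run can be sketched as ordinary code over the Step 3 log format. This is not an OpenClaw feature: the prices are the article's examples, and for simplicity it applies the input rate to all tokens, so it is an estimate, not accounting:

```python
# Soft cap sketch: estimate today's spend from spend-log.md entries
# and decide whether to pause non-urgent tasks past a threshold.
from datetime import date

# Example rates from the article (input price applied to all tokens).
PRICE_PER_MILLION = {"claude-sonnet-4-6": 3.00, "deepseek-chat": 0.27}

def todays_estimated_spend(lines, today=None):
    today = today or date.today().isoformat()
    total = 0.0
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 6 or parts[0] != today:
            continue
        _, _, model, tok_in, tok_out, _ = parts
        rate = PRICE_PER_MILLION.get(model, 3.00)  # assume flagship if unknown
        total += (int(tok_in) + int(tok_out)) * rate / 1_000_000
    return total

def should_pause(lines, threshold=5.00, today=None):
    return todays_estimated_spend(lines, today) > threshold

log = ["2026-03-02 | research-run | claude-sonnet-4-6 | 1500000 | 400000 | research"]
print(should_pause(log, today="2026-03-02"))  # True: ~$5.70 estimated
```

The threshold resets naturally because the check filters by today's date; yesterday's entries stop counting at midnight.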
Complete fix
Cheap Claw
The complete cost reduction playbook. Every lever, ranked by impact. Drop it into your agent and it reads the guide and makes the changes. Operators report 60-80% spend reduction within a week.
Once you have run through these steps, your spend should be visible, attributed, and lower than it was before. The task log tells you what caused the last spike. The provider alerts prevent the next one. The model routing changes fix the systemic leak. Those three things, together, are most of the work.
FAQ
Why can’t I just set a hard spending cap on OpenClaw directly?
OpenClaw does not have a native hard cap tied to provider billing. You can build a soft cap by instructing your agent to track estimated spend and pause after a threshold, but the provider will still charge for any API calls that complete before the agent stops. For a hard cap, set spend limits directly in your provider dashboard. Anthropic and OpenAI both support monthly spend caps with email alerts. DeepSeek does not have a self-serve cap as of March 2026, so manual monitoring is required there.
How do I know which model actually ran for a specific task?
The easiest way is to ask your agent directly after a task completes: “What model did you just use for that?” It will report the model from its own context. For a persistent record, the task log approach in Step 3 above captures this automatically over time. If you use OpenRouter, the activity log at openrouter.ai/activity shows per-call model attribution, which is the most reliable external record.
Will switching to DeepSeek V3 break my existing setup?
Changing your default model only affects new sessions. Existing sessions continue with whatever model they started on. The main risks are: tasks that rely on very long context windows (DeepSeek V3 supports 64k context as of March 2026, versus 200k for Claude Sonnet 4), and tasks that use tool calling heavily (DeepSeek V3 handles tool use well but behaves differently on some edge cases). Test on a fresh session before committing, and keep your flagship model as a fallback in the config.
How often should I check my spend log?
Daily for the first two weeks after setting it up, then weekly once you have a baseline. You are looking for tasks costing significantly more than their average, or new tasks that did not exist in your baseline. Anomalies are easier to spot when you know what normal looks like. The Sunday reminder cron job in Step 7 automates this check so you do not have to remember.
My provider dashboard shows spend but my agent says it did not do anything. What caused it?
Three common causes. First, heartbeat tasks that fired while you were away and that your agent considers too routine to mention. Second, memory extraction: if autoCapture is enabled, the memory plugin makes its own LLM calls that the agent may not report. Third, compaction: if the context window filled and a compaction pass ran, that is an LLM call the agent does not narrate. Use the Step 2 diagnostic prompt to check all three.
Can I get spend data by task from my provider without building my own log?
Not directly from the provider. None of the major API providers (Anthropic, OpenAI, DeepSeek, OpenRouter) expose task-level attribution. They only see API calls, not what caused them. OpenRouter gets closest with per-call logs at openrouter.ai/activity, but those show timestamps and models, not task context. The task log you build yourself is the only way to map spend to the actual work that caused it.
What is a realistic spend reduction after making these changes?
Switching from Claude Sonnet as default to DeepSeek V3 as default typically reduces per-token cost by roughly 90% for the tasks that move. Routing heartbeats to Ollama eliminates that cost entirely. Operators who make both changes typically see 50-70% total spend reduction within a week. The remaining spend comes from the tasks that genuinely need the expensive model, which is a much smaller proportion than you expect.
If I switch to OpenRouter, do I lose direct access to the providers?
No. You can keep direct provider API keys active and add OpenRouter in parallel. OpenClaw supports multiple providers in the config simultaneously. The typical approach is to route your regular tasks through OpenRouter for unified visibility, while keeping direct API keys as a fallback in case OpenRouter has an outage. OpenRouter itself routes to the same underlying provider APIs, so latency is marginally higher but quality is identical.
How do I stop a task that is running and costing money right now?
Send a message to your agent with a direct stop instruction: “Stop what you are doing immediately. Do not start any new tasks. Confirm you have stopped.” For a running cron job, use the cron management tool to disable it. For a subagent that has gotten into a loop, you can kill it from the session management interface. If the agent is unresponsive, restarting the OpenClaw gateway stops all active sessions immediately, but you will lose any unsaved context from those sessions.
Does it help to reduce the system prompt size to lower costs?
Yes, significantly. Your system prompt is included in every API call as part of the input token count. A 10,000-token system prompt on a heartbeat call that fires every 5 minutes adds up to millions of tokens per month just for the system prompt alone. Reducing system prompt size is one of the highest-leverage cost levers available. See How to reduce system prompt size without losing context for the full walkthrough.
Should I worry about token count differences between providers?
Yes. Different providers use different tokenization schemes, so the same text can produce different token counts across Anthropic, OpenAI, and DeepSeek. The differences are usually 5-15% for English text. This means comparing token counts across providers is approximate, not exact. For budgeting purposes, use per-dollar cost as your comparison metric rather than raw token count, since each provider has different pricing per token anyway.
My bill spiked after I installed a new plugin. Can plugins make API calls without my agent knowing?
Yes. Plugins with their own LLM integration (memory plugins, summarization plugins, classification plugins) can make API calls through their own configured LLM, separate from what your agent reports. These calls show up in your provider dashboard under whatever model the plugin uses, which is separate from your agent’s model. Check your plugin configurations for any LLM settings, and verify they are pointed at cheap models rather than flagship ones. The ClawHub crisis of early 2026 involved malicious plugins making unexpected API calls as part of exfiltration attempts, which also shows up in spend anomalies.
Go deeper
OpenClaw uses your most expensive model for everything, even simple tasks
Once you know which model is causing the spend, here is how to stop routing simple tasks through it.
Setting spend limits so your agent stops at night
Agents that run while you sleep can rack up bills with no kill switch. How to build one.
I turned on prompt caching and my bill did not change at all
Why caching has no effect after you enable it and what needs to change for it to actually work.
