My OpenClaw responses got noticeably worse after I switched to a cheaper model

You switched to a cheaper model to cut costs, and OpenClaw's response quality dropped. Now the responses are shorter, less accurate, or the agent keeps failing mid-task. This article explains exactly what changed and how to get acceptable quality from a cheaper model without switching back.

Before You Start

  • You switched to a cheaper or local model and the agent’s quality dropped
  • You want to understand what specifically degraded and why
  • You want to keep costs low without accepting broken behavior

TL;DR

Cheaper OpenClaw models are not uniformly worse; they are worse at specific things: following complex multi-step instructions, reliable tool use, long context retention, and structured output. If any of those are core to your workflow, a cheaper model will fail there. The fix is not switching back. It is routing to the cheaper model only the tasks it can actually handle, and keeping the expensive model for the tasks that require it.

Reading time: 10 minutes

What actually degraded and why

Model capability is not a single dimension. A cheaper model is not just a slower version of a frontier model – it is a different tool with a different capability profile. Understanding where cheaper models fall short tells you exactly which tasks to keep routing to them and which to move back.

Tool use reliability

This is the most common failure mode when switching to a cheaper or smaller model. Tool use requires the model to output structured JSON that matches a specific schema, interpret the result, and decide on the next action. Frontier models (Claude Sonnet, GPT-4, DeepSeek Chat) are trained extensively on tool use and follow the schema reliably. Smaller local models in the 7B-8B parameter range have much weaker tool use. They produce malformed JSON, call the wrong tool, or fail to recognize when a tool call is needed at all.
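
The failure is easy to reproduce in isolation. This sketch shows why malformed output breaks the agent loop: a tool call only works if it parses as JSON and matches the expected fields. The schema and example payloads here are invented for the demo, not OpenClaw's actual internals.

```python
import json

# Minimal illustrative tool schema: what a gateway might expect back from the model.
EXPECTED_TOOLS = {"read_file": {"path"}}

def validate_tool_call(raw: str):
    """Return (ok, reason) for a model's raw tool-call output."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "malformed JSON"
    tool = call.get("tool")
    if tool not in EXPECTED_TOOLS:
        return False, f"unexpected tool name: {tool!r}"
    missing = EXPECTED_TOOLS[tool] - set(call.get("args", {}))
    if missing:
        return False, f"missing args: {sorted(missing)}"
    return True, "ok"

# A frontier model typically emits the first; small models often emit the others.
print(validate_tool_call('{"tool": "read_file", "args": {"path": "SOUL.md"}}'))  # (True, 'ok')
print(validate_tool_call('{"tool": "read_file", "args": {}'))                    # malformed JSON
print(validate_tool_call('{"tool": "open_file", "args": {"path": "SOUL.md"}}'))  # unexpected tool name
```

All three failure shapes in this sketch (truncated JSON, wrong tool name, missing arguments) show up in real gateway logs as the tool call errors described below.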

In OpenClaw, bad tool use looks like:

  • Tasks that worked before now fail silently or produce incomplete results
  • The agent says it did something but nothing actually happened
  • Tool call errors in the gateway logs (malformed JSON, unexpected tool name)
  • The agent gets stuck in a loop attempting the same tool call repeatedly
  • Commands that require multiple tool calls never complete all steps

Run this simple tool use test: read the file SOUL.md in my workspace and tell me the first line. Then check the gateway logs for any tool call errors in the last 10 minutes. Report both the file content and any errors you see.

If the model completes that test correctly, basic file tool use is working. If it fails or produces errors, tool use reliability is the problem.

Instruction following on complex prompts

Frontier models follow long, multi-part instructions reliably. They read a system prompt with 15 distinct behavioral rules and apply all 15. Smaller models drift. They follow the first few instructions, lose track of the rest, and produce responses that violate rules stated clearly in the system prompt.

If your AGENTS.md or SOUL.md is long and complex, a smaller model will ignore large portions of it. The responses feel generic and off-brand because the model is not actually following the persona, tone, and behavior rules you set.

Long context retention

In a long conversation, cheaper models lose track of context faster. They forget what was decided 20 turns ago, repeat information already established, and stop following earlier instructions that a frontier model would still respect. This is a fundamental architectural difference: smaller models have lower effective context utilization even within the same token window.

Structured output consistency

If your workflow depends on the agent producing consistently structured output (JSON reports, formatted lists, specific markdown patterns), cheaper models produce valid output less reliably. They add extra text, change the format, or drop required fields. Any pipeline step that parses agent output is vulnerable to this.
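
One cheap mitigation on the parsing side, sketched here with invented example strings: instead of calling `json.loads` on the raw response, pull out the first balanced JSON object and tolerate the extra prose cheaper models wrap around it. This is naive about braces inside string values, but it rescues the common "Sure! Here is..." failure.

```python
import json

def extract_first_json(text: str):
    """Best-effort: parse the first {...} object embedded in model output."""
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # balanced but not valid JSON; try the next "{"
        start = text.find("{", start + 1)
    return None

# Cheaper models often wrap the payload in prose like this:
chatty = 'Sure! Here is the report: {"status": "ok", "items": 3} Hope that helps.'
print(extract_first_json(chatty))  # {'status': 'ok', 'items': 3}
```

This does not fix the model; it just makes the pipeline step that parses agent output fail less often while you decide whether to route the task back to a stronger model.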

Reasoning depth

For tasks that require multi-step reasoning (debugging a configuration problem, analyzing trade-offs, synthesizing research), cheaper models produce shallow analysis. They identify the obvious answer quickly and stop. They miss second-order effects and fail to catch logical gaps that a frontier model would flag.

How to diagnose exactly what broke

Before changing anything, confirm which category of degradation you are dealing with. The fix is different for each one.

I want to diagnose why my response quality dropped after switching models. Check the gateway logs for the last 24 hours and count: (1) tool call errors or malformed JSON events, (2) any timeout events, (3) any model fallback events. Then check my current model config and tell me which model is currently set as primary, which tasks it handles, and what the fallback chain is.

The tool use test

Ask the agent to perform a multi-step task that requires two or more tool calls in sequence: read a file, then write a summary of it to a different file. If this fails or produces errors, tool use is the problem. The fix is routing tool-heavy tasks back to a model with reliable tool use.

The instruction following test

Ask the agent to respond to a simple question but with a very specific constraint from your system prompt: use a particular format, avoid a particular phrase, or follow a specific protocol. If the model ignores the constraint, instruction following is degraded.

The context retention test

In an ongoing conversation, state a fact early (“my server IP is 10.0.0.1”), then ask about it 10 turns later without repeating it. If the model cannot recall it or gets it wrong, context retention is degraded.

The reasoning test

Present a simple debugging scenario: “My cron job is configured to run at 9am but it fired at 10am. What could cause this?” A frontier model will identify timezone mismatch, DST, and server clock drift. A degraded model will give one generic answer and stop. If the response is shallow, reasoning depth is the problem.

OpenClaw model routing: the right fix

The correct response to quality degradation is not switching back to the expensive model for everything. It is identifying which tasks broke and routing only those back. This gets you most of the cost savings while restoring quality where it matters.

OpenClaw supports per-task model routing through the model override system. You can set a default cheap model and override it for specific high-stakes tasks.

Show me my current model routing config. Which model is set as default? Are there any per-task or per-context model overrides set? I want to understand the full routing picture before making changes. Do not apply any changes yet.

Tasks cheap models handle well

  • Simple question and answer from known context in the workspace
  • File reads and summaries where the summary just needs to be accurate, not structured
  • Status checks (is the service running, what is the current config value)
  • Heartbeats (checking HEARTBEAT.md and returning HEARTBEAT_OK)
  • Drafting a first version of a document that will be edited later
  • Categorizing or tagging items from a list
  • Basic math and unit conversions
  • Retrieving a specific value from a structured file

Tasks that need a frontier model

  • Any task with 3 or more sequential tool calls
  • Tasks where the output format must match a schema precisely
  • Debugging tasks that require reasoning about cause and effect
  • Tasks that depend on complex persona or behavioral rules from the system prompt
  • Research synthesis (combining multiple sources into a coherent answer)
  • Pipeline tasks where a downstream step parses the output
  • Any task where a wrong answer has real consequences (sending a message, making a config change)

Adapting your prompts for cheaper models

If you want to keep using a cheaper model for a task that is partially degraded, the most effective lever is simplifying the prompt. Cheaper models follow simple, direct instructions much better than long complex ones.

Shorten the system prompt

Long system prompts with many behavioral rules are processed poorly by smaller models. If your AGENTS.md, SOUL.md, and injected workspace files add up to 10,000 tokens, a 7B model is effectively ignoring a large portion of it. Consider maintaining a shorter system prompt variant for cheaper model sessions.

Count the total token size of my current system prompt including all injected workspace files (AGENTS.md, SOUL.md, TOOLS.md, MEMORY.md, and any other files loaded at startup). If the total is above 6,000 tokens, list the three largest files and suggest what could be trimmed or moved to on-demand loading without losing essential behavior.
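
If you would rather measure outside the agent, a rough local estimate works too. This sketch assumes the usual 4-characters-per-token approximation for English prose (not an exact tokenizer count) and uses example file names; adjust the list to whatever your setup actually injects.

```python
from pathlib import Path

# Rough heuristic: ~4 characters per token for English prose.
CHARS_PER_TOKEN = 4

def estimate_tokens(path: Path) -> int:
    return len(path.read_text(encoding="utf-8", errors="replace")) // CHARS_PER_TOKEN

# Example workspace files; adjust to whatever your setup injects at startup.
files = [Path(p) for p in ["AGENTS.md", "SOUL.md", "TOOLS.md", "MEMORY.md"]]
sizes = {p.name: estimate_tokens(p) for p in files if p.exists()}
total = sum(sizes.values())

for name, tokens in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{name:12} ~{tokens} tokens")
print(f"total ~{total} tokens" + ("  (over the ~6,000 budget)" if total > 6000 else ""))
```

Run it from the workspace root. If the total lands well above 6,000 tokens, the largest files at the top of the output are the trimming candidates.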

One instruction per message

Cheaper models handle one clear task per message much better than multi-part requests. Instead of “read the file, summarize it, then write the summary to output.md,” send three separate messages. The model completes each step reliably instead of losing track of the sequence.

Explicit output format

Do not rely on the model inferring the format from context. State it directly: “Respond with a single sentence. Do not include any other text.” Cheaper models respect explicit format constraints far better than implicit ones.

Fewer tool calls per turn

Break tool-heavy tasks into individual steps. If you need the agent to read three files and cross-reference them, ask it to read one at a time and report back. Fewer simultaneous tool calls per turn reduces the chance of a malformed call or a missed step.

Where local models hit their ceiling

Local Ollama models in the 7B-8B parameter range have hard limits that prompt engineering cannot work around. Knowing them saves debugging time.

Tool use above 2-3 calls per turn

Most 7B and 8B models can handle 1-2 tool calls per turn with reasonable reliability. Above that, failure rates climb sharply. This is a training data limitation, not a configuration issue. You cannot prompt-engineer a 7B model into reliably orchestrating 5 simultaneous tool calls.
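
The sharp climb is just compounding probabilities: a chain succeeds only if every link does. The per-call reliability rates below are assumed for illustration, not measured benchmarks.

```python
def chain_success(per_call: float, n_calls: int) -> float:
    """Probability that n sequential tool calls all succeed."""
    return per_call ** n_calls

# Assumed per-call reliability: 90% for a small local model, 99% for a frontier model.
for n in (1, 2, 3, 5):
    print(f"{n} calls: local {chain_success(0.90, n):.0%}  frontier {chain_success(0.99, n):.0%}")
```

At the assumed 90 percent per-call rate, a 3-call chain completes only about 73 percent of the time and a 5-call chain about 59 percent, which is why multi-call tasks are the first thing to route back to a stronger model.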

Context beyond 8,000 tokens

Even if your local model supports a 32,000 token context window, effective utilization falls sharply past 8,000 tokens in practice. Instructions stated at token 1,000 are often not followed by token 10,000. This is known behavior in smaller models and is not fixable via configuration.

JSON schema adherence

Smaller models produce valid JSON less reliably when the schema has nested objects or more than 5 fields. If your workflow requires precise JSON output, test it explicitly before relying on it. Some models support structured output modes that improve this, but not all local models support the feature.

Persona consistency across a long conversation

A local model will follow tone and persona instructions in the first few turns of a conversation, then drift toward its default behavior as the conversation lengthens. Frontier models maintain persona consistency much longer. If persona consistency matters to your use case, a local model is not the right choice for interactive conversation.

The false economy of switching everything

Switching your primary model to a local 7B model for all tasks often costs more than it saves. When a cheap model fails a task that requires tool use, the agent typically retries. Each retry is another model call. A task that a frontier model completes in one turn can cost 4-6 local model calls when the local model keeps failing and retrying. At that point you have degraded quality AND higher latency with no cost saving. This is why selective routing outperforms full replacement.

Building a hybrid routing setup that works

The goal is a routing config where the cheap model handles the high-volume, low-stakes tasks (heartbeats, simple lookups, status checks, drafts) and the frontier model handles the low-volume, high-stakes tasks (tool chains, pipeline work, important decisions). This is how you cut your API bill by 60-80 percent without degrading the tasks that matter.

Based on my usage over the last week, estimate what percentage of my API calls are heartbeats, status checks, or simple lookups versus complex tool-use tasks or pipeline work. Then suggest a routing split: which tasks should go to the cheap model and which should stay on the frontier model. Give me the specific config changes needed to implement this split. Show the config but do not apply it yet.

DeepSeek Chat as the middle ground for local model OpenClaw setups

For most OpenClaw operators, DeepSeek Chat (deepseek/deepseek-chat) hits the best balance between cost and capability. At roughly a tenth the price of Claude Sonnet, it handles complex tool use reliably, follows long system prompts accurately, and maintains persona consistency across normal conversation lengths. If you switched from Sonnet to a local 7B model and quality fell apart, switching to DeepSeek Chat instead often restores quality while keeping roughly 90 percent of the cost savings.

The three-tier approach

The most cost-effective OpenClaw setups use three tiers:

  1. Local model (free): Heartbeats, simple status checks, file reads, cron jobs that just move files or check conditions. Ollama llama3.1:8b or phi4.
  2. Cheap API model (~$0.30/M tokens): Interactive conversation, moderate tool use, most cron job content generation. DeepSeek Chat.
  3. Frontier model (~$3/M tokens): Complex multi-tool pipelines, high-stakes decisions, anything where a wrong answer costs real time or money. Claude Sonnet or equivalent.

Most operators who implement three-tier routing end up spending 80-85 percent less than they did when everything ran on the frontier model, while maintaining acceptable quality on all but the most demanding tasks.
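
The routing decision itself is simple enough to express as a function. This is an illustrative sketch, not OpenClaw configuration; the task attributes are invented for the demo, and the model names mirror the tiers above.

```python
from dataclasses import dataclass

@dataclass
class Task:
    tool_calls: int    # expected sequential tool calls
    high_stakes: bool  # sends messages, changes config, etc.
    interactive: bool  # conversational turn vs background job

def pick_model(task: Task) -> str:
    """Map a task to the cheapest tier that handles it reliably."""
    if task.high_stakes or task.tool_calls >= 4:
        return "anthropic/claude-sonnet-4-6"  # frontier tier
    if task.interactive or task.tool_calls >= 1:
        return "deepseek/deepseek-chat"       # cheap API tier
    return "ollama/llama3.1:8b"               # free local tier

print(pick_model(Task(tool_calls=0, high_stakes=False, interactive=False)))  # heartbeat -> local
print(pick_model(Task(tool_calls=2, high_stakes=False, interactive=True)))   # chat -> deepseek
print(pick_model(Task(tool_calls=5, high_stakes=True, interactive=False)))   # pipeline -> sonnet
```

The design choice worth copying is the order of the checks: stakes and chain length escalate first, so a high-stakes task can never fall through to a cheap tier by accident.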

I want to set up three-tier model routing. Show me the current model config, then draft the config changes needed to route: (1) heartbeats and simple status cron jobs to ollama/llama3.1:8b, (2) general conversation and moderate tool use to deepseek/deepseek-chat, (3) complex pipeline and high-stakes tasks to anthropic/claude-sonnet-4-6. Show the full proposed config. Do not apply yet.

Capability comparison: what each tier can actually do

Before routing tasks, it helps to have a concrete picture of what each model tier handles reliably in OpenClaw. These are practical observations from real operator usage, not benchmark scores.

Local 7B-8B models (llama3.1:8b, mistral 7B)

| Task type | Reliability | Notes |
| --- | --- | --- |
| File read + summarize | High | Works well for short files under 2,000 tokens |
| HEARTBEAT_OK response | High | Simple conditional check, minimal token load |
| Status checks | High | Single tool call, known output format |
| Single tool call tasks | Medium | Reliable for read/write, less so for exec |
| 2-3 chained tool calls | Low | Failure rate climbs significantly |
| Complex system prompt follow | Low | Drifts on long prompts above 4,000 tokens |
| JSON schema output | Low | Unreliable on nested schemas |
| Multi-step reasoning | Low | Shallow analysis, misses second-order effects |

Local 14B models (phi4:latest, mixtral 8x7B)

| Task type | Reliability | Notes |
| --- | --- | --- |
| File read + summarize | High | Handles files up to 6,000 tokens well |
| Memory extraction | High | Produces consistent extraction output |
| 2-3 chained tool calls | Medium | Acceptable for simple read-write chains |
| Moderate system prompt follow | Medium | Up to ~6,000 tokens, then drift increases |
| Draft content generation | Medium-high | Good drafts, needs editing for consistency |
| Complex tool chains (4+) | Low | Not recommended for production pipelines |
| Precise JSON schema output | Medium | Better than 7B, still unreliable on complex schemas |

DeepSeek Chat (API, ~$0.30/M tokens)

| Task type | Reliability | Notes |
| --- | --- | --- |
| All standard tool use | High | Comparable to Claude Sonnet for most tool tasks |
| Long system prompt follow | High | Handles 15,000+ token system prompts reliably |
| Complex pipelines (4-6 tool calls) | High | Reliable for most production workflows |
| Persona consistency | High | Maintains persona across long conversations |
| JSON schema output | High | Matches complex schemas consistently |
| Multi-agent orchestration | Medium | Works for most cases; complex ones need Sonnet |
| Very long context (80k+ tokens) | Medium | Sonnet preferred for very long context tasks |

Configuring a reliable fallback chain

OpenClaw supports a fallback chain: if the primary model fails or is unavailable, it falls back to the next model in the list. A well-configured fallback chain means you keep running if one provider goes down, and you can use it as a cost control mechanism too.

Show me my current model fallback chain configuration. How many fallback models are configured? What is the order? If my primary model goes down right now, what model would handle the next request and at what cost? Suggest any changes needed to make the fallback chain match my three-tier routing strategy.

A good fallback chain for cost-optimized setups:

  1. Primary: deepseek/deepseek-chat (main workhorse, low cost, high capability)
  2. Fallback 1: openrouter/deepseek/deepseek-chat (if direct DeepSeek API is down)
  3. Fallback 2: anthropic/claude-sonnet-4-6 (if DeepSeek is fully unavailable)
  4. Fallback 3: ollama/phi4:latest (emergency local fallback, no API dependency)

This chain gives you resilience at multiple levels. If DeepSeek goes down (which happens during high-demand periods), you stay on DeepSeek via OpenRouter. If both DeepSeek routes are down, you fall back to Sonnet at higher cost but with full capability. If all API providers are unreachable, the local model keeps the agent functional for basic tasks.
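
The chain behaves like a simple first-success loop. Here is a sketch of the pattern; `call_model` is a stand-in for a real provider call, not an OpenClaw API, and the simulated outage is hard-coded for the demo.

```python
FALLBACK_CHAIN = [
    "deepseek/deepseek-chat",
    "openrouter/deepseek/deepseek-chat",
    "anthropic/claude-sonnet-4-6",
    "ollama/phi4:latest",
]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real provider call; here only the local model 'works'."""
    if model.startswith("ollama/"):
        return f"(answered by {model})"
    raise ConnectionError(f"{model} unavailable")  # simulate API providers down

def run_with_fallback(prompt: str, chain=FALLBACK_CHAIN) -> str:
    errors = []
    for model in chain:
        try:
            return call_model(model, prompt)
        except ConnectionError as exc:
            errors.append(str(exc))  # record the failure and try the next tier
    raise RuntimeError("all models in fallback chain failed: " + "; ".join(errors))

print(run_with_fallback("status check"))  # (answered by ollama/phi4:latest)
```

With every API tier simulated as down, the request still completes on the local emergency model, which is exactly the resilience property the chain above is designed for.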

Verifying the fix actually worked

After making any routing or prompt change, run the same diagnostic tests you used to identify the problem. Do not assume the fix worked because the config change was applied successfully. Test the specific failure you identified.

Run a quality verification test after my recent model routing changes. Test 1: use a tool to read the file AGENTS.md and report the first heading you find. Test 2: follow this exact format requirement and respond with only a JSON object containing two fields, “status” and “model_used”, with their current values. Test 3: what decision did I make in this conversation about model routing? Report pass or fail for each test.

If all three tests pass, your routing fix is working. If any fail, the issue is not resolved and you need to continue diagnosing. A model routing change that shows the right model in the config but still produces bad results usually means the session needs to be restarted (the new model config takes effect on new sessions, not mid-conversation).

Config changes need a new session

Model config changes in openclaw.json take effect when a new session starts. If you changed your primary model and the agent still seems to be using the old one, start a new conversation (/new). The running session was initialized with the old config and will not pick up the change until the session is reset.

When switching back is actually the right call

Selective routing fixes most quality problems. There are three cases where it does not, and switching back to the frontier model for your primary workload is the correct answer.

Your workflow is entirely tool-heavy with no simple tasks

If every interaction involves chained tool calls, structured output, or complex reasoning, there is nothing to route to a cheaper model. The cheap model will fail every task. In this case, the cost optimization path is not model routing but prompt caching, shorter conversations, and reducing unnecessary tool calls. Caching the system prompt alone often cuts costs 40-60 percent on Sonnet without sacrificing any capability.

Review my last 20 conversations and estimate what percentage involved only simple tasks (status checks, file reads, basic questions) versus complex tool chains or pipeline work. If more than 80 percent are complex, recommend cost optimizations that do not involve switching models: prompt caching, context trimming, reducing injected file sizes, or batching tasks.

The quality drop is causing real downstream damage

If the degraded model is making config changes, sending messages, publishing content, or taking any action with real consequences, and it is making mistakes, that is not a tradeoff to live with. The cost of one bad exec command on the wrong file or one garbled Telegram message sent to the wrong person is higher than the cost savings from the cheaper model. Route those tasks back immediately.

You are spending more time debugging the cheaper model than you are saving

This is the honest calculation most operators skip. If you spend 30 minutes debugging why the cheaper model failed a task that the frontier model would have completed in 30 seconds, you have not saved anything. Track actual debugging time for two weeks after switching. If it is more than trivial, the real cost of the switch is higher than the API bill difference.

Quick reference: which model for which task

  • Heartbeats, HEARTBEAT_OK: ollama/llama3.1:8b (free, fast)
  • File reads, status checks, cron triggers: ollama/llama3.1:8b or ollama/phi4:latest
  • Memory extraction and summarization: ollama/phi4:latest
  • Interactive chat, moderate tool use: deepseek/deepseek-chat
  • Content generation, research, analysis: deepseek/deepseek-chat
  • Complex pipelines, 4+ tool chains: deepseek/deepseek-chat or claude-sonnet
  • Multi-agent orchestration, high-stakes decisions: anthropic/claude-sonnet-4-6
  • Long context synthesis (80k+ tokens): anthropic/claude-sonnet-4-6

Cut your API bill without degrading what matters.

Cheap Claw: $17

The complete cost control guide for OpenClaw operators. Covers three-tier model routing, per-task routing config, local model capability limits, prompt adaptation for cheaper models, and the full cost breakdown by feature. Paste-ready agent prompts for every optimization. One purchase, permanent access. No subscription.

Get Cheap Claw

The real cost of an OpenClaw model switch: turns, not tokens

This is where the math on cheaper models breaks down most dramatically. A model that costs 60% less per token but takes 3 turns to complete a task your previous model finished in 1 turn can end up costing 80% more in total, because each turn resends the accumulated context, so token usage grows faster than linearly with turn count. Lower per-token price does not equal lower total cost when the turn count goes up.

Turn count goes up with cheaper models because:

  • The agent needs clarifying questions to understand tasks the frontier model inferred correctly
  • Failed tool calls add recovery turns
  • Incomplete responses require follow-up prompts to get the rest of the answer
  • Planning tasks require more back-and-forth because the model cannot hold the full plan in working context
  • Quality is lower, so you prompt again asking for corrections

Look at my last 20 completed tasks in this session. For each task, count the number of conversation turns it took from my initial request to a completed result. What is the average turns-per-task? How does this compare to my experience before the model switch? Estimate the total tokens spent per task including all turns.

The true cost comparison

To compare models fairly, you need cost-per-completed-task, not cost-per-token. A task that takes one turn on model A at $0.03 per 1,000 tokens might cost $0.03 total. The same task taking three turns on model B at $0.01 per 1,000 tokens might cost $0.09 total. Model B is cheaper per token and more expensive per task.
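
The comparison is worth making concrete. This sketch uses the prices and turn counts from the paragraph above and assumes each turn resends all earlier context; the exact totals depend on that growth assumption, so with 1,000 new tokens per turn model B lands at $0.06 per task, still double model A despite the 3x cheaper per-token price.

```python
def cost_per_task(price_per_1k: float, turns: int, tokens_per_turn: int = 1000) -> float:
    """Total cost when each turn resends all earlier context (1x, 2x, 3x, ...)."""
    total_tokens = sum(turn * tokens_per_turn for turn in range(1, turns + 1))
    return total_tokens / 1000 * price_per_1k

model_a = cost_per_task(price_per_1k=0.03, turns=1)  # frontier: one turn
model_b = cost_per_task(price_per_1k=0.01, turns=3)  # cheap: three turns with retries
print(f"model A: ${model_a:.2f} per task")  # $0.03
print(f"model B: ${model_b:.2f} per task")  # $0.06
```

Swap in your own per-token prices and observed turns-per-task from the audit prompt above to get your actual cost-per-completed-task for each model.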

Based on my last 20 tasks: calculate the average tokens-per-task including all turns. Multiply by the cost per token for my current model to get cost-per-task. Then estimate what the same tasks would have cost on my previous model, assuming it completed each in half the turns. Which model is actually cheaper per completed task?

Memory extraction degradation

If you use a memory plugin (like memory-lancedb-pro), the model handling extraction matters. The extraction model reads a conversation and pulls structured facts from it: entities, preferences, decisions, patterns. A weaker extraction model produces worse memories: vague, incomplete, or incorrectly structured entries that do not surface correctly during recall.

The degradation is slow and invisible at first. Memories are still being written. The agent does not report any errors. But the quality of what is extracted is lower. Over time your memory store fills with generic entries that do not help the agent recall specific facts, and you start getting “I don’t remember that” responses for things you discussed weeks ago.

Check which model is configured for memory extraction in my memory plugin settings. What model is it currently using? Run a test: ask me a specific fact we discussed at least a week ago. Attempt to recall it from memory. Did you find it? Was the recalled entry specific or vague? This tells me whether memory extraction quality has degraded.

The memory extraction model is configured separately from the primary conversation model. You can downgrade the primary model to save on conversation costs while keeping a capable model for extraction. The two do not have to be the same.

Show me the current model settings for: primary conversation model, compaction model, memory extraction model, heartbeat model, and each cron job model. These can all be different. Tell me which ones are currently using the cheaper model I just switched to, and whether each of those tasks actually benefits from the cheaper model or whether quality is more important there.

Compaction quality and the long-session cost

Compaction is a hidden cost multiplier when you switch to a cheaper model. When a session gets long enough to trigger compaction, OpenClaw uses a model to summarize the older conversation. If that model is the cheap one you just switched to, the summaries it produces are lower quality: vaguer, less specific, missing key details that were in the original context.

The result shows up as context amnesia. The agent loses track of decisions made earlier in the session, forgets constraints you stated at the start, or contradicts earlier work. You spend additional turns re-establishing context the agent should already have. Each of those turns costs tokens. Over a long session, this erases the per-token savings from the cheaper model.

Check my compaction configuration. What model is handling compaction? How many tokens are retained after each compaction pass? Has compaction run in this session? If it has, tell me what happened to the context that was compacted: is it available as a summary, or is it gone? If the compaction model is the same cheap model I just switched to for conversation, estimate whether this is causing context loss.

The right way to split the cost

Conversation turns and compaction have different requirements. Interactive conversation with a user who writes clear prompts can work fine on a cheaper model. Compaction needs a model that produces dense, accurate summaries, because its output feeds back into future context. That asymmetry means you can often get away with a cheap model for conversation while keeping a better model for compaction.

Recommended model split for cost-sensitive operators

  • Conversation (primary model): DeepSeek Chat or equivalent mid-tier. Good tool use, good instruction following, low cost per token.
  • Compaction: same mid-tier or one step up. The few cents extra per compaction pass are worth it to avoid context amnesia in long sessions.
  • Memory extraction: mid-tier minimum. Poor extraction compounds: bad entries today hurt recall quality for the next six months.
  • Heartbeats and status checks: free local model. No reasoning needed, no tool use, just read a file and report back.
  • Cron jobs without tool use: free local model.
  • Cron jobs with tool use: mid-tier minimum.
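
Expressed as a single role map, the split looks like this. The role keys are descriptive labels invented for the sketch, not OpenClaw's actual config fields; check your openclaw.json for the real structure before applying anything.

```python
# Illustrative role -> model split; keys are descriptive, not real config fields.
MODEL_BY_ROLE = {
    "conversation":      "deepseek/deepseek-chat",
    "compaction":        "deepseek/deepseek-chat",
    "memory_extraction": "deepseek/deepseek-chat",
    "heartbeat":         "ollama/llama3.1:8b",
    "cron_no_tools":     "ollama/llama3.1:8b",
    "cron_with_tools":   "deepseek/deepseek-chat",
}

def model_for(role: str) -> str:
    # Unlisted roles fall back to the mid-tier model, never to the free local one:
    # defaulting cheap is how quality-sensitive roles get silently degraded.
    return MODEL_BY_ROLE.get(role, "deepseek/deepseek-chat")

print(model_for("heartbeat"))   # ollama/llama3.1:8b
print(model_for("compaction"))  # deepseek/deepseek-chat
```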

Show me the current model for each of these roles in my config: conversation primary, compaction, memory extraction, heartbeat, each cron job. For any role where the current model is likely causing quality degradation (based on the degradation I described), suggest the lowest-cost model upgrade that would fix it. Show me the exact config changes without applying them.

Stop paying more for a cheaper model.

Cheap Claw: $17

The complete cost control guide for OpenClaw operators. Covers the cost-per-task calculation, workload-based model routing, prompt tuning for smaller models, background task separation, and the full model selection framework. Everything as paste-ready agent prompts.

Get Cheap Claw

Quick-reference: symptom to fix

| What degraded | Likely cause | Fix |
| --- | --- | --- |
| Tool calls failing or wrong parameters | Model does not support function calling reliably | Move to a model with documented tool use support (DeepSeek Chat, Llama 3.1 8B+) |
| Agent drops instructions from AGENTS.md | Model cannot hold all instructions simultaneously | Trim system prompt to 10 most critical rules, put important ones first |
| More turns needed to complete tasks | Model lacks planning or reasoning capability | Break tasks into explicit steps in your prompts; or upgrade the conversation model |
| Memory recall getting worse over time | Memory extraction model producing poor-quality entries | Keep extraction model at mid-tier even if conversation model is cheap |
| Context from earlier in session gets lost | Compaction model producing lossy summaries | Keep compaction model at mid-tier; upgrade retained token count |
| Total API bill went up despite cheaper model | More turns per task erasing per-token savings | Calculate cost-per-task, not cost-per-token; find the model with the best task efficiency |

Questions people actually ask about this

Will DeepSeek Chat handle my tool-heavy workflows as well as Claude Sonnet?

For most workflows, yes. DeepSeek Chat has strong tool use for standard OpenClaw operations: file reads and writes, web search, exec commands, and git operations. Where it falls short compared to Sonnet is on very complex multi-agent orchestration, tasks with 6 or more chained tool calls, and situations where the model needs to reason about tool outputs before deciding the next step. For 90 percent of typical OpenClaw use, DeepSeek Chat is a good drop-in replacement at a fraction of the cost.

My local model was working fine on simple tasks. Why did it suddenly start failing?

Check whether your system prompt or injected workspace files grew larger recently. As you add more context to AGENTS.md, SOUL.md, or daily memory files, the total system prompt size increases. When it crosses the local model’s effective context utilization limit (roughly 6,000-8,000 tokens for most 7B models), behavior degrades sharply. Run the system prompt size check above and trim if needed.

Is there a way to test a model before committing to it?

Yes. Set the model override in your config, test it with the four diagnostic scenarios in this article (tool use, instruction following, context retention, reasoning), and evaluate the results before making it permanent. Do not test in production with real tasks – test with the diagnostic prompts first. The full testing guide for local models covers this in more detail.

The agent is giving shorter responses now. Is that a quality issue or just a style difference?

Both are possible. Some cheaper models default to shorter responses as a trained behavior, not because they are incapable of longer ones. Try adding “Respond in full detail. Do not truncate your answer.” to your request. If the response quality is acceptable but just shorter, that is a style issue, not a capability limit. If the response is also less accurate or less structured, that is a capability limit.

I switched to a local model for everything and now heartbeats are taking 30 seconds. Is that normal?

Yes. Local model inference time depends on your hardware. On a typical VPS with a CPU-only setup, a 7B model takes 15-45 seconds per response. A 14B model (like phi4) takes 30-90 seconds. If your heartbeat interval is 5 minutes and each heartbeat takes 45 seconds to process, that is acceptable. If it is causing timeouts or backing up, either use a smaller model (llama3.1:8b is faster than phi4) or increase the heartbeat interval to give the model time to complete.

Can I use different models for different OpenClaw channels?

Not directly by channel, but you can use different models for different session types and cron jobs. Interactive sessions (Discord, Telegram conversations) use the primary model configured in the agent defaults. Isolated cron jobs use whatever model is specified in the cron job payload. This means you can run interactive conversations on DeepSeek Chat while running heartbeats on a local model, effectively creating per-channel routing through session type.

My agent stopped following my SOUL.md instructions after switching to a cheaper model. How do I fix this?

The model is not retaining the SOUL.md instructions across the conversation because the total system prompt is too long for it to follow reliably. Two fixes: (1) Trim SOUL.md to the 5-7 most critical behavioral rules, removing examples and explanations that are there for the frontier model’s context but not needed for instruction execution. (2) Move to DeepSeek Chat, which handles long system prompts reliably at a much lower cost than Sonnet.


Go deeper

  • Cost: Which OpenClaw features cost money and which ones are completely free. The full feature-by-feature cost breakdown with the free path for each one.
  • Cost: OpenClaw keeps hitting rate limits every day. Why rate limits hit daily and how to stop them without upgrading your plan.
  • Model Routing: How to test whether a local model can actually handle your workload. The four tests that tell you whether a local model is ready before you commit to it.