Benchmarks rank models on standardized tests. Your workload is not a standardized test. The model that scores highest on reasoning evals may be overkill for 80% of what you actually run through OpenClaw, and the cheapest model may be perfectly adequate for tasks you are currently paying 10x more to handle. The mismatch between benchmark performance and real-world task requirements is one of the most common reasons operator costs are higher than they need to be. This article covers the axes on which models actually differ, how to categorize your tasks against those axes, and how to build a routing setup that uses the right model for each job without manually managing every call.
TL;DR
- There are three axes that matter for task-model fit: capability (can it do the task?), speed (how fast?), and cost (what does it charge per token?). These do not move together. Cheap models can be fast; expensive models can be slow.
- Most workloads are mixed. Some tasks genuinely need a capable frontier model. Most do not. Routing them all to the same model is the default, and it is expensive.
- The default model setting handles your average case. Explicit overrides handle the exceptions in both directions: up to a stronger model for complex tasks, down to a local model for simple ones.
- Local models cost nothing per token. If you have Ollama running, routing eligible tasks to a local model adds nothing to your API bill. The question is which tasks qualify.
- Audit your actual task mix before changing anything. The right routing depends on what you run, not on general advice about which models are good.
Throughout this article you will see indented blocks like the ones below. Each one is a command you can paste directly into your OpenClaw chat. Your agent will run it and report back. You do not need to open a terminal or edit any files manually.
Model comparisons usually focus on benchmark scores. Those scores compress a lot of nuance into a single number and are often measured on tasks that are harder than what most operators run day to day. For practical routing decisions, three axes matter more than any leaderboard position.
Capability
Can the model do the task correctly and reliably? Capability is not binary. It varies by task type. A model that is excellent at following structured instructions may be poor at multi-step reasoning. A model that scores well on coding benchmarks may struggle with nuanced tone matching for writing tasks. A model that handles long-context retrieval well may be slow and expensive relative to its actual output quality for short tasks.
The question is not “is this model good?” It is “is this model good enough for this specific task type?” Good enough is a lower bar than best, and good enough is usually all you need.
Speed
Speed has two components. Time-to-first-token is how long you wait before any output appears. Tokens-per-second is how fast the output streams after that. Both matter, but differently depending on the task. A background cron job running at 3am does not need fast time-to-first-token. An interactive chat session where you are waiting for a response does.
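Both components can be measured from any streaming response. The sketch below is a generic timing harness, assuming only that your client library exposes the response as an iterable of tokens; the stream interface is a stand-in, not a specific OpenClaw API.

```python
import time

def measure_stream(token_stream):
    """Measure time-to-first-token (TTFT) and tokens-per-second (TPS)
    for any iterable that yields tokens. The iterable is a stand-in
    for whatever streaming interface your client library exposes."""
    start = time.monotonic()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.monotonic()  # first token has arrived
        count += 1
    end = time.monotonic()
    if first is None:
        return None, 0.0  # empty stream: no TTFT, no throughput
    ttft = first - start
    stream_secs = end - first
    tps = count / stream_secs if stream_secs > 0 else float("inf")
    return ttft, tps
```

For an interactive session, TTFT is the number to watch; for a background job, neither number matters much as long as the run completes.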
Larger models are generally slower. This is not a bug. It is physics: more parameters means more computation per token. Frontier models like Claude Opus or GPT-4o are meaningfully slower than mid-tier models like Sonnet or DeepSeek V3, which are meaningfully slower than small fast models like Haiku or local 8B models. Speed matters when the task is interactive. It mostly does not matter for background work.
Cost
Cost is per million tokens, split between input (what you send) and output (what the model generates). The spread across models available in 2026 is enormous. At the high end, Claude Opus 4 costs $15 per million input tokens and $75 per million output tokens. At the low end, DeepSeek V3 costs $0.28 per million input tokens and $1.10 per million output tokens. A local Ollama model costs nothing per token. That is a 50x difference between the cheapest API model and the most expensive frontier model, and an infinite difference between API and local.
The cost difference only matters if the models are interchangeable for your task. If Opus is the only model that can do something correctly, its price is what it is. But if DeepSeek V3 handles a task with the same quality, using Opus for it is paying 50x for no gain.
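The arithmetic is worth doing once for your own typical task shape. This sketch uses the per-million prices quoted above; the 3,000-in / 500-out token counts are an assumed "routine task" profile, so substitute your own numbers.

```python
# Per-million-token prices quoted in this article (USD input, USD output).
PRICES = {
    "claude-opus-4": (15.00, 75.00),
    "deepseek-v3": (0.28, 1.10),
    "ollama-local": (0.00, 0.00),
}

def task_cost(model, input_tokens, output_tokens):
    """Cost in USD for one call at the listed per-million rates."""
    p_in, p_out = PRICES[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# An assumed routine task: 3,000 tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${task_cost(model, 3_000, 500):.5f}")
```

Run this with your own token counts and the gap between tiers stops being abstract: it is the difference between a routine task costing a fraction of a cent and costing effectively nothing.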
Look at my current model routing configuration. What is my default model? What fallbacks are configured? Are there any per-task overrides set? What does a typical session cost in tokens, and what model handles most of that spend?
Before you can route intelligently, you need a working taxonomy of what tasks you actually run. Most OpenClaw workloads break into four categories with different model requirements.
Category 1: Simple and mechanical
Tasks with clear inputs, clear outputs, and no ambiguity. File reads, status checks, format conversions, config lookups, heartbeat pings, queue checks that find nothing and exit, cron job runs that read a file and write a summary. These tasks do not require reasoning. They do not require nuanced judgment. They require following instructions correctly with a short input and a short output.
Model requirement: minimal. A local 7B or 8B model handles these tasks as well as a frontier model, at a fraction of the cost, with lower latency. If you have Ollama running, these tasks should not be hitting an API model at all.
Category 2: Standard and compositional
Tasks that require combining information, making reasonable inferences, writing coherent prose, or handling multi-step instructions with some judgment involved. Research summaries, content drafts, config changes with explanation, data analysis with interpretation, most conversational sessions. These tasks benefit from a capable model but do not require the frontier tier.
Model requirement: mid-tier. DeepSeek V3, Claude Haiku 3.5, and similar models in the $0.25 to $1.00 per million input range handle this category well for most operators. This is where your default model should sit. It covers the majority of your workload at roughly one-tenth the cost of the frontier tier.
Category 3: Complex and reasoning-heavy
Tasks that require multi-step planning, synthesizing large amounts of context, making judgment calls in ambiguous situations, or producing output that needs to be both nuanced and accurate. System design, strategic planning, long-context analysis with citations, multi-agent orchestration, debugging complex issues where the root cause is not obvious. These tasks benefit from the strongest models available.
Model requirement: frontier or reasoning-specialized. Claude Sonnet 4, GPT-4o, or a reasoning model like DeepSeek R1 for tasks requiring deliberate step-by-step logic. These tasks are expensive per run, but they are also the tasks where model quality genuinely changes the output. Do not cut corners here by defaulting to cheaper models. The cost of a wrong answer on a complex task often exceeds the cost of the better model.
Category 4: Local-eligible background work
A subset of Category 1 and 2 tasks that run unattended, are not time-sensitive, and where a slightly lower quality output is acceptable because the task is low-stakes. Heartbeat checks, queue polling on empty queues, log writes, memory cleanup, nightly summaries that you skim rather than act on. These are strong candidates for local model routing.
Model requirement: local only. If Ollama is running on your machine or server, phi4, llama3.1:8b, or qwen2.5-coder:7b handle these tasks at zero API cost. The only tradeoff is speed for the first run after a cold start, and slightly lower quality on tasks at the edge of the model’s capability.
Look at my last week of sessions and cron job runs. Categorize the tasks I ran by type: simple and mechanical, standard and compositional, complex and reasoning-heavy, or local-eligible background work. What percentage of my tasks fall into each category? What model handled each one?
The output will show you where the mismatch is. If 60% of your tasks are simple or local-eligible but they are all hitting your default API model, you have found your biggest cost lever.
Brand New Claw
The complete production configuration guide. Model routing, context sizing, compaction tuning, and the settings that quietly cost you money after you go live.
OpenClaw has three layers where model selection happens. Understanding all three is necessary before you can route intelligently.
The default model
Set in agents.defaults.model in your openclaw.json. This is what handles every task that does not have a more specific override. It is the workhorse. It should be set to the best model that is cheap enough to use for your average task, not your hardest task and not your easiest one. For most operators in 2026, this means something in the DeepSeek V3 / Claude Haiku 3.5 tier.
The fallback chain
Set in agents.defaults.fallbackModels (or equivalent depending on your config version). If the primary model fails, returns an error, or hits a rate limit, OpenClaw tries the next model in the chain. This is a reliability mechanism, not a routing mechanism. It does not select a better model for a harder task. It just catches failures. Your fallback chain should go from your primary to a similarly capable alternative, not jump from a cheap model to an expensive one that fires on every transient failure.
Explicit per-task overrides
The most powerful layer. When your agent is writing its own instructions, reasoning about which model to use for a given task, and calling the right model explicitly, you get true intelligent routing. This requires your AGENTS.md or SOUL.md to include clear model routing rules: which model for which task type, when to escalate, when to step down. The agent reads those rules and applies them on each turn.
Read my AGENTS.md and SOUL.md. Do I have explicit model routing rules? Are those rules specific about which task types get which models? Are there any task types I run regularly that are not covered by the current routing rules?
The goal is a three-tier setup: local for eligible background work, mid-tier API for the majority of tasks, frontier for genuinely hard tasks. Most operators already have the infrastructure for this without realizing it.
Tier 1: Local for background work
If Ollama is installed, you already have a free tier. The question is whether you are using it. Check:
Is Ollama running on this machine? What models are currently available? Which of my cron jobs are currently using an API model that could be switched to a local model without a quality loss?
Cron jobs that check a queue and find nothing, write a log entry, summarize a short file, or send a simple notification are strong candidates for local routing. The quality difference between phi4 and DeepSeek V3 on “check this file for PENDING items and list them” is negligible. The cost difference is the entire API bill for those runs.
Tier 2: Mid-tier as default
Your default model should not be a frontier model. If it is currently set to Claude Sonnet or GPT-4o as the default, every file read, every status check, every heartbeat, and every routine response is going through a $3/M token model. Switch the default to a mid-tier model and add explicit overrides to escalate to the frontier tier when needed.
```json
"agents": {
  "defaults": {
    "model": "deepseek/deepseek-chat",
    "fallbackModels": [
      "openrouter/deepseek/deepseek-chat",
      "anthropic/claude-haiku-3-5"
    ]
  }
}
```
The fallback chain here stays in the mid-tier. It does not escalate to a frontier model on failure. If DeepSeek is down, you fall back to another cheap option, not to Sonnet.
Tier 3: Frontier on explicit escalation
Add routing rules to your AGENTS.md that tell the agent when to escalate. Be specific. Vague rules like “use Sonnet for complex tasks” do not work well in practice because the agent has to judge what “complex” means. Specific rules do:
```markdown
Use anthropic/claude-sonnet-4-6 (alias: sonnet) when:
- The task requires multi-agent orchestration
- Long context with citations (5+ sources to synthesize)
- Complex tool chains (4+ tool calls in a single task)
- System design or strategic planning
- Previous attempt with default model produced wrong output

Use deepseek/deepseek-reasoner (alias: ds-reason) when:
- Multi-step logical reasoning is explicitly required
- Mathematical or algorithmic problem solving
- Debugging where root cause is not obvious

Use ollama/phi4:latest (alias: local-quality) when:
- Background cron jobs with no API requirement
- File reads and status checks
- Summaries of short content
- Subagent tasks that do not need internet access
```
The agent reads these rules at session start and routes accordingly. No manual intervention needed.
Look at my current model routing rules in AGENTS.md. Compare them to the tasks I actually ran last week. Are there gaps? Task types I ran regularly that are not explicitly covered? Draft the specific routing rule additions that would close those gaps.
The model landscape shifts fast. This is the tier structure as of March 2026. Specific model versions change; the tier logic stays the same.
Local tier (free): llama3.1:8b for fast simple tasks, phi4:latest for slightly higher quality local work, qwen2.5-coder:7b for code-focused tasks. All require Ollama running locally. No API key, no per-token cost, no rate limits. Cold start latency on larger models is the main tradeoff.
Mid-tier API: DeepSeek V3 at $0.28/$1.10 per million tokens is the current cost-performance leader for general tasks. Claude Haiku 3.5 is faster with slightly better instruction following at comparable cost. Both handle the majority of standard operator workloads well.
Frontier: Claude Sonnet 4 for complex tool-heavy tasks and long-context work. DeepSeek R1 for reasoning-intensive tasks where step-by-step logic matters. Both are significantly more expensive than mid-tier and should be reserved for tasks that actually need them.
Strategic only: Claude Opus 4 for system design, high-stakes synthesis, and tasks where output quality has direct business consequence. Use sparingly. The per-session cost is real.
Getting routing right is one of the highest-leverage configuration changes you can make. The complete guide walks through the full audit: categorizing your task mix, building the routing rules, testing the routing, and verifying cost reduction without quality loss.
Common questions
How do I know if a cheaper model is producing worse output on my tasks?
Run the same task on both models and compare. For tasks with verifiable outputs (code, config changes, structured data), the comparison is straightforward. For tasks with subjective outputs (writing, summaries), read both versions and judge whether the quality difference matters for your use case. Most operators find that cheaper models underperform on 10 to 20% of their tasks and are equivalent on the rest. Route that 10 to 20% to a stronger model explicitly and let the cheaper model handle everything else.
My agent keeps using the default model instead of escalating. Why?
The routing rules are probably too vague. “Use a stronger model for complex tasks” requires the agent to judge complexity, and it will usually default to the easier interpretation. Make the rules specific: list the exact task types that trigger escalation, with examples. The more concrete the rule, the more reliably the agent applies it.
Should my fallback model be cheaper or more capable than my primary?
Neither, ideally. Your fallback chain is a reliability mechanism, not a quality escalation path. It should be a similarly-capable model at a similar price point, an alternative that can handle the same tasks as your primary in case of outage or rate limiting. If you want a quality escalation path, that belongs in your routing rules, not your fallback chain.
Does using a local model for cron jobs affect output quality?
Depends on the task. For queue checks, log writes, and simple status summaries, a local phi4 or llama3.1:8b produces output that is indistinguishable from a frontier model in practice. For tasks that require synthesis, nuanced judgment, or complex instruction following, the gap is real. Audit each cron job individually: is this task simple enough that the output quality of a local model is sufficient? If yes, switch it. If not, keep it on an API model.
How often should I revisit my model routing setup?
When your task mix changes significantly, when a new model releases that changes the cost-performance tradeoffs, or when your API bill goes up unexpectedly. The model landscape in 2026 is moving fast. A routing setup that was optimal six months ago may not be today. A quarterly review of model pricing and capability against your actual workload is enough to stay on top of it.
From the same series
Why is OpenClaw so slow? It is probably your context window.
Model choice affects speed, but context window size is usually the bigger lever. Start here if latency is the problem.
Why does OpenClaw keep compacting even on short sessions?
If switching to a cheaper model makes compaction fire more often, here is why and how to fix it.
How to reduce system prompt size without losing context
Model cost multiplies your system prompt size on every turn. A leaner prompt makes every routing decision cheaper.
Model capability matrix
This matrix shows which models handle which task categories well. Use it to match your workload to the right model.
| Task category | phi4:latest (local) | DeepSeek V3 | Claude Sonnet | Claude Opus |
|---|---|---|---|---|
| File reading, status checks | ✅ Excellent | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| Structured output (JSON, YAML) | ✅ Good | ✅ Excellent | ✅ Excellent | ✅ Excellent |
| Multi-step reasoning | ⚠️ Moderate | ✅ Good | ✅ Excellent | ✅ Excellent |
| Tool calling (complex chains) | ⚠️ Moderate | ✅ Good | ✅ Excellent | ✅ Excellent |
| Research synthesis | ⚠️ Moderate | ✅ Good | ✅ Excellent | ✅ Excellent |
| Long-form writing | ⚠️ Moderate | ✅ Good | ✅ Excellent | ✅ Excellent |
| Code review and debugging | ⚠️ Moderate | ✅ Good | ✅ Excellent | ✅ Excellent |
| Cost per 1M tokens | $0 | $0.28/$1.10 | $3.00/$15.00 | $15.00/$75.00 |
Map my current tasks to this matrix. For each task category I regularly perform, tell me which model is the most cost-effective that still provides acceptable quality. Show me where I could switch from a more expensive model to a cheaper one without losing quality.
Benchmarking your specific workload
The matrix is a starting point. Your actual tasks may have different requirements. Benchmarking tells you exactly which model works for your specific workload.
Step 1: Collect a representative task sample
Gather 5-10 tasks that represent your typical work. Include a mix of simple, moderate, and complex tasks. For each task, record the expected input and output.
Collect a representative sample of my recent tasks. Categorize them as simple, moderate, or complex based on the matrix above. Show me the distribution: how many tasks fall into each category?
Step 2: Run each task on candidate models
Run each task on the candidate models (phi4:latest, DeepSeek V3, Claude Sonnet). Measure success rate, quality, and time to completion.
Run my five most representative tasks on phi4:latest and DeepSeek V3. For each task, record whether it succeeded, the quality of the output (1-5 scale), and the time to completion. Show me the results side by side.
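If you prefer to run the comparison outside your agent, the harness is a few lines. This is a sketch: `run_task` is a stand-in for however your setup invokes a model, and the success/quality pair it returns matches the success rate and 1-5 quality scale described above.

```python
import time

def benchmark(tasks, models, run_task):
    """Run each task on each candidate model and record the outcome.
    `run_task(model, task)` is a hypothetical hook that should return
    (success: bool, quality: int on a 1-5 scale)."""
    results = []
    for task in tasks:
        for model in models:
            start = time.monotonic()
            success, quality = run_task(model, task)
            elapsed = time.monotonic() - start
            results.append({
                "task": task,
                "model": model,
                "success": success,
                "quality": quality,
                "seconds": round(elapsed, 2),
            })
    return results
```

The output is a flat list of records, which makes the side-by-side comparison in the next step a simple group-by on model.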
Step 3: Calculate cost-quality tradeoff
For each task category, calculate the cost per successful completion on each model. Factor in the quality score to get a cost per quality-adjusted completion.
Calculate the cost-quality tradeoff for my workload. For each task category, show me which model provides the best balance of quality and cost. Recommend a model routing strategy based on the results.
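One reasonable way to define the quality-adjusted figure (this is one possible formula, not a standard metric): normalize the 1-5 quality score to 0-1, sum it over successful runs, and divide total spend by that sum.

```python
def cost_per_quality(runs, price_per_run):
    """Cost per quality-adjusted successful completion.
    `runs` is a list of (success, quality) tuples, quality on the
    article's 1-5 scale; a 5/5 success counts as one full completion."""
    total_cost = price_per_run * len(runs)
    adjusted = sum(q / 5 for ok, q in runs if ok)
    return float("inf") if adjusted == 0 else total_cost / adjusted

# Illustrative numbers: a cheap model that sometimes fails can still
# beat a frontier model that always succeeds.
cheap = cost_per_quality([(True, 4), (True, 3), (False, 1)], 0.001)
frontier = cost_per_quality([(True, 5), (True, 5), (True, 5)], 0.08)
```

A model that fails occasionally but costs 80x less per run usually still wins on this metric, which is exactly the case for routing the retry, not the first attempt, to the stronger model.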
Model routing strategies
Once you know which models handle which tasks, implement routing to use the right model for each job.
Strategy 1: Category-based routing
Route tasks by category: simple tasks to phi4, moderate to DeepSeek V3, complex to Claude Sonnet. This requires categorizing tasks before they run, which can be done with a simple classifier in your agent prompt.
Implement category-based routing for my tasks. Create a simple classifier that reads the task description and assigns it to simple, moderate, or complex. Then route it to the appropriate model. Show me the classifier logic.
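A classifier for this does not need to be sophisticated. The sketch below uses keyword heuristics, which are assumptions to be tuned against your own task descriptions, and checks the complex patterns first so that "debug the status check" escalates rather than routing down.

```python
import re

# Hypothetical keyword heuristics; tune these to your own task mix.
SIMPLE = re.compile(r"\b(read|check|status|list|ping|poll)", re.I)
COMPLEX = re.compile(r"\b(design|plan|debug|synthes|orchestrat)", re.I)

ROUTES = {
    "simple": "ollama/phi4:latest",
    "moderate": "deepseek/deepseek-chat",
    "complex": "anthropic/claude-sonnet-4-6",
}

def classify(description):
    """Assign a task description to a complexity category.
    Complex patterns win ties so ambiguous tasks escalate, not downgrade."""
    if COMPLEX.search(description):
        return "complex"
    if SIMPLE.search(description):
        return "simple"
    return "moderate"

def route(description):
    return ROUTES[classify(description)]
```

Anything the keywords do not catch falls through to "moderate", which is the safe default: it lands on the mid-tier model that should be handling your average case anyway.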
Strategy 2: Fallback chain routing
Start with the cheapest model that might work. If it fails or produces low-quality output, fall back to the next model. This minimizes cost while ensuring tasks eventually complete.
Set up a fallback chain for my tasks: try phi4 first, if it fails or produces low-quality output (based on a simple quality check), retry with DeepSeek V3. Show me the retry logic and quality check.
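The retry logic is a loop over the chain from cheapest to strongest. In this sketch, `run_task` and `quality_ok` are stand-ins for your agent's own model-call and quality-check hooks, not real OpenClaw APIs.

```python
def run_with_fallback(task, chain, run_task, quality_ok):
    """Try the cheapest model first; escalate on failure or low quality.
    `run_task(model, task)` and `quality_ok(output)` are hypothetical
    hooks for your agent's model call and quality check."""
    last = (None, None)
    for model in chain:
        try:
            output = run_task(model, task)
        except Exception:
            continue  # model unavailable or errored: move down the chain
        last = (model, output)
        if output is not None and quality_ok(output):
            return model, output
    return last  # best effort after exhausting the chain
```

Note the asymmetry with the config-level fallback chain described earlier: that chain catches outages between peers, while this one deliberately escalates quality, which is why it belongs in your routing logic rather than in `fallbackModels`.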
Strategy 3: Time-of-day routing
Route expensive models to off-hours when you are not actively working. During working hours, use cheaper models for background tasks.
Set up time-of-day routing. During my active hours (specify them), route background tasks to phi4. During off-hours, allow background tasks to use DeepSeek V3 if needed. Show me the time-based routing logic.
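The time check itself is trivial. The 09:00-18:00 window below is an assumed set of working hours, and the model choice per window mirrors the strategy described above: local during active hours, mid-tier allowed off-hours.

```python
from datetime import datetime, time

# Assumed working hours; substitute your own.
ACTIVE_START, ACTIVE_END = time(9, 0), time(18, 0)

def background_model(now=None):
    """Pick the model for a background task by time of day:
    free local model during active hours, mid-tier API off-hours."""
    if now is None:
        now = datetime.now().time()
    if ACTIVE_START <= now < ACTIVE_END:
        return "ollama/phi4:latest"
    return "deepseek/deepseek-chat"
```

One design note: the local model runs during your active hours, not off-hours, because that is when background work competes with interactive work for your API budget and attention.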
How model capabilities evolve over time
Model capabilities are not static. What was state-of-the-art six months ago may be mid-tier today. Keeping track of model evolution helps you know when to re-evaluate your routing strategy and potentially switch to a better model for the same tasks.
Signs a model has improved significantly:
- A new version releases with major capability jumps
- Benchmark scores increase noticeably across multiple independent evaluations
- Community reports of better performance on tasks similar to yours
- The provider's own documentation highlights new capabilities that match your use case
Signs a model has been superseded by a better alternative:
- A newer model from the same provider offers better quality at the same price point, or similar quality at a meaningfully lower price
- Community discussion shifts away from the old model toward the new one for your type of tasks
- The old model stops receiving updates while the new one gets regular improvements and bug fixes
Check the current state of the models I use. Have any of them released significant updates in the last three months? Are there newer models that might be better suited to my tasks at similar or lower cost? Provide a brief summary of any relevant changes that could affect my model selection decision.
Next steps after model selection
Once you have selected the right models for your tasks, implement the routing and monitor the results for at least one week. Adjust based on actual performance, not just benchmarks. After the initial tuning, schedule a quarterly review to ensure your selections remain optimal as models evolve and your workload changes.
Create a quarterly review reminder for my model selection. Set up a cron job that reminds me every three months to re-run the benchmarking and cost analysis to ensure I am still using the optimal models for my current workload.
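If you schedule the reminder with standard cron rather than through your agent, the schedule is a one-liner. This uses standard crontab(5) syntax; the command is a placeholder, since how your OpenClaw install triggers an agent task depends on your setup.

```
# 09:00 on the 1st day of every third month (standard crontab syntax).
# The command below is a placeholder, not a real OpenClaw CLI call.
0 9 1 */3 * /path/to/your-model-review-reminder
```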
More common questions
How do I know if a model is “good enough” for my tasks?
Run a benchmark with 5-10 representative tasks. If the model succeeds on 80%+ of tasks with quality scores of 4+ (on a 5-point scale), it is good enough. The remaining 20% can be handled by fallback to a more capable model. Perfection is expensive; “good enough” is the sweet spot for cost optimization.
My tasks vary widely in complexity. Should I use different models for different task types?
Yes. Categorize your tasks by complexity and route accordingly. Simple tasks (file reads, status checks) go to phi4. Moderate tasks (structured output, multi-step) go to DeepSeek V3. Complex tasks (research synthesis, long-form writing) go to Claude Sonnet. This tiered approach minimizes cost while ensuring quality where it matters.
How often should I re-evaluate my model choices?
Every 3-6 months, or when your workload changes significantly. New models are released, prices change, and your tasks may evolve. A quarterly check ensures you are still using the optimal model for each task category.
What if my tasks require capabilities that no single model provides?
Break the task into subtasks and route each subtask to the appropriate model. For example: research subtask to Claude Sonnet, data extraction subtask to DeepSeek V3, formatting subtask to phi4. This is more complex but can produce better results at lower cost than using a single expensive model for everything.
How do I handle model availability issues?
Set up a fallback chain. If your primary model is unavailable (rate limit, downtime), automatically fall back to the next model in the chain. For example: phi4 → DeepSeek V3 → Claude Sonnet. This ensures tasks complete even when your preferred model is temporarily unavailable.
Cost optimization examples
Example 1: Research assistant workload
Before optimization: All tasks run on Claude Sonnet. Monthly cost: ~$45.
After optimization: Simple tasks (file reading, status checks) moved to phi4. Moderate tasks (data extraction, summarization) moved to DeepSeek V3. Only complex research synthesis stays on Claude Sonnet. Monthly cost: ~$18.
Savings: 60% reduction with no noticeable quality drop.
Analyze my current workload. Categorize my tasks and estimate the monthly cost savings from moving appropriate tasks to cheaper models. Show me the breakdown.
Example 2: Automation-heavy workload
Before optimization: Cron jobs run on DeepSeek V3. Monthly cost: ~$12 for automation alone.
After optimization: All cron jobs moved to phi4. Monthly cost: $0 for automation.
Savings: $12/month with identical results for structured automation tasks.
Check my cron jobs. Which ones are running on API models that could run on phi4 instead? Estimate the monthly savings from switching them.
Implementation checklist
- ✅ Benchmark your tasks on candidate models
- ✅ Categorize tasks by complexity
- ✅ Set up model routing based on categories
- ✅ Configure fallback chains for availability
- ✅ Monitor cost and quality for one week
- ✅ Adjust routing based on monitoring results
- ✅ Schedule quarterly re-evaluation
Run through the implementation checklist for my setup. For each item, tell me what needs to be done and whether I have already done it.
