How to test whether a local model can handle your workload before committing

Switching OpenClaw to a local model saves money. It also breaks things in ways that are hard to predict without running a real test first. This article is a repeatable pre-flight checklist for evaluating whether a local model (Ollama, LM Studio, or any other local inference server) can actually handle what you throw at it before you commit to the switch.

TL;DR

  • The test: Three benchmarks (structured output, tool call parsing, and multi-step reasoning) run against your actual workload before switching.
  • The threshold: 90% pass rate on all three. Below that, route that task class to a capable model instead.
  • The goal: Know exactly which task types a local model handles before you find out the hard way.

Why the switch fails without a test

Most local model failures are not obvious. The model does not crash or return an error. It returns something that looks reasonable but is slightly wrong: a JSON object with a missing field, a tool call with the wrong argument, a summary that drops a key fact. These failures compound. One bad tool call leads to a retry, which leads to extra API calls, which leads to a bill larger than the frontier model would have produced.

The three failure modes that cause the most damage in practice:

  • Structured output failures. The model ignores the schema and returns freeform text. OpenClaw’s parser fails, the tool call errors, and the agent retries with the frontier model anyway.
  • Tool call hallucinations. The model invents a tool name that does not exist or calls a real tool with wrong arguments. The agent either errors out or executes something unintended.
  • Context truncation. The local model has a shorter context window than you realize, silently drops the second half of a long conversation, and produces a response that ignores recent instructions.

All three are testable before you switch. The benchmarks below surface them.

Before you run any test

Get your baseline first. Ask your agent what it is currently doing and how much it costs:

Look at my last 30 conversations. Count how many involved: (1) tool calls with structured JSON output, (2) multi-step reasoning with 3 or more steps, (3) simple Q&A or short responses under 200 words. Give me the breakdown as percentages. Then tell me which local model I currently have configured and its context window size.

That breakdown determines which benchmarks matter most for your specific setup and saves you from running tests that do not apply to the work your agent actually does. If 80% of your interactions are simple Q&A, a local model will likely handle them fine. If 60% involve tool calls, the tool call benchmark is critical.

Context window gotcha

Many local models advertise an 8k or 16k context window. OpenClaw injects system prompts, memory blocks, and tool definitions before your first message. On a typical setup, that overhead alone consumes 4,000 to 8,000 tokens. A model with an 8k window may have fewer than 1,000 tokens left for the actual conversation. Check this before testing.

Check my current system prompt length in tokens. Add the memory injection overhead if memory is enabled. Add the tool definitions overhead. Tell me the total token overhead and how much context window remains for the conversation when using a local model with an 8k context window.
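The arithmetic behind that check is a straight subtraction. A minimal sketch with placeholder overhead figures; replace them with the actual token counts your agent reports:

```python
# Rough context-budget arithmetic for a local model behind OpenClaw.
# The overhead numbers below are placeholders; substitute the totals
# your agent reports for your actual setup.

def remaining_context(window: int, system_prompt: int,
                      memory_blocks: int, tool_defs: int) -> int:
    """Tokens left for the conversation after fixed per-session overhead."""
    return window - (system_prompt + memory_blocks + tool_defs)

# An 8k window with mid-range overhead from the figures above:
tokens_left = remaining_context(window=8192, system_prompt=3500,
                                memory_blocks=1500, tool_defs=1200)
print(tokens_left)  # 1992: under 2k tokens before the first user message
```

If the result comes out near or below zero, fix the overhead before benchmarking anything: the model will truncate silently no matter how capable it is.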

Benchmark 1: How to test local model structured output

This tests whether the local model can follow a JSON schema reliably. OpenClaw’s memory extraction, tool results, and several internal operations depend on structured output. A model that cannot hold a schema under pressure will cause silent failures throughout the pipeline.

Run this test five times and count failures:

Switch your model temporarily to the local model you want to test. Then extract structured data from this paragraph and return it as valid JSON with these exact fields: title (string), date (ISO 8601), participants (array of strings), outcome (string), action_items (array of strings). Paragraph: “On March 15th, Sarah Chen and Marcus Webb met to review the Q1 deployment schedule. They agreed to push the gateway hardening update to March 22nd and assigned the config audit to Sarah. Marcus will handle the rollback testing.” Do not include any text outside the JSON object.

A passing result is a valid JSON object with all five fields populated correctly. Any extra text outside the JSON, any missing fields, or any malformed values counts as a failure. Four out of five passing is the minimum threshold. Below that, the model cannot handle structured output reliably enough for production use.

Common failure signatures to watch for:

  • Markdown fencing. The model wraps the JSON in a code block instead of returning raw JSON. OpenClaw’s parser chokes on the backticks.
  • Extra commentary. The model adds “Here is the JSON:” before the object or “Let me know if you need changes” after it. Both break parsing.
  • Wrong field types. participants as a string instead of an array, date as “March 15th” instead of ISO format. These look correct on screen but fail validation downstream.
  • Missing fields. The model decides action_items is empty and omits the key entirely instead of returning an empty array.

If you see any of these on more than one run out of five, structured output routing will cause problems in production. Note specifically which failure type appears most often. That tells you whether the issue is format compliance, field completeness, or type handling.
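Those failure signatures can be scored mechanically rather than eyeballed. A minimal validator sketch, assuming you capture each raw model response as a string; the field names come from the benchmark prompt, and the fence-stripping logic is illustrative:

```python
import json
import re

# Expected fields and types from the benchmark prompt.
REQUIRED = {
    "title": str,
    "date": str,          # ISO 8601 expected, e.g. 2026-03-15
    "participants": list,
    "outcome": str,
    "action_items": list,
}

def validate_run(raw: str) -> list[str]:
    """Return the failure signatures for one run; empty list means a pass."""
    failures = []
    text = raw.strip()
    if text.startswith("```"):
        failures.append("markdown fencing")
        text = text.strip("`").removeprefix("json").strip()
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return failures + ["extra commentary or malformed JSON"]
    if not isinstance(obj, dict):
        return failures + ["top-level value is not a JSON object"]
    for field, ftype in REQUIRED.items():
        if field not in obj:
            failures.append(f"missing field: {field}")
        elif not isinstance(obj[field], ftype):
            failures.append(f"wrong type for field: {field}")
    date = obj.get("date")
    if isinstance(date, str) and not re.match(r"\d{4}-\d{2}-\d{2}", date):
        failures.append("date not ISO 8601")
    return failures
```

Run the benchmark five times, collect the failure lists, and count which signature appears most often; that tally is the breakdown the paragraph above asks for.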

Benchmark 2: Tool call parsing

This tests whether the model can correctly invoke tools with the right arguments in the right format. Tool call failures are the most common source of unexpected agent behavior after switching to a local model.

With the local model active, run three tasks in sequence: (1) Read the file MEMORY.md and summarize the Infrastructure section in three bullet points. (2) Search the web for “OpenClaw latest release” and return the version number and release date. (3) Write a file called test-output.txt with the text “benchmark test passed” and confirm it was written by reading it back. Report whether each task completed successfully or produced an error.

All three should complete without errors. If any fails, note which one and what the error was:

  • File read failure usually means the model is hallucinating tool syntax: calling a tool that does not exist or passing the path argument with the wrong key name.
  • Web search failure usually means the model cannot format the query argument correctly or invents a search tool name that does not match the registered tool.
  • Write-and-verify failure usually means the model lost track of the task mid-way and either skips the verification step or reads the wrong file.

What to look for: The model should use the correct tool names (read, web_search, write) with correct argument formats. If you see tool names that do not exist, arguments in the wrong format, or the model asking you to run commands manually instead of using tools, those are all failures. A model that says “you can run: cat MEMORY.md” instead of calling the read tool has failed the benchmark even if the instruction was technically correct.

If the tool call benchmark fails, do not assume the model is useless. Tool call quality varies significantly by tool schema complexity. Run a follow-up test with only the simplest tool in your setup (usually a file read) to find the floor. Some local models handle simple reads reliably but break on tools with nested argument schemas like memory recall or exec.

Benchmark 3: Multi-step reasoning under constraint

This tests whether the model can hold a constraint across multiple reasoning steps. This is where most local models fall apart on complex tasks. The failure is usually silent: the model produces a plausible-looking answer that violates one of the constraints without flagging the violation.

With the local model active, solve this planning problem step by step. Constraints: (1) Task A must complete before Task B starts. (2) Task C can run in parallel with Task B but not with Task A. (3) Task D requires both B and C to be complete. (4) The total timeline must not exceed 6 hours. Tasks: A takes 2 hours, B takes 3 hours, C takes 2 hours, D takes 1 hour. Give a valid schedule with start and end times for each task. Then explain why each constraint is satisfied.

The correct answer: A starts at hour 0 and ends at hour 2. B and C both start at hour 2. B ends at hour 5, C ends at hour 4. D starts at hour 5 and ends at hour 6. The explanation should confirm that B and C overlap, that D waits for both, and that the 6-hour constraint is met exactly.
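Checking a proposed schedule by hand is error-prone, so it is worth encoding the constraints once. A small verifier sketch; the interval format (start_hour, end_hour) is an assumption for the sketch:

```python
def overlaps(a: tuple, b: tuple) -> bool:
    """True if two (start, end) intervals share any time."""
    return a[0] < b[1] and b[0] < a[1]

def verify(sched: dict) -> list[str]:
    """Return violated constraints; empty list means the schedule passes."""
    violations = []
    A, B, C, D = sched["A"], sched["B"], sched["C"], sched["D"]
    if B[0] < A[1]:
        violations.append("1: B starts before A finishes")
    if overlaps(A, C):
        violations.append("2: C overlaps A")
    if D[0] < max(B[1], C[1]):
        violations.append("3: D starts before both B and C finish")
    if max(t[1] for t in sched.values()) > 6:
        violations.append("4: timeline exceeds 6 hours")
    # Durations must match the stated task lengths.
    for task, dur in {"A": 2, "B": 3, "C": 2, "D": 1}.items():
        if sched[task][1] - sched[task][0] != dur:
            violations.append(f"duration wrong for {task}")
    return violations

correct = {"A": (0, 2), "B": (2, 5), "C": (2, 4), "D": (5, 6)}
print(verify(correct))  # [] -- the schedule from the paragraph above passes
```

Running the model's answer through this makes the second, rephrased run easy to score consistently.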

Common failure patterns:

  • C starts before A finishes. The model misreads constraint 2 and runs C in parallel with A, violating the “not with Task A” part.
  • D starts too early. The model starts D when C finishes (hour 4) without waiting for B to finish (hour 5), violating constraint 3.
  • Correct schedule, wrong explanation. The model gets the times right but explains the constraints incorrectly. This is still a failure: it means the model cannot verify its own reasoning.
  • Overtime. The model adds padding to each task and the total runs to 7 or 8 hours, violating constraint 4.

Run this benchmark twice. Local models sometimes get the right answer by pattern-matching rather than reasoning, and the second run with slightly rephrased constraints reveals whether the reasoning was genuine or coincidental.

Reading the results

After running all three benchmarks, you have a clear picture of what the local model can and cannot handle:

Decision matrix

  • All three pass: The local model is ready for production on those task types. Make the switch and monitor for two weeks.
  • Structured output fails, others pass: Use the local model for simple Q&A and short responses. Route anything involving memory extraction or tool results to a capable model.
  • Tool calls fail, others pass: The local model can handle reasoning and simple chat. Do not route any task that requires tool use to it.
  • Multi-step reasoning fails: Restrict the local model to heartbeats, status checks, and single-step tasks. Anything requiring more than two reasoning steps needs a capable model.
  • Two or more fail: The local model is not ready for this workload. Use it only for free-tier tasks (heartbeats, HEARTBEAT_OK pings) where quality does not matter.
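The matrix reduces to a small decision function. One way to encode it; the return strings are shorthand for the rows above, not config values:

```python
def route_decision(structured_ok: bool, tools_ok: bool, reasoning_ok: bool) -> str:
    """Map three benchmark pass/fail results to the decision matrix above."""
    failures = sum(not ok for ok in (structured_ok, tools_ok, reasoning_ok))
    if failures >= 2:
        return "free-tier tasks only (heartbeats, HEARTBEAT_OK pings)"
    if not structured_ok:
        return "simple Q&A only; route memory extraction and tool results elsewhere"
    if not tools_ok:
        return "reasoning and simple chat; no tool-use tasks"
    if not reasoning_ok:
        return "heartbeats, status checks, and single-step tasks only"
    return "ready for tested task classes; switch and monitor for two weeks"
```

Keeping the decision in one place makes it easy to re-run after each round of benchmarks and compare results over time.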

Setting up routing after the test

Once you know which tasks the local model can handle, set up routing to match. The goal is automatic task-class routing, not manual model selection per conversation. Manual selection fails when you forget, when the agent runs overnight, or when a cron job fires at 3am.

Based on these benchmark results: [paste your results here]. Suggest a model routing configuration that routes simple tasks and heartbeats to the local model, tool-heavy tasks to deepseek/deepseek-chat, and complex multi-step tasks to anthropic/claude-sonnet-4-6. Show me the exact config block. Do not apply it yet.

The review step before applying is not optional. A routing config that sends the wrong task class to the wrong model is harder to diagnose than the original problem. Review the suggested config against your benchmark results before committing it.

One thing to check in the suggested config: make sure the fallback chain is explicit. If the local model fails a task at runtime, OpenClaw needs to know which model to fall back to. A config without a fallback will error rather than recover, which is worse than not routing at all.
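A sanity check on the shape before applying anything. The structure below is illustrative, not OpenClaw's actual config schema; map the keys, task-class names, and model IDs onto whatever your version uses:

```python
# Illustrative routing structure with an explicit fallback chain.
# Keys and task-class names are assumptions for the sketch.
routing = {
    "rules": [
        {"task_class": "heartbeat",  "model": "ollama/llama3.1:8b"},
        {"task_class": "tool_heavy", "model": "deepseek/deepseek-chat"},
        {"task_class": "multi_step", "model": "anthropic/claude-sonnet-4-6"},
    ],
    # If the routed model fails at runtime, try these in order
    # instead of erroring out on the task.
    "fallback": ["deepseek/deepseek-chat", "anthropic/claude-sonnet-4-6"],
}

# The review step: refuse any config whose fallback chain is missing.
assert routing.get("fallback"), "no fallback chain: failures will error, not recover"
```

The assert at the end is the point: whatever format your config takes, make the missing-fallback case fail loudly at review time rather than silently at 3am.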

After you switch: what to watch

The benchmarks tell you what the model can do in isolation. Production reveals what it does under load, across long sessions, and with your specific system prompt and memory injection. Monitor three signals for the first two weeks:

  • Retry rate. If the local model is failing tasks and the agent is retrying with the frontier model, your API bill will go up, not down. Check the agent activity log weekly for the first month.
  • Memory extraction quality. Memory extraction uses structured output under the hood. If memories are getting garbled or incomplete after the switch, the local model’s structured output is failing silently in production while the benchmark passed in testing.
  • Response latency. Local models running on CPU or with insufficient VRAM are slow. If responses are taking more than 30 seconds, the local model is a bottleneck, not a cost saver. Check latency before concluding the switch worked.

Check my agent activity log for the last 7 days. Count: how many tasks were retried after a local model failure, how many memory extractions completed successfully versus failed, and what the average response time was for the local model versus the primary model. Report any signals that suggest the local model is underperforming.
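If your activity log can be exported as structured records, the weekly check reduces to a few aggregates. A sketch with a made-up record shape; the field names are assumptions, not the real log format:

```python
# Hypothetical activity-log records; substitute your actual export format.
records = [
    {"model": "local", "retried": True,  "latency_s": 28.0, "mem_extract_ok": False},
    {"model": "local", "retried": False, "latency_s": 12.0, "mem_extract_ok": True},
    {"model": "local", "retried": False, "latency_s": 35.0, "mem_extract_ok": True},
]

local = [r for r in records if r["model"] == "local"]
retry_rate = sum(r["retried"] for r in local) / len(local)
extract_ok = sum(r["mem_extract_ok"] for r in local) / len(local)
avg_latency = sum(r["latency_s"] for r in local) / len(local)

print(f"retry rate: {retry_rate:.0%}")      # rising retries erode the savings
print(f"extractions ok: {extract_ok:.0%}")  # silent structured-output failures
print(f"avg latency: {avg_latency:.1f}s")   # over 30s means a bottleneck
```

Record these three numbers in week one as the baseline the checklist at the end of this article asks for.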

Common failures after switching

The model passes benchmarks but fails in production

Benchmark tests run in a clean context with minimal overhead. Production runs in a loaded context with the full system prompt, injected memories, tool definitions, and conversation history already consuming tokens before the first message. The extra tokens push the model past its effective context window and quality drops in ways that did not appear in the benchmarks. Check the actual token count at the start of a production session before concluding the benchmarks were wrong.

The model works for two weeks then starts degrading

This is context accumulation. As the session grows longer, the effective context available to the local model shrinks turn by turn. The model starts dropping earlier instructions and begins producing responses that contradict things you established earlier in the conversation. Fix: check your compaction settings and reduce context retention if needed, or reduce the session length threshold that triggers compaction.

Tool calls work in testing but fail for specific tools

Some tools have more complex argument schemas than others. Memory tools, exec, and web_search all have schemas that local models handle inconsistently. If a specific tool is failing while others work, route tasks that require that specific tool to a more capable model rather than changing the entire routing config. Surgical routing beats wholesale switches.

The model handles individual benchmarks but fails when they combine

A task that requires structured output, tool calls, and multi-step reasoning simultaneously is harder than any individual benchmark suggests. If your workload regularly combines all three, run a combined benchmark that asks the model to complete a single task requiring all three before committing to the switch. A model that passes each benchmark individually but fails the combined test is not ready for that task class in production.

Latency is acceptable at first but gets worse over time

Local model latency increases with context length. A model that responds in 5 seconds at the start of a session may take 40 seconds by turn 20. This is not a model quality issue but a hardware one: the model has to process more tokens on each turn as the session grows. If you see this pattern, check whether your hardware has enough VRAM to hold the full model in GPU memory. CPU inference degrades much more sharply with context length than GPU inference.

Hardware check: does your server have enough to run the model?

Running a benchmark against a model that is already resource-constrained produces misleading results. If the model is swap-paging during the benchmark, it will fail in ways that have nothing to do with model quality. Check hardware capacity first.

Check the current memory usage on this server. How much RAM is free? How much swap is in use? Is Ollama currently loaded? Which model is loaded and how much VRAM or RAM is it consuming? Tell me whether the server has enough headroom to run a 14B parameter model at Q4 quantization without paging to disk.

The numbers to know before you start:

  • 7B model at Q4: requires approximately 4 GB of RAM or VRAM. Runs on CPU but is slow. GPU inference at this size is fast enough for production.
  • 14B model at Q4: requires approximately 8 GB. This is the size range where CPU inference becomes a serious bottleneck. A 14B model on CPU takes 15 to 40 seconds per response depending on hardware. That is acceptable for background tasks but not for interactive use.
  • 32B model at Q4: requires approximately 18 GB. Not practical on a standard 4 GB or 8 GB VPS. This tier requires a dedicated GPU server or a machine with substantial RAM and tolerance for slow inference.
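The sizes above follow from a back-of-envelope rule: Q4 quantization stores roughly half a byte per parameter, plus runtime overhead (KV cache, buffers). The multiplier below folds both together and is a rough rule of thumb, not a guarantee for any specific runtime:

```python
# Assumption: ~0.56 GB per billion parameters at Q4, weights plus
# typical runtime overhead. Real usage varies by runtime and context size.
GB_PER_BILLION_PARAMS_Q4 = 0.56

def q4_memory_gb(params_billion: float) -> float:
    """Rough RAM/VRAM needed to run a Q4-quantized model of this size."""
    return round(params_billion * GB_PER_BILLION_PARAMS_Q4, 1)

for size in (7, 14, 32):
    print(f"{size}B at Q4: ~{q4_memory_gb(size)} GB")
# Roughly consistent with the 4 / 8 / 18 GB figures above.
```

Compare the estimate against the free RAM your server reports, not the installed total, before pulling the model.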

If your server does not have enough headroom, the benchmark results will not represent production performance. Add the model, check latency on a single prompt, and confirm the server is not paging before running the full benchmark suite.

Mapping benchmarks to your specific workload

The three benchmarks above test general capability. Your workload may have specific requirements that need targeted testing. Common OpenClaw workloads and what to test for each:

Cron-heavy setups

If most of your agent’s work happens on a schedule (daily briefings, weekly digests, automated research, monitoring checks), the key benchmarks are structured output and multi-step reasoning. Cron tasks typically do not require complex tool chains, but they do require the model to follow a consistent output format across many runs. Run the structured output benchmark 10 times instead of 5 for cron workloads. Consistency matters more than single-run quality when you are running the same task daily.

Simulate a typical cron task with the local model active: summarize my last 5 workspace files changed today, format the output as a bulleted list with one sentence per file, and send it as a Telegram message. Report whether the task completed, how long it took, and whether the output format was correct.

Memory-heavy setups

If you have autoCapture and autoRecall enabled, the local model will be involved in memory extraction. Memory extraction is a structured output task with a specific schema. It runs silently in the background after each conversation. If it fails, you do not get an error message. You get memories that are incomplete, duplicated, or categorized wrong. The only way to catch this in testing is to run a targeted memory extraction benchmark.

With the local model active, extract memories from this conversation excerpt and return them as a JSON array. Each memory should have these fields: text (string), category (one of: preference, fact, decision, entity, reflection, other), importance (float 0 to 1), scope (string). Excerpt: “Ghost prefers direct answers without preamble. She is building a content pipeline for redrook.ai. The pipeline runs 7 passes per article. The target word count per article is 40,000 characters. She uses DeepSeek Chat as the primary model. The Cloudflare zone ID for redrook.ai is cd62aaa7ecdc9ef80eaa8bf56b845c2b.” Return only the JSON array with no surrounding text.

A passing result has all six items extracted with correct categories, reasonable importance scores, and no extra text outside the JSON array. If the local model returns fewer than four of the six items, or wraps the JSON in markdown, or adds commentary before or after the array, memory extraction will degrade silently in production.
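Those pass criteria can be checked automatically. A validator sketch for the memory-extraction schema; the field names and categories come from the benchmark prompt:

```python
import json

CATEGORIES = {"preference", "fact", "decision", "entity", "reflection", "other"}

def check_memories(raw: str, expected: int = 6) -> list[str]:
    """Return problems with a memory-extraction result; empty means a pass."""
    try:
        arr = json.loads(raw)
    except json.JSONDecodeError:
        return ["not a bare JSON array (fencing or commentary present)"]
    if not isinstance(arr, list):
        return ["top-level value is not an array"]
    problems = []
    if len(arr) < expected - 2:  # fewer than four of six items is a failure
        problems.append(f"only {len(arr)} of {expected} items extracted")
    for i, m in enumerate(arr):
        if not isinstance(m, dict):
            problems.append(f"item {i}: not an object")
            continue
        if m.get("category") not in CATEGORIES:
            problems.append(f"item {i}: invalid category")
        imp = m.get("importance")
        if not isinstance(imp, (int, float)) or not 0 <= imp <= 1:
            problems.append(f"item {i}: importance not a float in [0, 1]")
        if not isinstance(m.get("text"), str) or not isinstance(m.get("scope"), str):
            problems.append(f"item {i}: text or scope missing")
    return problems
```

Since memory extraction fails silently in production, this benchmark is the only place you will ever see these problems surfaced explicitly.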

Research and writing setups

If your agent does a lot of web research, content drafting, or document summarization, the key benchmark is multi-step reasoning combined with long output. Many local models handle short responses well but degrade significantly on outputs over 1,000 words. Test for long-form output quality specifically.

With the local model active, write a 600-word explanation of how OpenClaw’s model routing works, written for someone who has never used OpenClaw before. Cover: what model routing is, why it matters for cost, how fallback chains work, and one example of a well-configured routing setup. Use plain language. No bullet points. No headers. Prose only.

Read the output carefully. Long-form degradation in local models typically shows up as: sentences that start repeating themes from earlier in the output, conclusions that contradict the setup, or the model stopping mid-explanation and adding “I hope this helps” style closings before the 600-word mark. Any of these signals indicates the model is losing coherence at longer output lengths and is not suitable for research or writing tasks in production.
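Those degradation signatures can be screened automatically before you read the prose closely. A crude sketch; the closing phrases and thresholds are illustrative starting points, not tuned values:

```python
# Phrases that signal a premature wrap-up; extend with ones you observe.
CLOSINGS = ("i hope this helps", "let me know if", "in conclusion")

def degradation_signals(text: str, target_words: int = 600) -> list[str]:
    """Flag the long-form failure signatures described above."""
    signals = []
    words = text.lower().split()
    if len(words) < target_words * 0.8:
        signals.append(f"stopped early: {len(words)} of ~{target_words} words")
    lowered = text.lower()
    for phrase in CLOSINGS:
        pos = lowered.find(phrase)
        if 0 <= pos < len(lowered) * 0.9:  # a closing in the final 10% is fine
            signals.append(f"early closing phrase: {phrase!r}")
    # Repeated word 4-grams suggest the model is looping on earlier themes.
    grams = [" ".join(words[i:i + 4]) for i in range(len(words) - 3)]
    repeats = len(grams) - len(set(grams))
    if grams and repeats / len(grams) > 0.05:
        signals.append("high 4-gram repetition (looping on earlier themes)")
    return signals
```

An empty result is not a pass by itself; it just means the output earned a careful human read.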

How to interpret partial failures

Most real-world benchmark results are not clean passes or clean failures. They are mixed: the model passes structured output but fails one out of five tool call tests, or it passes reasoning but the output degrades on the combined test. Here is how to interpret the most common mixed results:

One out of five structured output runs fails

An 80% pass rate on structured output is borderline. For low-stakes tasks like formatting a summary, an 80% rate is acceptable if failures are caught and retried. For tasks where a bad JSON object causes downstream data corruption, 80% is not acceptable. Decide based on what happens when the extraction fails in production, not on the pass rate in isolation.

Tool calls work but with wrong argument names

This is a common failure mode for models fine-tuned on a different tool schema than the one OpenClaw uses. The model knows the concept of tool calling but uses different argument names. The fix is usually a system prompt addition that explicitly lists the correct tool names and argument formats. Test whether adding that context to the system prompt closes the gap before deciding the model cannot handle tool calls at all.

Reasoning passes in isolation but fails combined with tool calls

The model is hitting its effective context limit during the combined task. The reasoning benchmark and the tool call benchmark each use relatively little context. When combined, the overhead from tool definitions, tool results, and conversation history pushes the model past the point where it can hold all constraints in working memory simultaneously. The fix is to reduce context overhead before switching, not to switch to a different model.

Results vary significantly between runs

High variance across benchmark runs on the same prompt usually means the model’s temperature setting is too high, the model is quantized at a level that introduces random degradation, or the hardware is thermally throttling during later runs. Check the Ollama temperature setting and run the benchmarks when the server is not under other load before drawing conclusions from variable results.

Final checklist before switching

Run through this checklist before committing to the local model switch:

  • Hardware verified: Server has enough RAM or VRAM to run the model without paging. Latency on a single test prompt is under 30 seconds.
  • Structured output: Four out of five runs pass with no extra text, no missing fields, and correct field types.
  • Tool calls: All three tool call tasks complete without errors. No hallucinated tool names. No manual instruction fallback.
  • Multi-step reasoning: Correct schedule produced on both runs. All four constraints correctly identified in the explanation.
  • Workload-specific benchmark: Whichever additional benchmark applies to your setup (cron, memory extraction, long-form output) passes at the same threshold.
  • Routing config reviewed: Config reviewed against benchmark results before being applied. Fallback chain is explicit.
  • Monitoring set up: Agent activity log review scheduled for the first two weeks after switching. Retry rate, memory extraction quality, and response latency all have baselines to compare against.

If every item on the checklist is confirmed, the switch is ready. If any item is not confirmed, note which one and decide whether to fix the gap, accept the risk, or restrict the local model to only the task classes it passed.

Model-specific notes for common local models

These notes are based on running the three benchmarks above against the models most commonly used with OpenClaw as of March 2026. They are not a substitute for running the benchmarks yourself against your specific workload, but they give you a starting point for what to expect.

phi4:latest (14B Q4_K_M)

Structured output: passes reliably on standard schemas. Fails occasionally on nested arrays inside arrays. If your memory extraction schema uses nested structures, test that specific schema rather than the flat example in the benchmark.

Tool calls: inconsistent on web_search. Passes read and write reliably. If your workload is primarily file operations and does not depend on web search in the critical path, phi4 is a reasonable choice for that task class.

Multi-step reasoning: passes the benchmark reliably at 14B Q4. Degrades on problems with more than four simultaneous constraints. For most OpenClaw cron and briefing workloads, this is not a limitation in practice.

Latency on a 4 GB VPS with CPU inference: 20 to 35 seconds per response. On a machine with 8 GB RAM and no GPU: 15 to 25 seconds. On a machine with dedicated GPU (RTX 3060 or better): 3 to 8 seconds. Check latency on your specific hardware before routing interactive tasks to this model.

llama3.1:8b

Structured output: passes on simple flat schemas consistently. Fails on schemas with more than four fields, and fails frequently when the output must be a JSON array instead of a JSON object. Not reliable enough for memory extraction in production.

Tool calls: passes file read and write consistently. Fails on tools with more than two required arguments. The web_search and exec tools both have schemas that cause frequent failures at this model size.

Multi-step reasoning: fails the standard benchmark approximately 40% of the time. Usable for single-step reasoning tasks (summarize this, classify this, reformat this) but not for planning tasks or tasks requiring constraint tracking.

Best use case for llama3.1:8b in OpenClaw: heartbeats, HEARTBEAT_OK pings, simple file reads, and single-step summarization. Not suitable for memory extraction, tool chains, or any task requiring multi-step reasoning. At this scope, it costs nothing in API fees and responds fast enough for background operations.

qwen2.5-coder:7b

Structured output: passes reliably on schemas that resemble code or config formats. Struggles with schemas that require natural language judgment, such as importance scoring or category classification for memory extraction.

Tool calls: passes code-adjacent tools (read, write, exec) reliably. Fails on tools that require natural language query construction like web_search or memory recall.

Multi-step reasoning: passes on technical problems (debugging sequences, build pipelines, dependency resolution). Fails on non-technical reasoning tasks.

Best use case for qwen2.5-coder in OpenClaw: code review, script generation, debugging assistance, and file processing tasks with structured input and output. Not suitable for general-purpose reasoning or interactive conversation.

When a local model is the wrong tool entirely

The benchmarks above tell you whether a local model can handle your workload. There is a separate question worth asking first: is cost the actual problem, or is something else going on?

Local models make sense when API costs are high relative to the value of the tasks being run. They do not make sense when the bottleneck is something else entirely. Before running any benchmark, ask your agent for a cost breakdown by task type. If 80% of your API spend is coming from one expensive task (a long research pipeline, a large document analysis, a high-frequency cron job that calls a frontier model), that specific task is the optimization target. Switching the whole workload to local inference to address one expensive task is a larger change than necessary and introduces more risk than it saves.

The right path in that case is routing that one task class differently, not running a full benchmark suite on a local model you may not need. Run the cost audit first. The benchmark suite only makes sense after you have confirmed that broad local model routing is actually the right cost reduction strategy for your specific usage pattern. Benchmarks take time to run and routing configs take time to get right. Spend that time on the change that addresses the actual cost driver, not on the most visible change available.


Common questions

A model passes all three benchmarks but fails in production. How is that possible?

Benchmarks test individual capabilities in isolation. Production tasks often combine multiple capabilities simultaneously: a cron job that calls a tool, parses the result, writes a memory, and sends a Telegram message combines tool calling, structured output, and multi-step reasoning in one sequence. Models that handle each individually sometimes fail the combination. The benchmark is a floor check, not a ceiling guarantee. If a model passes benchmarks but fails production tasks, add a combined benchmark that runs a representative full workflow end-to-end.

My local model works perfectly for two weeks then starts degrading. What causes that?

The most common cause is context accumulation in long-running sessions. As the session context grows, the model has more to process and its effective capability for complex tasks decreases. This is not a model failure but a context management issue. The fix is compaction configuration or session restart protocols, not switching models. Check your compaction settings and confirm LCM is running before concluding the model itself is degrading.

How do I know which local model is the right fit for my specific cron setup?

Map your cron tasks by complexity tier: simple (file read, status check, single notification) versus moderate (multi-tool, structured output, memory write) versus complex (research, writing, multi-step with branching). Local models handle simple reliably, moderate with tuning, and complex poorly. If most of your cron tasks are simple, llama3.1:8b is the right choice. If a mix of moderate and simple, phi4 for the moderate tasks with llama3.1:8b for simple. Complex cron tasks should stay on API models regardless of cost concerns.

Can I run different local models for different cron jobs on the same instance?

Yes. OpenClaw’s model routing lets you specify a model per job in the cron payload. Set simple jobs to ollama/llama3.1:8b and complex jobs to ollama/phi4:latest or a fallback API model. Ollama loads models on demand and unloads them based on keep-alive settings, so running two models does not require twice the RAM simultaneously as long as they are not called at the same time.

What is the minimum RAM needed to run phi4:latest reliably?

phi4:latest (14B Q4_K_M) requires about 9 GB of RAM to load. On a server with 16 GB total, that leaves limited headroom for the operating system, OpenClaw gateway, Ollama, and any other processes. If your server has exactly 16 GB, monitor memory usage closely after switching. Servers with 24 GB or more run phi4 comfortably without memory pressure.

Keep Reading:

  • Cheap Claw: How to set hard spend limits on OpenClaw API usage. Configure spend monitoring and self-imposed API cost controls so you never get surprised by a bill.
  • Cheap Claw: Unexpected API calls are driving up my bill overnight. How to find and stop the cron jobs and background tasks quietly burning API budget.
  • Cheap Claw: How to get prompt cache hits on every API call. Structure your system prompts to maximize cache reuse and cut per-call token costs.

Go deeper

  • Model Routing: My OpenClaw responses got worse after switching to a cheaper model. Diagnose why quality dropped and which tasks to reroute.
  • Model Routing: OpenClaw keeps switching models mid-task and I don’t know why. The five causes of automatic model switching and how to stop each one.
  • Cost: Which OpenClaw features cost money and which are completely free. The full cost breakdown before you optimize anything.