How to reduce system prompt size without losing context

Every OpenClaw session starts the same way: the system prompt loads before your first message. It is the invisible foundation everything else sits on. For many operators it is also the largest single consumer of context tokens in the entire session. A system prompt that has grown unchecked through months of adding files, rules, and plugins can consume 30,000 to 60,000 tokens before you type a single word. Those tokens are paid for on every turn, every session, every cron job that fires. This article covers what is actually inside your system prompt, which parts are dead weight you can cut, and the structural patterns that eliminate bloat without losing the context your agent actually needs.

TL;DR

  • Your system prompt is not one file. It is the sum of everything that loads at session start: base instructions, all workspace files, memory injections, plugin context blocks, and channel metadata. Most operators have never measured the total.
  • Audit before cutting. Run the token count first. Know which sections are expensive before deciding what to trim.
  • Move rarely-needed context to on-demand reads. Files your agent needs once a week should not load on every session start.
  • Tighten memory injection limits. autoRecall injects memories on every turn. Too many, too loose, and the injected block costs thousands of tokens per session.
  • Trim the files themselves. Most AGENTS.md and SOUL.md files accumulate cruft. Old rules that no longer apply, redundant sections, notes that were meant to be temporary. Cut them.
  • Over-trimming has consequences. Know what you are removing before you remove it. Some context that looks redundant is actually load-bearing.

Throughout this article you will see indented blocks like the ones below. Each one is a command you can paste directly into your OpenClaw chat. Your agent will run it and report back. You do not need to open a terminal or edit any files manually.

What actually goes into the system prompt

When people say “system prompt” they usually mean one file. In OpenClaw, the system prompt is assembled from multiple sources on every session start. Understanding what goes in helps you understand what you can trim and what you cannot.

Base instructions from OpenClaw itself. The platform injects a base system prompt covering tool usage, safety guidelines, formatting rules, and capability descriptions. This is not user-configurable in the typical sense, but it is a fixed token cost on every session.

Workspace context files. Any file listed in your workspace context config gets read and prepended at session start. For most operators this includes SOUL.md, AGENTS.md, USER.md, TOOLS.md, MEMORY.md, INFRASTRUCTURE.md, and others added over time. Each file adds tokens. A typical AGENTS.md that has grown over several months of use can run 5,000 to 15,000 tokens on its own.

Memory plugin injections. If you have autoRecall enabled, the memory plugin queries your memory store on each turn and injects matching memories into the context. Depending on your settings, this block can range from a few hundred tokens to several thousand per turn. It is not a one-time load at session start. It compounds on every single turn in the session.

Plugin context blocks. Some plugins inject their own context at session start: channel metadata, node status, conversation history from external platforms, active cron job listings, and others. Each plugin that does this adds to the system prompt total without any visible line in your workspace files.

Channel and routing metadata. Information about the current channel, sender identity, and available capabilities is injected by OpenClaw automatically. This is a small but non-zero cost.

    Estimate the total token count of my current system prompt. Break it down by source: base platform instructions, each workspace file loaded at session start, memory injections (per turn average), and any plugin context blocks. Give me an approximate total and rank the sources by size.

This number is the baseline you are working from. Do not guess. Run it before deciding anything.

Why system prompt size matters

The immediate reason most people care about system prompt size is compaction. If your prompt consumes 40,000 tokens and your context window is 60,000, you have 20,000 tokens of working space before compaction fires. That sounds like enough until you factor in a few file reads and a tool call or two.

But system prompt size has three other effects worth understanding:

Cost per turn. You pay for input tokens on every model call. Your system prompt is re-sent as input on every turn. A 40,000-token system prompt on a session with 20 turns costs you 800,000 input tokens just from the prompt, before any actual conversation. On Claude Sonnet at $3 per million tokens, that is $2.40 per session in system prompt costs alone. On a model like DeepSeek at $0.28 per million tokens, the same session costs $0.22. The model you choose multiplies the system prompt cost directly.

Response latency. Models process all tokens in context before generating each reply. Larger context means longer time-to-first-token. The effect is subtle on short prompts and significant on long ones. If responses feel slow on the first turn before any history has built up, the system prompt size is the most likely cause.

Cron job costs. Every cron job that fires starts a new session and loads the full system prompt. A queue processor that fires 48 times a day loads your entire system prompt 48 times. If that prompt is 40,000 tokens and you are using an API model, that cron job is burning nearly 2 million input tokens per day just in prompt overhead before it does any actual work. Multiply by several active cron jobs and the numbers get real fast.
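The cost arithmetic above is easy to script. A minimal sketch, assuming the figures from this section (a 40,000-token prompt, $3 per million input tokens, 10 chat turns and one cron job firing 48 times per day); the function name is illustrative, not an OpenClaw API:

```python
# Rough daily input-token cost attributable to re-sending the system prompt.
# Assumption: every chat turn and every cron session sends the full prompt once.

def daily_prompt_cost(prompt_tokens: int, chat_turns: int,
                      cron_fires: int, price_per_million: float) -> float:
    sends = chat_turns + cron_fires              # each send pays the full prompt
    return prompt_tokens * sends * price_per_million / 1_000_000

cost = daily_prompt_cost(40_000, chat_turns=10, cron_fires=48,
                         price_per_million=3.0)  # Sonnet-class pricing, assumed
print(f"${cost:.2f} per day")  # 40k tokens x 58 sends = 2.32M tokens -> $6.96
```

The relationship is strictly linear: halve `prompt_tokens` and the daily overhead halves with it, which is why trimming pays off across every session and cron fire at once.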

    Based on my system prompt token estimate: how much is my system prompt costing per day in input tokens, assuming my cron jobs fire at their current frequency and I have roughly 10 active chat turns per day? Use my current primary model’s pricing.

Audit before you cut

Before cutting anything, run a structured audit. You need to know which sources are expensive and which are not. Cutting the wrong thing wastes time and may break something. Cutting the right things can halve your system prompt size in under an hour.

Step 1: Measure each workspace file

    List every file in my workspace that loads at session start. For each one, read it and give me an approximate token count. Sort them from largest to smallest. Flag any file where I have not referenced or updated its content in the last two weeks.

The files you have not touched in two weeks are your first candidates for either trimming or moving to on-demand loading. They are loading on every session even though you rarely need them.

Step 2: Read your AGENTS.md and SOUL.md as if you are an editor

These files grow by accretion. Every time you add a rule, it stays. Every note that was meant to be temporary becomes permanent. Most AGENTS.md files in active use contain at least 30% cruft by the time they are six months old: rules that contradict newer rules, sections that describe behaviors you changed, warnings about bugs that were fixed, and notes written for a specific task that were never removed.

    Read my AGENTS.md. Flag every section that meets any of the following criteria: (1) it describes a rule that is already covered by another section, (2) it references a tool, plugin, or config state that has since changed, (3) it was written for a specific task and does not apply to general operation, or (4) it is a warning about something that no longer exists. Do not edit anything yet. Just flag and explain.

    Read my SOUL.md. What sections are essential to how you operate day-to-day? What sections have not influenced a response in recent memory? What could be condensed to half its current length without losing meaning?

Do not act on these findings yet. Collect them first. Review the full list before cutting anything.

Step 3: Audit memory injection settings

autoRecall is one of the highest-value features in the memory system. It is also one of the stealthiest token consumers. Before you can do anything about it, you need to know how your plugin injects memories, because the two injection patterns have very different cost profiles.

The two injection patterns

User message prefix injection is the most common pattern. Every time you send a message, the plugin prepends a block of recalled memories to it before it reaches the model. That block becomes part of the conversation turn. It gets stored in the conversation history. It is still there on turn 5, turn 10, turn 20. The cost is not flat per session. It compounds: each turn adds another injection block to the persistent history, and all of the previous blocks stay in context. A 20-turn session with 3,000-token injections per turn carries 60,000 tokens of memory blocks in history by the end, plus whatever the conversation itself contains.

System prompt layer injection injects memories into the system prompt position rather than the conversation history. In this pattern, the memory block is refreshed each turn but does not accumulate. Turn 20 has the same memory overhead as turn 1. The cost is flat: one injection block, replaced on each turn rather than stacked. This is significantly cheaper for long sessions.

To find out which pattern your setup uses:

    Check my memory plugin config. Does autoRecall inject memories as a prefix to user messages, or does it inject into the system prompt layer? I want to know whether the injected blocks accumulate in conversation history or get replaced each turn.

The cost math, concretely

If your plugin uses user message prefix injection and your limit is set to 10 memories averaging 200 tokens each, you are adding 2,000 tokens of memory blocks per turn. Across a 15-turn session, that leaves 30,000 tokens of injection blocks sitting in conversation history by the final turn. The input cost is worse than that number suggests, because every turn re-sends all the blocks injected so far: the total input attributable to injection is 2,000 × (1 + 2 + … + 15) = 240,000 tokens. On Claude Sonnet at $3 per million input tokens, that is roughly $0.72 per session just in memory injection overhead. Across 30 sessions in a month, that is about $21.60 from memory injection alone, before any other costs. It adds up faster than it looks, especially with higher limits or longer sessions.

If your plugin uses system prompt layer injection, the math is different: 2,000 tokens per turn as a constant overhead, not a growing one. A 15-turn session costs the same 2,000 tokens per turn whether it is turn 2 or turn 15.
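The difference between the two patterns is easiest to see as code. A minimal sketch of the arithmetic described above, with the 2,000-token block and 15-turn session as assumed inputs:

```python
# Total input tokens attributable to memory injection across a session,
# under the two injection patterns described above.

def prefix_injection_tokens(block: int, turns: int) -> int:
    # Blocks persist in history: turn n re-sends all n blocks injected so far.
    return sum(block * n for n in range(1, turns + 1))

def system_layer_tokens(block: int, turns: int) -> int:
    # The block is replaced each turn: flat overhead, no accumulation.
    return block * turns

block, turns = 2_000, 15
print(prefix_injection_tokens(block, turns))   # 240000 -- triangular growth
print(system_layer_tokens(block, turns))       # 30000  -- flat
```

The gap widens quadratically with session length, which is why long sessions are where the prefix pattern hurts most.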

The levers

Regardless of injection pattern, the two settings that control cost are the result limit and the similarity threshold.

    Read my memory plugin config. What is the maximum number of memories returned per autoRecall query? What is the similarity threshold currently set to? What is the approximate average token length of my stored memories? Based on those numbers and the injection pattern, estimate what memory injection is costing per session.

Result limit: 5 to 8 memories per turn is enough for useful recall in most setups. If you are set to 15 or 20, you are injecting a large block to retrieve marginal relevance from the lower-ranked results. Cut the limit to 8 and watch whether recall quality actually changes. It usually does not.

Similarity threshold: A higher threshold means the plugin only injects memories that are closely matched to the current message. A loose threshold sweeps in marginally related memories just to fill the result limit. Tightening it reduces the average injection size by retrieving fewer but more relevant results, without changing the limit at all. Start at 0.75 if you are currently below that.

If your plugin supports system prompt injection and you are currently using message prefix: switching to system prompt injection is the highest-impact change available. It converts a compounding cost into a flat one. Whether this switch is available depends on your specific plugin configuration.
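How the two levers interact can be sketched directly: the threshold filters candidates first, then the result limit caps what survives. The scores, token lengths, and dict shape below are illustrative, not the plugin’s actual data model:

```python
# Simulate autoRecall injection size under different threshold/limit settings.
# All memory data below is made up for illustration.

def injected_tokens(memories, threshold: float, limit: int) -> int:
    hits = [m for m in memories if m["score"] >= threshold]
    hits.sort(key=lambda m: m["score"], reverse=True)      # best matches first
    return sum(m["tokens"] for m in hits[:limit])

memories = [{"score": s, "tokens": 200} for s in
            (0.92, 0.88, 0.81, 0.74, 0.71, 0.69, 0.66, 0.63, 0.61, 0.58)]

print(injected_tokens(memories, threshold=0.50, limit=10))  # 2000 tokens/turn
print(injected_tokens(memories, threshold=0.75, limit=8))   # 600 tokens/turn
```

In this toy data, tightening the threshold alone cuts the per-turn injection by more than two thirds, because most of the candidates were only loosely matched to begin with.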

Move context to on-demand loading

Moving files to on-demand loading is the highest-impact structural change you can make. Instead of loading all context files at session start, load them when needed.

The pattern works like this: your AGENTS.md or a startup instruction tells your agent that certain files exist and what they contain, but does not load them directly. When a task comes in that needs that context, the agent reads the file then. When the task is done, the file content falls out of the working context at the next compaction or simply stops being referenced.

Files that are good candidates for on-demand loading:

  • Project files. If you have a project that is active for a week then goes quiet, its project file does not need to be in the session-start context. A one-line reference in AGENTS.md (“project files are in memory/projects/ — load the relevant one when working on a project”) is enough.
  • MEMORY.md and INFRASTRUCTURE.md. If these files change rarely, they do not need to reload every session. A brief summary section at the top of AGENTS.md with a pointer to the full file is often sufficient for most sessions.
  • Research and reference files. Any file that was generated for a specific task (a research report, a competitor analysis, a product brief) should never be in the session-start context unless you are actively working on that task right now.
  • Tools and skills documentation. If your TOOLS.md is a detailed reference document, most of it is not needed on every turn. A two-line summary of available tools and a pointer to the file for details works for most sessions.

    Look at the full list of files that load at my session start. For each one, tell me: how often does a typical session actually need the content in this file? Would a brief summary plus a “read this file when needed” instruction work instead of loading the full content? Which files are strong candidates for on-demand loading?

The goal is not to reduce what your agent knows. It is to defer loading context until the moment it is actually needed rather than preloading everything speculatively.
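The deferral pattern itself is simple enough to sketch. File names, summaries, and function names here are hypothetical, not OpenClaw features; the point is that only one-line pointers load at session start, while full files are read the moment a task needs them:

```python
# Minimal sketch of on-demand loading: pointers at session start, full reads
# only when required. All names and summaries are hypothetical.

ON_DEMAND = {
    "memory/projects/site-redesign.md":
        "Active redesign project: scope, status, blockers.",
    "INFRASTRUCTURE.md":
        "Host names, ports, and deploy notes. Rarely changes.",
}

def session_start_context() -> str:
    # Only the one-line pointers load at session start: a few tokens per file.
    lines = [f"- {path}: {summary} (read when needed)"
             for path, summary in ON_DEMAND.items()]
    return "Available reference files:\n" + "\n".join(lines)

def load_on_demand(path: str) -> str:
    with open(path, encoding="utf-8") as f:   # full read only when required
        return f.read()

print(session_start_context())
```

The pointer block costs tens of tokens instead of thousands, and the full content is still reachable on any turn that actually needs it.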

Trim the files that stay

Beyond moving files to on-demand loading, the files that do stay in the session-start context should be as lean as possible. This is the editing pass.

Cut redundancy ruthlessly

Rules that appear in both SOUL.md and AGENTS.md should appear in one place. Pick the one that is the most appropriate home and delete the copy. The same applies to model routing rules, communication preferences, budget guidelines, and protocol descriptions. These tend to replicate across files as the workspace grows.

Replace narrative with bullets where possible

A paragraph explaining a rule uses more tokens than a bullet that states the same rule directly. “When approaching any task, Gambit should first consider whether the approach aligns with the governing priority order, which places cost reduction above all other concerns, followed by security hardening, then autonomy improvements, and finally monetization-related activities” is about 60 tokens. “Priority order: (1) cost (2) security (3) autonomy (4) monetize” is 12 tokens and conveys the same rule. The model does not need prose explanation to follow a rule. It needs the rule stated clearly.

Remove historical notes and incident logs

AGENTS.md often accumulates notes like “resolved 2026-03-18: the autoRecall conflict was caused by X” or “do not use Y because of the incident on date Z.” Once a bug or issue is resolved and no longer active, its documentation does not need to live in the session-start context. Move it to an archive file or delete it. The resolution matters when you are debugging. It does not need to be loaded on every turn of every session forever.

Audit the Known Issues sections

These sections are worth special attention. They tend to grow: every issue gets added, resolutions sometimes get noted but the issue entry stays. Walk through each known issue entry and ask: is this still an active concern? If the underlying bug was fixed or the workaround was incorporated into normal operation, remove the entry entirely.

    Read my AGENTS.md. Identify every historical note, resolved issue, incident reference, or changelog entry. These are things like “resolved on date,” “fixed after the X incident,” “workaround applied,” or sections that describe past states of config or plugins that have since been updated. List them all. Do not edit yet.

    Now do the same for SOUL.md and MEMORY.md if those files load at session start. What historical content exists in each? What would be safe to remove or archive without affecting day-to-day operation?

What not to cut

Over-trimming is a real risk. Some context that appears redundant or rarely referenced is actually load-bearing in ways that are not obvious until it is gone.

Model routing rules. These need to be in the session-start context. If your agent does not know at session start which model to use for which type of task, it will default to whatever the config says, which may not be what you want. This content looks short but saves money every day. Keep it.

Red lines and failure protocols. Rules about what the agent should never do, what to do when blocked, and what requires explicit approval before acting. These need to be present at every turn, not just loaded on demand. They are the safety layer. Trimming them to save tokens is a false economy.

Active project state. If you are in the middle of a multi-day project, the project context file needs to load. This is the exception to the on-demand rule. Active project state is worth the token cost because it prevents your agent from starting each session cold on work-in-progress.

Communication routing. Channel IDs, contact information, and notification targets. If these are missing, your agent cannot reach you when something goes wrong. Worth the tokens.

    Before I approve any changes to my system prompt files: which sections are safety-critical or operationally essential and should not be removed regardless of size? Flag anything that would cause real problems if it were missing from session-start context.

Measure the result

After making changes, measure. Do not assume the changes worked.

    Start a new session and estimate the total system prompt token count again using the same breakdown as before. Compare it to the baseline. What is the reduction? Did compaction behavior change? Did anything break or feel different in how you are operating?

A 20% reduction in system prompt size is a reasonable target for a first pass. A 40% reduction is achievable if the files have been growing for more than a few months without any editing. Numbers above 50% are possible but usually require moving significant context to on-demand loading rather than just trimming existing files.

Run one full day of normal sessions before declaring the reduction successful. Watch for: unexpected agent behavior that might indicate missing context, compaction firing earlier or later than before, and any cron job outputs that seem to miss context they previously had.

System prompt trimming is part of a larger configuration audit. Context window sizing, compaction thresholds, security exposure, and model routing all interact. The complete guide covers all of them together with the agent prompt that does the audit automatically and proposes the specific changes for your setup.

Complete guide

Brand New Claw

The full production configuration audit. Context sizing, compaction tuning, system prompt optimization, security hardening, and model routing. Drop it into your agent and it audits your current setup and proposes the changes ranked by impact.

Get it for $37 →

Advanced compression techniques

Once you have completed the basic audit, these techniques produce additional reductions for operators who want to push further.

Inline code blocks versus descriptive text

Descriptive text explaining what a command does costs tokens. A labeled code block with a comment costs fewer. Compare:

Before: “To restart the gateway after a config change, navigate to the openclaw directory and run the restart subcommand.”

After:

    openclaw gateway restart  # after config changes

The code block version is shorter, more readable, and easier for the agent to act on. Apply this pattern anywhere instructions describe a command rather than just showing it.

    Scan my AGENTS.md for instructions that describe shell commands in prose. For each one, suggest a code-block replacement that says the same thing in fewer tokens.

Collapsing repeated protocol patterns

Many workspace files define the same underlying pattern multiple times with slight variations: “before modifying X, do Y” appears in different sections for different values of X. Collapse these into a single protocol table rather than repeating the pattern.

    Look for repeated protocol patterns in my workspace files: any instruction that follows the same structure applied to different scenarios. Suggest a table or consolidated protocol block that replaces the repetitions.

Splitting large files into referenced documents

If your AGENTS.md or SOUL.md has grown past 10,000 tokens, consider splitting it into a primary file and referenced supplementary files. The primary file contains always-needed context. Supplementary files are loaded on demand when specific tasks arise.

    Analyze my workspace files. Which sections are used in every session versus only for specific task types? Suggest a split where always-needed content stays in the primary file and task-specific content moves to referenced supplementary files.

Measuring the results of your reduction

After making cuts, measure the impact in two ways: token count reduction and agent behavior.

    After the system prompt reduction: measure the new token count of each workspace file, calculate the reduction versus the pre-audit baseline, and tell me the percentage saved. If we saved less than 20%, identify the next three highest-impact cuts.

For agent behavior, run a quick smoke test: ask the agent to perform a few representative tasks and verify it still has the context it needs. The risk of cutting too aggressively is that the agent loses information it relied on and starts asking clarifying questions or making wrong assumptions.

    After reducing the workspace files, run a behavior check. Ask me three questions that would rely on context from the files we cut. If any answer shows missing context, identify which section we removed too aggressively.

How much can you realistically save?

Typical reductions from a thorough audit:

  • AGENTS.md: 20-40% reduction by removing historical incident logs, narrative explanations, and redundant protocols.
  • SOUL.md: 15-25% reduction by tightening persona text and consolidating repeated guidance.
  • Memory injection: 10-30% reduction by tuning autoRecall scope and category filters.

Combined, a well-maintained workspace typically assembles to a 40-50k token system prompt. An unaudited one that has grown organically over six months can easily reach 80-100k. The audit brings it back to the lean range without losing meaningful context.

Common questions

How do I know how many tokens my workspace files are actually using?

Ask your agent directly: “Count the tokens in each of my workspace files and show me a table sorted by size.” OpenClaw loads these files at session start, so the agent has them in context. The token count is the relevant metric, not character count, because the model processes tokens. A 10,000-character file with short words might be 2,500 tokens; a technical file with long identifiers might be 4,000 tokens for the same character count.
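For a local sanity check, a rough estimator is a few lines of Python. This heuristic is an approximation; for exact counts use a real tokenizer such as OpenAI’s tiktoken:

```python
# Character count is a poor proxy for tokens: the ratio depends on the text.
# This heuristic is an approximation; a real tokenizer gives exact counts.
import re

def rough_token_estimate(text: str) -> int:
    # Split into words and punctuation; long identifiers tend to break into
    # several tokens, so weight each piece by its length.
    pieces = re.findall(r"\w+|[^\w\s]", text)
    return sum(max(1, len(p) // 4) for p in pieces)

print(rough_token_estimate("short words are cheap"))
print(rough_token_estimate("getSystemPromptTokenEstimateForWorkspaceFile()"))
```

Note how the single long identifier on the second line scores several times higher than any short word: that is the effect described above, where technical files cost more tokens per character.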

What is the maximum system prompt size I should target?

It depends on your context window and how much you want to preserve for working context. On a 200k token context window, a 20-30k token system prompt leaves plenty of room. On a 32k window, a 20k system prompt leaves very little for conversation. Target under 10% of your context window for workspace files, ideally under 5%.

I cut my AGENTS.md but the agent stopped following some of my protocols. What happened?

You cut something the agent was relying on. Review what you removed and identify the specific behavior that broke. The fix is to restore that specific section in a more compact form rather than restoring everything. The goal is to keep the information the agent needs while eliminating the prose that wraps it.

Memory injection adds a lot of tokens. Should I disable autoRecall?

Before disabling it, measure the actual injection size. Use memory_stats to see how many memories are being retrieved per session and estimate their average token count. If injection is adding 5,000+ tokens per session and the recalled memories are not consistently useful, tuning the recall threshold or reducing the number of results is more targeted than disabling autoRecall entirely.

Can I compress my workspace files with a summarizer to reduce size without losing content?

The risk is that an automated summarizer removes nuance that the agent needs. For factual sections (config values, file paths, contact info), summarization is safe. For behavioral guidance (protocols, how to handle edge cases), a human review of what stays and what goes produces better results than automated compression. If you use a summarizer, always review the output before replacing the original.

File size benchmarks for common workspace configuration files

These token ranges reflect well-maintained files at different stages of a mature project. Use them as reference points when auditing your own workspace. If your files are significantly above these ranges, the audit is likely to find meaningful savings.

  • SOUL.md: 1,500-3,000 tokens for a mature persona and behavioral guidance file. Anything over 4,000 tokens suggests narrative bloat or repeated instructions that should be consolidated.
  • AGENTS.md: 3,000-6,000 tokens for a full production setup with protocols, routing rules, and reliability architecture. Over 8,000 is worth auditing.
  • USER.md: 300-600 tokens. Should be concise factual notes, not biography.
  • MEMORY.md: 2,000-4,000 tokens. Grows over time as the project evolves and needs periodic pruning to remove superseded facts and stale decisions.
  • TOOLS.md: 300-800 tokens. Factual environment notes, no narrative needed.
  • CONSTRAINTS.md or equivalent: 500-1,000 tokens if you have a separate constraints file. Longer is usually a sign of edge-case accumulation that should be reviewed.

    Compare my workspace file token counts against these benchmarks. Flag any files that exceed the upper range, tell me which sections are contributing to the excess, and suggest three specific cuts for each file that is over the target.


More common questions

How do I actually see how many tokens my system prompt is using?

Ask your agent directly: estimate the total token count of the current system prompt, broken down by source. Agents with tool access can read each file and give you a count. For a more precise number, you can use a tokenizer tool like OpenAI’s tiktoken on each file individually, but the agent estimate is usually accurate enough for planning purposes.

If I move a file to on-demand loading, will my agent know it exists?

Only if you tell it. The on-demand pattern requires a pointer: a brief note in AGENTS.md that says “this file exists and contains X — read it when working on tasks related to X.” Without that pointer, the agent has no reason to look for the file. The pointer itself is cheap (one or two lines) and ensures the content is available when relevant without loading it speculatively every session.

Does trimming the system prompt affect cron jobs?

Yes, and it is one of the most cost-effective reasons to do it. Every cron job start loads the full system prompt. A smaller prompt means every cron job is cheaper, every time it fires. If you have cron jobs running dozens of times per day, system prompt size directly multiplies into daily token cost. The savings compound over time.

Is there a minimum context size below which I should not go?

There is no universal floor. The minimum is whatever your agent needs to operate correctly: model routing rules, red lines, communication routing, active project state. For most setups this is 3,000 to 8,000 tokens of genuinely essential content. Everything above that is either useful-but-deferrable (a candidate for on-demand loading) or accumulated cruft (a candidate for deletion).

I cut my system prompt and now my agent is behaving differently. What happened?

Something you removed was load-bearing. The most common culprits: model routing rules (the agent defaults to config instead of context-aware routing), failure protocols (the agent no longer knows what to do when blocked), or active project state (the agent lost track of work in progress). Compare your old and new files, find what was removed, and decide whether to restore it or add a more concise version of the same instruction.


From the same series

Why does OpenClaw keep compacting even on short sessions?

System prompt size and compaction threshold work together. If compaction is firing too early, start here.

Read →

Why is OpenClaw so slow? It is probably your context window.

Context window ceiling and system prompt size are the two biggest speed levers. This covers the window side.

Read →

Choosing a model based on your actual workload

After you have the prompt lean, the model choice determines what that leaner prompt costs per turn.

Read →