How to stop OpenClaw cron jobs from piling up when tasks run long

A cron job that normally takes two minutes starts taking twelve. By the time you notice, three overlapping instances are running simultaneously, each competing for the same model context and writing to the same state files. This is the OpenClaw cron job pileup problem, and it is more common than most operators realize until it happens to them. This article covers why OpenClaw cron jobs pile up when tasks run long, how to detect a pileup while it is happening, and the specific settings that prevent the overlap from recurring.

TL;DR

  • Root cause: OpenClaw does not enforce single-instance execution by default. If a job is still running when the next scheduled time fires, a new instance starts alongside it.
  • Detection: Check running sessions for duplicate job names, or look for stale lock files in the workspace.
  • Fix 1: Set a job timeout that is shorter than the schedule interval.
  • Fix 2: Use a lock file in the job prompt to prevent concurrent execution.
  • Fix 3: Switch slow OpenClaw cron jobs from time-based to completion-based scheduling.

Throughout this article you will see indented blocks like the ones below. Each one is a command you can paste directly into your OpenClaw chat. Your agent will run it and report back. You do not need to open a terminal or edit any files manually.

Why pileup happens

OpenClaw’s cron scheduler fires jobs on a fixed time interval. It does not check whether the previous run has finished before starting the next one. This is standard behavior for most cron systems: the scheduler’s job is to fire at the right time, not to track whether prior runs completed.

When a job runs long, the next scheduled instance starts while the first is still active. Both instances share the same model endpoint, the same workspace files, and the same Telegram delivery channel. If the job writes to a state file, both instances may try to write simultaneously, producing a corrupted file. If the job uses an API that has rate limits, concurrent instances double the request rate and trigger throttling that makes both runs slower.

The pileup compounds quickly: if two overlapping instances each run long, the third fires before either finishes and the problem escalates. Within a few cycles, the job queue is saturated and everything on the system slows down.

    List all currently running cron sessions. Are any job names duplicated in the active session list? If so, show me the start times for each duplicate and how long each has been running.

Detecting a pileup in progress

When a pileup is already happening, the symptoms are: Telegram messages arriving in duplicates or triplets, model requests taking longer than usual, state files that are partially written or contain merged content from two runs, and increased API spend from redundant executions.

The fastest diagnostic is a session list check:

    List all active sessions right now including cron sessions. Group by job name. Are there any job names that appear more than once? For each duplicate, show the session ID, start time, and current status.

If you see the same job name running twice with start times that differ by exactly your schedule interval (or a multiple of it), that is a pileup. Kill the older instances and address the root cause before the next scheduled fire.
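
If you prefer to verify the logic outside the chat, the duplicate-detection step reduces to a few lines. This is an illustrative Python sketch, not an OpenClaw API; it assumes you can export the active sessions as (job name, start time) pairs:

```python
from collections import defaultdict

def find_pileups(sessions):
    """Group active sessions by job name. Any job with more than
    one active session is a pileup. `sessions` is a list of
    (job_name, start_epoch_seconds) pairs."""
    by_job = defaultdict(list)
    for name, start in sessions:
        by_job[name].append(start)
    return {name: sorted(starts)
            for name, starts in by_job.items() if len(starts) > 1}

# Example: "digest" fired at t=0 and again at t=900 (a 15-minute
# interval), so two instances are active at once.
active = [("digest", 0), ("digest", 900), ("heartbeat", 60)]
```

Start times for a duplicated job that differ by exactly the schedule interval, as in the example data above, are the signature to look for.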

    Kill all sessions for cron job [job-name] except the most recently started one. Show me the session IDs that were killed.

Fix 1: set a job timeout shorter than the schedule interval

The simplest protection against pileup is a job timeout set to less than the schedule interval. If the job fires every 30 minutes and has a 25-minute timeout, the worst case is a job that runs for 25 minutes and then terminates before the next fire. There may be a gap in execution, but there will not be a pileup.

    Show me the current timeout setting for cron job [job-name]. What is the schedule interval? Is the timeout shorter than the interval?

    Update cron job [job-name]: set timeout to [interval minus 20%] seconds. For example, if the job runs every 15 minutes (900 seconds), set timeout to 720 seconds.
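
The "interval minus 20%" rule generalizes to any interval. A minimal sketch of the arithmetic:

```python
def safe_timeout(interval_seconds, buffer_fraction=0.2):
    """Timeout = schedule interval minus a safety buffer, so a
    stuck run always terminates before the next scheduled fire."""
    return int(interval_seconds * (1 - buffer_fraction))
```

`safe_timeout(900)` gives 720, matching the 15-minute example above; a 30-minute job gets a 24-minute timeout.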

Timeout does not guarantee completion

A timeout terminates the session but does not roll back any work the job completed before the timeout. If the job was mid-way through writing a state file when it timed out, the state file may be partial. Add an explicit check to the job prompt: “If this run is approaching its time limit and has not finished, write whatever state you have so far and note that the run was truncated.”

The timeout approach works well for jobs where a truncated run is acceptable. For jobs where partial execution causes more problems than missing a run entirely, use the lock file approach instead.

Fix 2: use a lock file in the job prompt

A lock file is a file that the job creates at the start of a run and deletes at the end. If a new instance starts while the previous one is still running, it checks for the lock file and exits immediately instead of running in parallel.

The lock file pattern implemented in the job prompt:

    Update the prompt for cron job [job-name] to add lock file handling. The first step of the prompt should be: “Check whether workspace/cron-locks/[job-name].lock exists. If it exists, read its contents to get the timestamp at which it was created. If the lock is less than [timeout] minutes old, exit immediately with message ‘Job already running, skipping this instance.’ If the lock is older than [timeout] minutes, the previous run likely crashed; delete the stale lock and continue. If the lock does not exist, create it with the current timestamp and continue with the job.” The last step should be: “Delete workspace/cron-locks/[job-name].lock.”

    Create the directory workspace/cron-locks/ if it does not exist. This is where lock files for cron job single-instance enforcement will be stored.

The stale lock check is critical. If a job crashes without reaching the cleanup step, the lock file persists and all subsequent runs skip execution indefinitely. The stale lock threshold should be comfortably longer than the maximum expected runtime of the job: if the job normally takes 5 minutes, set the stale threshold to 10 minutes.
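
In OpenClaw the agent performs these steps from the prompt, but the underlying logic is simple enough to sketch. This is an illustrative Python version, assuming the workspace/cron-locks/ layout and timestamp-as-contents format described above (article conventions, not OpenClaw APIs):

```python
import os
import time

LOCK_DIR = "workspace/cron-locks"  # directory layout assumed from this article

def acquire_lock(job_name, stale_after_seconds):
    """Return True if this run may proceed. A fresh lock means
    another instance is active; a lock older than
    `stale_after_seconds` is treated as a crashed run and replaced."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    path = os.path.join(LOCK_DIR, job_name + ".lock")
    if os.path.exists(path):
        with open(path) as f:
            created = float(f.read().strip() or 0)
        if time.time() - created < stale_after_seconds:
            return False  # fresh lock: skip this instance
        os.remove(path)   # stale lock: previous run likely crashed
    with open(path, "w") as f:
        f.write(str(time.time()))  # lock contents = creation timestamp
    return True

def release_lock(job_name):
    """The job's final step: remove the lock so the next run proceeds."""
    path = os.path.join(LOCK_DIR, job_name + ".lock")
    if os.path.exists(path):
        os.remove(path)
```

Note the check-then-create sequence is not atomic; the FAQ below discusses why that theoretical race is rarely a problem in practice for scheduler-fired jobs.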

Fix 3: switch to completion-based scheduling

For jobs where pileup risk is high and partial execution is unacceptable, the most reliable approach is to abandon time-based scheduling entirely and use completion-based scheduling instead. Each run schedules the next run only after it completes successfully.

The pattern: the job itself creates or updates a “next run” entry in a queue file at the end of each successful execution. A lightweight heartbeat checker (running every minute or every few minutes) reads the queue file and triggers the job when the scheduled time has passed.
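
The queue-file handoff between the job and the heartbeat checker can be sketched in a few lines. The file path and JSON shape here are assumptions for illustration, not an OpenClaw convention:

```python
import json
import os
import time

QUEUE_FILE = "workspace/cron-state/next-run.json"  # assumed path for this sketch

def schedule_next(job_name, delay_seconds):
    """Run as the job's last step: record when the next run should
    start, measured from completion rather than from the clock."""
    os.makedirs(os.path.dirname(QUEUE_FILE), exist_ok=True)
    with open(QUEUE_FILE, "w") as f:
        json.dump({"job": job_name, "next_run": time.time() + delay_seconds}, f)

def due(now=None):
    """Run by the minute-level heartbeat: is the job due yet?"""
    if not os.path.exists(QUEUE_FILE):
        return False
    with open(QUEUE_FILE) as f:
        entry = json.load(f)
    return (now if now is not None else time.time()) >= entry["next_run"]
```

Because `schedule_next` only runs after a successful completion, a long run simply pushes the next start later instead of overlapping with it.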

    Explain how I would implement completion-based scheduling for a cron job that normally runs every hour but sometimes takes up to 45 minutes. I want the next run to start only after the previous one finishes, not at the next clock interval. What files would I need and what would each job step look like?

Completion-based scheduling eliminates OpenClaw cron job pileup entirely, at the cost of schedule predictability. The job will not always run at exactly the same clock time, especially after a long run. For most monitoring and maintenance tasks, the occasional schedule shift is acceptable. For jobs where exact timing matters (like a morning brief that needs to arrive before 8am), use the lock file approach instead and accept the occasional skipped run.

Why jobs start running slow

Pileup is usually a symptom, not the root cause. The root cause is a job that started running longer than expected. Understanding why the job slowed down prevents the pileup from recurring even after the concurrency controls are in place.

The most common reasons a cron job starts running longer than expected:

  • Model cold start: Local models (Ollama) need to load into memory before processing the first request. If OLLAMA_KEEP_ALIVE is not set to -1, the model unloads between requests and cold-starts on the next run. Each cold start adds 10 to 60 seconds depending on the model size.
  • Growing context: The job reads files that have grown since setup. A state file that started at 2 KB is now 80 KB because it has been accumulating entries. The model processes all of it on every run.
  • API rate limit backoff: The job hits a rate limit and waits for the retry window. If rate limits are consistent, every run takes longer than expected by the backoff duration.
  • Increased tool call depth: The job prompt was expanded over time with additional steps. Each new step adds processing time and the cumulative effect is a job that now takes three times as long as it did originally.

    Show me the last 5 run times for cron job [job-name]. Is the duration trending up? Compare the first run time to the most recent run time. What changed in the job prompt or data sources that could explain the increase?

Fixing Ollama cold start delays

Cold start is the most common cause of a local-model cron job suddenly running long. A job that ran in 90 seconds when OLLAMA_KEEP_ALIVE was set permanently now takes 4 minutes because the model has to reload from disk on every scheduled fire.

    Check the current OLLAMA_KEEP_ALIVE setting on this server. Run: systemctl show ollama --property=Environment and look for OLLAMA_KEEP_ALIVE. If it is not set to -1 or a long duration, show me how to fix it.

Setting OLLAMA_KEEP_ALIVE to -1 keeps the model in memory permanently after the first load. For OpenClaw cron concurrency control, this is one of the highest-value single config changes available because it removes a major source of runtime variance. The first run after a gateway restart still incurs the cold start cost, but all subsequent runs within the session benefit from the model already being in memory. For a model that adds 45 seconds of cold start to every run, this fix alone can cut total runtime by 30 to 50 percent.

    Update the Ollama systemd service to set OLLAMA_KEEP_ALIVE=-1. Show me the command to add this environment variable to the service, reload systemd, and restart Ollama. Do not run the restart command yet; show me the full plan first.

Fixing context bloat in data reads

A job that reads a growing file on every run gets slower over time as the file grows. State files, log files, and queue files all grow if nothing is cleaning them up. The fix is to read only the relevant portion of the file rather than the whole thing.

    For cron job [job-name]: what files does the prompt instruct the agent to read? For each file, show the current file size. Flag any file over 50 KB that the job reads in full on every run.

For a growing file, update the prompt to read only the relevant portion. For a state file, the job only needs the most recent entry. For a queue file, the job only needs rows with PENDING status. For a log file, the job only needs the last 100 lines.

    Update the prompt for cron job [job-name]: instead of reading workspace/pipeline/ARTICLE-QUEUE.md in full, read only the lines that contain “PENDING” status. Show me the revised read instruction before applying it.

Monitoring job runtime trends

The best time to catch a pileup risk is before it becomes a pileup. Monitoring runtime trends on your longest jobs gives you early warning when a job is creeping toward its schedule interval.

    Create a cron job that runs every Sunday at 11pm America/New_York. Prompt: “For each active cron job, look up the last 4 run durations. Calculate the average and the trend (is it going up, down, or stable?). Flag any job where the average runtime exceeds 70% of its schedule interval. Send a Telegram report listing each job, its average runtime, its schedule interval, and a risk level (safe / warning / critical).” Model: ollama/phi4:latest. Deliver to Telegram [your-chat-id].

A job whose average runtime is at 70% of its interval is at warning level. It has buffer, but that buffer is shrinking. A job at 90% utilization is one slow run away from a pileup. The weekly runtime report catches these before they become active pileup incidents.
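
The risk bands in the report prompt map to a small classification function. A sketch of the same 70%/90% thresholds:

```python
def risk_level(avg_runtime_seconds, interval_seconds):
    """Map the runtime/interval ratio to the risk bands used in
    the weekly report: 70% utilization is warning, 90% is critical."""
    utilization = avg_runtime_seconds / interval_seconds
    if utilization >= 0.9:
        return "critical"
    if utilization >= 0.7:
        return "warning"
    return "safe"
```

For example, a job averaging 650 seconds on a 900-second interval sits at roughly 72% utilization and lands in the warning band.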

Designing schedules that are pileup-resistant from the start

The most effective pileup prevention is building the schedule with realistic buffer time from the start rather than retrofitting concurrency controls later. Three guidelines for pileup-resistant schedule design:

  • Measure before scheduling. Before setting the schedule interval, run the job manually three times and measure the actual runtime. Use the slowest of the three as your baseline, not the average. Add 50% buffer. If the slowest manual run took 4 minutes, the minimum safe interval is 6 minutes.
  • Never schedule at 100% utilization. A job that takes 14 minutes on a 15-minute schedule has no buffer for a slow model, a rate limit backoff, or a slightly larger-than-usual data read. Target 60 to 70% utilization at most.
  • Separate fast jobs from slow jobs. Do not mix a 30-second heartbeat and a 10-minute data collection job on the same interval. Run the heartbeat on a short interval and the data collection on a longer one. If they share an interval, the heartbeat will sometimes wait behind the data collection job for model access.

    Review all my active cron jobs. For each one, show the schedule interval and the average runtime from the last 4 runs. Calculate the utilization percentage (runtime / interval). Flag anything above 60%.

Understanding the cron session lifecycle

Each time a cron job fires, OpenClaw creates an isolated session. That session has a lifecycle: it starts, loads the model, processes the prompt, calls any tools, produces output, delivers it, and closes. The scheduler does not know or care what state any previous session is in. It fires according to the schedule, full stop.

This means the pileup problem is architectural, not a bug. The scheduler is doing exactly what it is designed to do. The concurrency controls are your responsibility to implement, not something the scheduler enforces for you.

Understanding the session lifecycle also tells you exactly where pileup controls can be applied most effectively:

  • At the session start: The prompt can check for a lock file and exit immediately if one exists. This is the most reliable control point because it runs before any model or tool calls consume resources.
  • At the config level: The job timeout setting terminates sessions that run beyond a maximum duration. This is a hard backstop, not a concurrency control, but it prevents any single slow run from blocking resources indefinitely.
  • At the schedule level: Setting a longer schedule interval with realistic buffer is the simplest pileup prevention of all. A job that cannot possibly overlap because the interval is twice the maximum runtime never needs a lock file.

    For each of my active cron jobs, show: schedule interval, timeout setting, and the last 3 run durations. I want to see which jobs have meaningful buffer between runtime and interval and which ones are running close to the limit.

Rate limit backoff as a pileup trigger

A job that normally runs in 90 seconds can take 10 minutes if it hits a rate limit and has to back off and retry. Rate limit backoff is one of the most common reasons a previously stable cron job suddenly starts running long enough to cause pileup.

The sequence: Job A fires at 9:00am, hits a rate limit at 9:01am, waits 8 minutes for the retry window, finishes at 9:10am. Job B fires at 9:15am (15-minute interval), encounters the same rate limit condition from shared API quota, and also runs long. By 9:30, you have two or three instances of the same job either still running or having just finished, each having consumed far more API quota than expected.

    Check whether any of my cron jobs are hitting rate limits. Look at the last run output for each job and flag any that contain rate limit errors, retry messages, or unusually long wait times between steps.

The fix for rate-limit-triggered overlap is two-part: add a fallback model to the job config so rate limits on the primary model switch to a secondary rather than waiting, and increase the schedule interval to give the quota time to reset between runs.

    Update cron job [job-name]: add a fallback model (ollama/phi4:latest) so that if the primary model hits a rate limit, the job switches to the local model instead of waiting for the retry window. Show me the updated config before applying it.

How prompt length and step count affect runtime

Every step added to a cron job prompt adds processing time. A prompt that started with three steps and now has eight steps takes roughly twice as long to execute, not because any individual step got slower but because the model processes more total content and makes more tool calls.

This is how many pileup problems develop gradually: the job was designed with a 5-minute runtime and a 15-minute interval. Over several weeks, three new steps were added to the prompt. The runtime is now 11 minutes. The next scheduled fire starts 4 minutes before the previous run finishes.

    Show me the full prompt for cron job [job-name] and count the number of distinct steps. For each step, estimate whether it requires a tool call (file read, exec command, API call) or just text processing. Steps with tool calls are slower. Steps that are pure text processing are faster.

For prompts that have grown to the point where runtime exceeds the safe threshold, the options are:

  • Split the job into two jobs: one that collects data (cheap, fast) and one that synthesizes and delivers (slower, runs less frequently).
  • Remove steps that are no longer earning their place. Every prompt accumulates steps that made sense at the time but are not critical now.
  • Switch from a full read to a targeted read for any data source that has grown. Reading the last 10 lines of a file is faster than reading all 500 lines.
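
The targeted reads in the first and third bullets look like this in plain Python (an illustrative sketch; in OpenClaw the agent performs the equivalent with its own file tools):

```python
def pending_rows(path):
    """Read only the queue rows still marked PENDING instead of
    the whole file."""
    with open(path) as f:
        return [line.rstrip("\n") for line in f if "PENDING" in line]

def tail(path, n=100):
    """Last n lines of a log file. Reading everything and slicing
    is fine at the sizes cron state files reach; a seek-based tail
    would suit genuinely large logs."""
    with open(path) as f:
        return f.readlines()[-n:]
```

Either approach keeps the content the model processes roughly constant even as the underlying file grows.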

    Audit the prompt for cron job [job-name]. Which steps could be removed without changing the core purpose of the job? Which data reads could be made more targeted to reduce the amount of content processed?

Emergency stop: killing all instances of a piling job

When a pileup is actively happening and you need to stop it immediately, the fastest path is to kill all active sessions for the job and then disable the job until you have fixed the root cause.

    List all active sessions. Kill every session associated with cron job [job-name]. Then show me the cron job status to confirm it is not currently running any new instances.

    Disable cron job [job-name] temporarily. I want to fix the runtime issue before the next scheduled fire. Confirm the job is paused and show me how to re-enable it when I am ready.

After disabling the job, check for any state file corruption from the concurrent runs before re-enabling. If two instances were writing to the same state file simultaneously, the file may contain merged or partial content from both runs. Reset the state file to a clean state before the next run reads it.

    Read workspace/cron-state/[job-name].json. Does it look like valid JSON with a coherent structure? If it appears corrupted or merged from multiple writes, delete it so the next run starts fresh. Show me the current file contents before deleting anything.

Special handling for inherently long-running jobs

Some jobs are designed to take a long time: deep research tasks, batch content generation, multi-step analysis pipelines. These jobs cannot be sped up without changing their core function. For these, pileup prevention requires a different approach than tuning the schedule interval.

The right pattern for inherently long-running jobs is: run them on demand rather than on a schedule, or run them on a schedule that is deliberately set much longer than any expected runtime.

A content generation job that takes 30 to 90 minutes should not be on an hourly schedule. It should either be triggered manually when needed or set to run once daily or weekly, with a timeout of 120 minutes and a lock file. The schedule exists only to ensure it runs regularly, not to run it as frequently as possible.

    I have a cron job that takes 30 to 90 minutes depending on how much content it generates. It is currently on a 2-hour schedule. Design a pileup-resistant configuration for it: schedule interval, timeout, lock file, and state handling. The job must complete cleanly even if it takes the full 90 minutes.

Preventing concurrent write corruption

When two instances of the same cron job run simultaneously, both may attempt to write to the same state file at the end of their respective runs. The resulting file depends on which write finishes last: it may contain content merged from both runs, or only the output of the second write, with the first overwritten entirely.

The corruption pattern: Instance A reads the state file at 9:00am and starts processing. Instance B starts at 9:15am (next schedule interval) and reads the same state file, which still has the pre-9:00am values because Instance A has not finished yet. Both instances compute their results using the same baseline. Both write back to the file at roughly the same time. One write succeeds; the other is lost.

For jobs that write state files, the lock file approach is the most complete protection because it prevents Instance B from running at all while Instance A is active. The timeout approach does not protect against concurrent writes because Instance A may time out mid-write, producing a partial file.

    Check workspace/cron-state/ for any files that appear to contain merged content from multiple writes. Signs include: JSON files with duplicate keys, files with two different date fields, or files that are significantly larger than expected. List any files that look suspicious.

If you find a corrupted state file, delete it and let the next run start fresh rather than trying to repair it. A fresh start is always safer than attempting to parse and fix merged JSON. The state file for most OpenClaw cron jobs can be reconstructed entirely from scratch in a single clean run.
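
A quick way to decide "corrupted or not" is simply whether the file still parses. A sketch of that check, assuming JSON state files as in this article's examples:

```python
import json

def state_is_valid(path):
    """Return True if the state file parses as a single JSON
    document. A file merged from two concurrent writes almost
    always fails to parse, e.g. because it contains two
    concatenated JSON documents."""
    try:
        with open(path) as f:
            json.load(f)
        return True
    except (OSError, ValueError):
        return False
```

A file that parses is not guaranteed to be semantically correct, but a file that fails to parse is definitely a candidate for deletion.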

System resource impact of pileup

A pileup is not just a logical problem with overlapping job executions. It has real system resource consequences that can affect everything else running on the server, including the OpenClaw gateway itself.

Each cron session loads the model and holds context memory for the full duration of the run, regardless of whether useful work is being done. If phi4:latest uses 8 GB of memory for a single session, three concurrent instances of the same job are attempting to hold 24 GB simultaneously. On a VPS with 16 GB of RAM, the third instance will either fail to load the model or cause the other processes to swap to disk, slowing everything down.

    Run: free -h to show current memory usage. Run: ps aux --sort=-%mem | head -10 to show the top memory-consuming processes. Is anything using unexpectedly high memory right now? Are there multiple Ollama or OpenClaw processes that should not be there?

After resolving a pileup and killing the duplicate sessions, check system resources before declaring the incident closed. Memory freed by killed sessions is released immediately, but disk I/O pressure from concurrent writes may take a few minutes to normalize. If the server was under severe memory pressure, a quick restart of the Ollama service clears any in-memory state that may have been left behind by killed sessions.

    After killing the pileup sessions: check memory and CPU usage. Are both returning to normal baseline levels? If memory is still elevated after 2 minutes, show me the process list so I can identify what is still holding memory.

Running a full cron pileup risk audit

Before any pileup actually happens, running a proactive audit across all cron jobs surfaces which ones are at risk. This takes 5 minutes and can prevent hours of incident response.

    Run a full OpenClaw cron job pileup risk audit. For every active cron job: (1) show the schedule interval in seconds, (2) show the average runtime from the last 3 runs in seconds, (3) calculate the utilization ratio, (4) check whether a timeout is set and whether it is shorter than the interval, (5) check whether the prompt contains lock file handling. Output a table with columns: Job Name, Interval, Avg Runtime, Utilization %, Has Timeout, Has Lock File, Risk Level (Safe/Warning/Critical).

Run this audit once when first setting up cron jobs, then repeat after adding new steps to an existing job prompt or after any job has had a pileup incident. The audit takes longer to read than to run, and it gives you a complete picture of your pileup-prevention posture across all jobs at once.

Combining controls for maximum reliability

For jobs that absolutely cannot pile up, use all three controls together: a schedule interval with at least 50% buffer, a timeout shorter than the interval, and a lock file in the prompt. Each control catches a different failure mode.

  • Buffer interval prevents pileup during normal operation when the job runs at its expected speed.
  • Timeout prevents any single run from blocking the schedule indefinitely, regardless of what caused it to run long.
  • Lock file prevents concurrent execution on the rare occasion when the scheduler fires while a legitimate long run is still in progress within the buffer window.

The three controls are complementary rather than redundant, and each one handles a different part of the failure space. A job with all three has defense in depth: it is hard to pile up, and when something unusual happens (a rate limit storm, a model cold start, an unusually large data read), at least one of the three controls will catch it before the pileup compounds.

    For my most critical cron job [job-name]: (1) set the schedule interval to [runtime times 2], (2) set the timeout to [runtime times 1.5], (3) add lock file handling to the prompt. Show me the full updated config and prompt before applying anything.

For lower-stakes jobs where a missed run is acceptable, the buffer interval alone is usually sufficient. Add the lock file only when concurrent execution would produce incorrect results (state file writes, rate-limited APIs, delivery deduplication). Add the timeout only when unbounded runtime is a real risk based on actual observed run history.

Frequently asked questions

These questions cover the edge cases and operational questions that come up once pileup prevention is in place.

My job ran long once but normally it is fine. Do I still need pileup protection?

Yes. The pileup risk is proportional to the variance in runtime, not just the average. A job that normally takes 2 minutes but occasionally takes 20 (due to rate limits, model cold starts, or large data reads) needs pileup protection more than a job that consistently takes 8 minutes on a 15-minute interval. Use a lock file for jobs with high runtime variance.

Can I set concurrency limits at the OpenClaw config level instead of per-job?

OpenClaw does not have a native global concurrency limit for cron sessions. The controls available are per-job timeout (in the cron job config) and prompt-level lock files (implemented in the job prompt). For system-wide concurrency control, you would need to implement a shared lock file that all jobs check before starting, which is more complex but possible using the same lock file pattern described above.

The lock file approach requires the job to delete its own lock file. What if the job crashes before reaching that step?

That is the stale lock problem. Handle it with the stale lock threshold: check the lock file creation time on every run, and if the lock is older than the maximum expected runtime, delete it and proceed. Set the stale threshold to twice the normal runtime. If the job normally takes 3 minutes, set the stale threshold to 6 minutes. A lock older than 6 minutes indicates a crash, not an active run.

I have a job that must not run concurrently under any circumstances. Is the lock file approach reliable enough?

For most cases, yes. The lock file approach has one theoretical gap: if two instances check for the lock simultaneously before either has written it, both may proceed. In practice this does not happen with OpenClaw cron jobs because the scheduler fires one instance at a time and concurrent starts from two simultaneous scheduler firings are extremely rare. If you need strict guarantees, combine the lock file approach with a timeout that is shorter than the schedule interval. The combination eliminates both the race condition window and the stale lock risk.

How do I know when a job was killed by a timeout vs. when it completed normally?

Add a completion marker to the final step of the job prompt: “Write the text ‘COMPLETED’ to workspace/cron-state/[job-name]-last-result.txt along with the current timestamp.” If a run times out before reaching this step, the file will either be missing or will contain the timestamp from the previous successful run. Check this file in your weekly cron audit.

Can multiple jobs share a single model instance, or does each cron session get its own?

Each cron session gets its own connection to the model endpoint but they share the underlying model process. For Ollama, this means concurrent sessions are processed sequentially by the model (Ollama handles one request at a time by default). Concurrent cron sessions do not crash the model but they do queue behind each other, which is another reason pileup makes slow jobs slower: the second instance waits in the Ollama queue behind the first, then starts its own full processing run when it gets access.

Is there a way to get notified when a pileup is happening in real time?

Yes. Add a pileup detector to your daily health check: instruct the agent to count active cron sessions grouped by job name, and send an immediate Telegram alert if any job has more than one active session. This runs as part of the health check rather than as a separate monitor, so it requires no additional cron job overhead.


Go deeper

  • Cron: My cron job works in testing but silently does nothing in production. Five root causes of silent cron failures and the fix for each.
  • Cron: How to schedule a daily task in OpenClaw without building a queue system. The complete setup guide for cron jobs, schedule types, and delivery modes.
  • Cron: How to pass output from one OpenClaw cron run into the next. File-based state handoff between isolated sessions for trend tracking.