My OpenClaw agent failed overnight and I didn’t find out until morning

You set your agent up to run tasks overnight. You wake up the next morning and nothing happened. Or something happened but you have no idea what, because there is no record. The agent ran into a problem, stopped, and waited. Nobody told you. This is not a fluke. It is the default behavior when you have not told your agent what to do when something goes wrong. This article explains how to add failure tracking, retry logic, and alerts so that a broken task notifies you immediately instead of disappearing silently.

TL;DR

Agents stop silently at the first error by default because stopping is the safe behavior when no failure instructions exist. The fix requires three additions: a status column in your task list that tracks PENDING, DONE, and FAILED states; explicit failure handling instructions in your agent prompt that define retry limits and escalation behavior; and a Telegram or Discord alert that fires when a task hits its retry limit. None of these are complex to add. None are in place by default.

Every indented block in this article is a command you can paste directly into your OpenClaw chat. Your agent will run it and report back. You do not need to open a terminal, edit any files manually, or navigate any filesystem.

Why agents fail silently

When your agent hits a problem during a task and has no instructions for what to do next, it stops. It does not alert you. It does not try again. It does not skip to the next task. It stops and waits.

This is the safe default behavior. An agent that keeps pushing through errors on its own could make things significantly worse: retrying an action that charges money, retrying a file write that corrupts data, or escalating an error into a cascade. Stopping is conservative. But safe is not the same as useful. The fix is giving your agent explicit instructions for what to do when something breaks, so “stopping” becomes “stopping and notifying you” instead of “stopping and going dark.”

    Look at what happened in this session overnight. Were there any tasks attempted but not completed? If so: what was the task, what went wrong, and where did you stop? I want to understand the failure before I change anything.

The three additions that fix silent task failure

1. A status column in your task list

If your agent works through a task list, that list needs to record what happened to each task. Not just “done” or “not done” but at minimum three states: PENDING (waiting to run), DONE (completed successfully), and FAILED (hit retry limit or unrecoverable error). Without this, you have no way to know after the fact which tasks ran, which were skipped, and which need your attention.

Add two columns to your task list: a status column with values PENDING, DONE, FAILED, and RETRY; and an attempts column that counts how many times each task has been tried. These two columns give your agent enough information to implement retry logic and give you a readable failure record when you check in the morning.

    Read my task list file. Does it have a status column that distinguishes between successful completion and failure? Does it track how many times each task has been attempted? If not, show me what the updated schema would look like with those additions. Do not change anything yet.
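As a sketch, that schema can also be expressed in code. This assumes tasks are stored as plain records; the field names mirror the columns described above and are illustrative, not an OpenClaw API:

```python
# Illustrative task record with the status and attempts columns.
# Field names are assumptions, not an OpenClaw schema.
def new_task(task_id, description):
    return {
        "id": task_id,
        "description": description,
        "status": "PENDING",  # PENDING | DONE | FAILED | RETRY
        "attempts": 0,        # how many times this task has been tried
    }
```

Every task starts PENDING with zero attempts; the queue processor is the only thing that should move it to another state.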

2. Explicit failure handling instructions in your agent prompt

Your agent follows instructions. If the instructions do not say what to do when a task fails, the agent has no basis for any action other than stopping. The failure handling instructions need to specify three things: when to retry, how many times, and what to do when retries are exhausted.

A minimal failure handling instruction looks like this:

    Add failure handling instructions to my task-processing agent prompt. The rules should be: if a task fails, mark it RETRY and increment the attempts counter. If attempts is less than 3, try the task again at the next run. If attempts reaches 3, mark the task FAILED, stop retrying it, and send me a Telegram message with the task name and the error. Then move to the next PENDING task. Never stop processing the queue because one task failed.

That last line is critical. The most common consequence of silent failure is not just a failed task but a frozen queue. The agent fails task 3, stops, and tasks 4 through 10 never run. Explicit instructions to move on to the next task after a failure keep the queue moving even when individual tasks break.
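As a sketch, the never-stop-the-queue rule might look like this in code. The task fields match the status and attempts columns described earlier; `run_task` and `notify` stand in for whatever executes the work and sends the alert (both names are hypothetical):

```python
MAX_ATTEMPTS = 3

def process_queue(tasks, run_task, notify):
    """Process every PENDING task; one failure never halts the queue."""
    for task in tasks:
        if task["status"] != "PENDING":
            continue
        try:
            run_task(task)
            task["status"] = "DONE"
        except Exception as err:
            task["attempts"] += 1
            if task["attempts"] >= MAX_ATTEMPTS:
                task["status"] = "FAILED"
                notify(f"Task {task['id']} failed after "
                       f"{task['attempts']} attempts: {err}")
            else:
                task["status"] = "RETRY"
            # Crucially: no break here -- continue to the next task.
```

The whole pattern hangs on the absence of a `break` in the failure branch: a failed task is recorded and skipped, never allowed to freeze the rest of the queue.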

3. A notification channel for failure alerts

The final piece is making sure you actually hear about failures when they happen. Marking a task FAILED in a file you might read tomorrow morning is not the same as a Telegram message that arrives immediately. For anything you are running overnight or over a weekend, you want the alert to reach you in real time.

    Update my failure handling instructions to include a Telegram notification when any task hits its retry limit. The message should include: the task name, the number of attempts made, the error message from the last attempt, and the time of failure. Keep the message under 200 characters so it reads well as a notification.

Building retry logic that actually works in practice

Retry logic sounds simple but has several failure modes of its own. A naive “retry up to 3 times” rule can make things worse if you are not careful about what you are retrying and why.

Not all failures should be retried

Some failures are transient and worth retrying: a temporary API rate limit, a network timeout, a service that was momentarily unavailable. These are likely to succeed on the next attempt. Other failures are permanent and should not be retried at all: a missing file, an invalid API key, a malformed task description. Retrying these wastes attempts and delays the alert you need.

    Update my failure handling instructions to distinguish between retryable and non-retryable failures. If the error indicates a rate limit or timeout, retry up to 3 times with a 60-second wait between attempts. If the error indicates a missing file, invalid credentials, or malformed input, mark the task FAILED immediately without retrying and send me the alert.

Adding a delay between retries

Retrying immediately after a failure is rarely the right approach. If a task failed because of a rate limit, retrying in the next second will hit the same rate limit. A minimum delay of 60 seconds between retry attempts handles most transient failures without wasting the retry budget on back-to-back attempts that will all fail for the same reason.

    For tasks marked RETRY, add a “retry_after” timestamp field that is set to the current time plus 60 seconds when the retry is logged. The task processor should only pick up a RETRY task if the current time is past the retry_after timestamp. This prevents immediate back-to-back retries.
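A minimal sketch of that timestamp logic, assuming times are stored as Unix epoch seconds (a format choice, not a requirement):

```python
import time

RETRY_DELAY_SECONDS = 60  # matches the 60-second wait described above

def schedule_retry(task, now=None):
    """Mark a failed task RETRY and stamp when it may run again."""
    now = time.time() if now is None else now
    task["status"] = "RETRY"
    task["retry_after"] = now + RETRY_DELAY_SECONDS

def retry_is_due(task, now=None):
    """Only pick up a RETRY task once its delay has elapsed."""
    now = time.time() if now is None else now
    return task["status"] == "RETRY" and now >= task["retry_after"]
```

Passing `now` explicitly keeps the functions testable; in production both default to the current clock.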

The retry counter is the guard, not the retry rule

A task that is marked RETRY with an attempts count of 2 should be picked up on the next queue run and tried again. The retry counter is how you enforce the limit. The task never needs to know its own retry history; the queue manager checks the counter and decides. This separation means your task definitions stay simple and the retry logic lives in one place: the queue processor instructions.

    My task processor currently picks up any PENDING task. Update it to also pick up RETRY tasks where the attempts counter is below 3 and the retry_after timestamp has passed. RETRY tasks should be processed with the same priority as PENDING tasks of the same priority level.
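The pickup rule reduces to a single predicate. This sketch assumes the attempts and retry_after fields described above; the field names are illustrative:

```python
MAX_ATTEMPTS = 3

def eligible(task, now):
    """A task may be picked up if it is PENDING, or RETRY with
    attempts remaining and its retry_after timestamp in the past."""
    if task["status"] == "PENDING":
        return True
    return (task["status"] == "RETRY"
            and task["attempts"] < MAX_ATTEMPTS
            and now >= task.get("retry_after", 0))
```

Keeping the rule in one predicate means the retry policy lives in exactly one place, which is the point made above about the queue processor owning the retry logic.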

Building an activity log

A status column tells you the current state of each task. An activity log tells you what happened over time. These serve different purposes. The status column answers “what is broken right now?” The activity log answers “what happened while I was asleep?” For serious overnight automation, you want both.

    Create a task activity log at workspace/queue-activity.log. Every time a task is started, completed, or fails, append a line with: ISO timestamp, task ID, event type (started/completed/failed/retry), and a brief note. Use append mode so the log grows over time without overwriting previous entries.
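A sketch of the log-line format, assuming a pipe-delimited layout (the exact layout is a choice; what matters is that every line carries the four fields):

```python
from datetime import datetime, timezone

def format_event(task_id, event, note="", stamp=None):
    """One activity-log line: ISO timestamp | task | event | note."""
    if stamp is None:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return f"{stamp} | {task_id} | {event} | {note}"

def log_event(path, task_id, event, note=""):
    """Append, never overwrite, so the log grows over time."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(format_event(task_id, event, note) + "\n")
```

A fixed delimiter makes the log trivially parseable later, which pays off when the morning report and health checks need to read it back.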

An activity log is also your debugging tool when something goes wrong in a non-obvious way. If a task marked DONE produced incorrect output, the activity log shows exactly when it ran and how long it took. If a task is marked RETRY three times before finally succeeding, the log shows the error messages from each failed attempt alongside the final success.

    Read workspace/queue-activity.log. Show me a summary of everything that happened in the last 24 hours: how many tasks ran, how many succeeded, how many failed, and which ones are still in RETRY state. Format it as a brief report I can read in 30 seconds.

The morning report pattern

Rather than checking the activity log manually each morning, set up a cron job that reads it and sends you a summary. This is the pattern that turns overnight automation from something you need to actively check into something that reports to you.

    Create a cron job that runs every morning at 8am America/New_York. It should read workspace/queue-activity.log and send me a Telegram message with: total tasks run overnight, total successes, total failures, names of any tasks currently in FAILED state, and any tasks that retried more than once. Keep the message under 500 characters.

The morning report is worth setting up even if you have real-time failure alerts. Real-time alerts tell you when something breaks. The morning report gives you a complete picture of what ran successfully, which matters for confirming that the important tasks completed and not just for knowing that something failed.

Recovering from a failed overnight automation run

When you arrive in the morning and find failed tasks, the recovery sequence matters. Do not just reset the status and rerun. Check why the task failed first. A task that failed three times for the same reason will fail a fourth time if you reset and rerun without fixing the underlying problem.

    Show me all tasks currently marked FAILED. For each one: what was the error on each attempt, is there a pattern across the failures, and what do I need to fix before retrying? Do not reset anything yet.

Once you understand the failure and have fixed the underlying issue, reset the task to PENDING and reset its attempts counter to zero. Do not leave the old attempt count in place; a task reset to PENDING with an attempts count of 3 will immediately be flagged as exhausted on the next run.

    I have investigated the failure and fixed the underlying issue. Reset task [task ID] to status PENDING and set its attempts counter back to 0. Log this reset in the activity log with a note that it was manually reset after investigation.

Escalation: when retries alone are not enough

Some failures need more than a notification. A payment processing task that fails three times is not just a Telegram message situation. A data export that failed before a deadline needs immediate action. For high-stakes tasks, set up an escalation path that is more aggressive than a standard failure alert.

    I have tasks that are more critical than others. For tasks marked with priority:critical, update the failure handling to: notify immediately after the first failure (not after 3 attempts), send to both Telegram and Discord, and include the full error message rather than a short summary. For standard-priority tasks, keep the existing 3-attempt rule with a short notification.

Common error types and how to handle each

Different error types call for different responses. Understanding the most common failure patterns in OpenClaw automation saves time when diagnosing a failed overnight run.

API rate limits

The most common transient failure. The agent hit an API rate limit, the call returned a 429 error, and the task stopped. Retry with a delay. Most rate limits reset within 60 seconds. If a task hits rate limits consistently across multiple runs, the root cause is a task that calls a rate-limited API too frequently. Increase the interval between runs or switch to a model or service with a higher rate limit.

    Are there any tasks in my queue that have failed repeatedly due to rate limits? If so, what API are they hitting and how frequently are they running? Is there a way to reduce the call frequency or switch to an alternative that has a higher limit?

Missing or changed files

A task that reads a workspace file that has been deleted, moved, or renamed will fail immediately and will keep failing on retry. This is a non-retryable failure. Fix the file path reference in the task description and reset the task before rerunning.

Expired credentials

An API key or token that expired between when you set up the task and when it ran. This is also non-retryable. The credential needs to be refreshed in your config before the task can succeed. Real-time failure alerts are particularly valuable here: an expired credential caught on the first failure is easy to fix. One discovered three days later after 50 failed runs is more disruptive.

Model unavailability

The model specified in a task was unavailable at run time: at capacity, temporarily down, or removed from the provider. OpenClaw’s fallback chain handles some of these cases, but a task that specifies a model explicitly with no fallback will fail if that model is down. Review tasks with explicit model specifications and ensure they either have a fallback or use a model alias that routes to a fallback automatically.

    Check my cron job payloads. Do any of them specify an explicit model with no fallback? If a task is hardcoded to a specific model and that model goes down, will the task fail completely? Show me which tasks are at risk and what fallback options exist.

Testing your failure handling before relying on it

The worst time to discover your failure handling is not working is after a critical overnight task fails silently. Test it before you need it. The simplest test is a deliberately broken task.

    Add a test task to my queue with status PENDING. The task description is intentionally invalid: “Read the file at workspace/this-file-does-not-exist.txt and summarize it.” Run the queue processor and verify that: the failure is logged correctly, the attempts counter increments, the retry logic triggers, and I receive a Telegram notification when the retry limit is hit. Then delete the test task.

Running a deliberate failure test confirms every component is working: the status tracking, the activity log, the retry logic, and the notification channel. If any component is misconfigured, the test failure exposes it safely rather than a real task failure exposing it at the worst possible time.

Monitoring queue health over time

Individual task failures are the symptom. Queue health is the underlying metric. A queue that has been running for a week without review accumulates failed tasks, stale retries, and tasks that were completed but whose outputs were never used. Reviewing queue health periodically keeps the automation working correctly rather than degrading gradually into a state where more tasks are failing than succeeding.

    Give me a full health report on my task queue. Include: total tasks by status (PENDING, DONE, FAILED, RETRY), any tasks that have been in RETRY state for more than 24 hours, any tasks that completed successfully but whose output files are missing or empty, and any tasks that have been in PENDING state for more than 48 hours without being picked up. Flag anything that looks like it needs attention.

A healthy queue has a predictable ratio of DONE to FAILED tasks. If you see FAILED tasks accumulating faster than DONE tasks over time, the failure rate is climbing and the root cause needs investigation. If PENDING tasks are not being picked up at the expected rate, the queue processor may have stopped running or its schedule is misconfigured.

Queue health metrics worth tracking

Three numbers tell you most of what you need to know about queue health:

  • Success rate (last 7 days): DONE / (DONE + FAILED). A healthy queue running routine tasks should be above 90%. Below 80% is a signal to investigate what is failing and why.
  • Average retry count for completed tasks: Tasks that complete on the first attempt are healthy. Tasks that consistently need 2-3 attempts before succeeding indicate a fragile process that is technically working but is closer to failure than it looks.
  • Queue age (oldest PENDING task): A PENDING task that has been waiting for 48 hours and never been picked up means the queue processor is not running. The oldest PENDING task should almost never be more than one full processor cycle old.

    Read my activity log and calculate: success rate for the last 7 days, average retry count for tasks that eventually completed, and the age of the oldest PENDING task that has not been picked up. Flag any metric that is outside the healthy range.
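Those three numbers can be sketched from the status counts alone. This assumes you have already tallied DONE/FAILED counts, per-task attempt counts, and the oldest PENDING age from the log; all parameter names are illustrative:

```python
def queue_metrics(done, failed, attempts_for_done, oldest_pending_age_hours):
    """Summarize queue health from pre-tallied activity-log counts."""
    total = done + failed
    success_rate = done / total if total else 1.0
    avg_attempts = (sum(attempts_for_done) / len(attempts_for_done)
                    if attempts_for_done else 0.0)
    return {
        "success_rate": success_rate,              # healthy: above 0.90
        "avg_attempts_to_done": avg_attempts,      # healthy: close to 1.0
        "oldest_pending_hours": oldest_pending_age_hours,
        "healthy": success_rate >= 0.90 and oldest_pending_age_hours < 48,
    }
```

The thresholds (90% success, 48-hour pending age) are the ones suggested above; tune them to your own queue's normal behavior.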

Handling partial task completion

Some tasks are compound: they have multiple steps, and a failure partway through leaves the task partially completed. This is worse than a clean failure in some ways because the task looks like it ran but produced incomplete output. Partial completion failures are the hardest to catch with a simple status column, because the status might show FAILED even though half the work was done.

The solution is checkpoints within long tasks. For any task with more than three distinct steps, add a checkpoint write after each step that records what has been completed. If the task fails, the checkpoint tells you exactly where it stopped. If you reset and rerun the task, you can skip the already-completed steps.

    My task [task name] has multiple steps and sometimes fails partway through. Add checkpoint writes after each step: write to workspace/task-checkpoints/[task-id]-checkpoint.md with the current step number and a brief note on what was completed. If the task fails, I can read the checkpoint to see where it stopped before deciding whether to reset and rerun from the beginning or investigate the specific failure point.
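A sketch of checkpoint writes, assuming a multi-step task expressed as a list of callables and a `checkpoint` callback that records a note (both are hypothetical stand-ins for however your task actually runs):

```python
def run_with_checkpoints(steps, checkpoint):
    """Run steps in order, recording each completed step so a
    failure shows exactly where the task stopped."""
    completed = []
    for number, step in enumerate(steps, start=1):
        step()  # if this raises, the last checkpoint marks the stop point
        completed.append(number)
        checkpoint(f"step {number} completed")
    return completed
```

On a rerun, the same checkpoint record tells you which step numbers can be skipped.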

Idempotent tasks versus non-idempotent tasks

An idempotent task produces the same result whether it runs once or ten times. A file write that overwrites the same file is idempotent. An API call that sends a notification is not: running it ten times sends ten notifications. Retrying non-idempotent tasks requires care. If the task partially completed before failing, retrying it from the beginning may duplicate the already-completed portion.

Before setting a task to automatically retry, consider whether it is idempotent. If it is, automatic retries are safe. If it is not, you may want to require manual confirmation before retrying, or add logic that skips already-completed steps on a rerun.
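The distinction is easy to see in a toy example; both operations below are hypothetical stand-ins for real task work:

```python
# Idempotent: overwriting the same key twice leaves the same state.
state = {}
def write_report(content):
    state["report.md"] = content  # safe to rerun

# Non-idempotent: every rerun adds another notification.
sent = []
def send_notification(msg):
    sent.append(msg)              # rerunning duplicates the message

write_report("v1"); write_report("v1")                # state unchanged by rerun
send_notification("done"); send_notification("done")  # two messages went out
```

The overwrite converges to the same state no matter how many times it runs; the send accumulates a side effect on every run, which is exactly why it needs manual review before auto-retry.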

    Look at my FAILED and RETRY tasks. For each one, is the task idempotent (safe to run again from the beginning without side effects) or non-idempotent (could cause duplicates or unintended actions if rerun)? Flag any non-idempotent tasks that are set to auto-retry, since those need manual review before being reset.

Managing long-running tasks

A task that runs for 45 minutes without producing any output or a completion signal creates a different and harder-to-diagnose kind of problem than one that fails quickly with a clear error. OpenClaw has session time limits. A task that runs into those limits will terminate without a clean failure state, which means the status column will not reflect a failure and no alert will fire.

    Do any of my scheduled tasks take a long time to complete? If a task runs for more than 30 minutes and the session times out before it finishes, how would that appear in my activity log? Would it show as a failure or would the log entry simply be missing? How can I detect this kind of silent mid-task termination?

The practical guard against long-task termination is a heartbeat write: every few minutes, the task writes a timestamp to a heartbeat file. A separate monitoring check compares the heartbeat timestamp against the current time. If the heartbeat is more than 10 minutes old, the task has stopped responding and needs investigation.

    For my long-running task [task name], add a heartbeat write every 5 minutes to workspace/heartbeats/[task-id].txt with the current timestamp and the last completed step. Create a separate cron job that runs every 15 minutes and checks whether the heartbeat file is fresh. If the file is more than 10 minutes old while the task should still be running, send me a Telegram alert that the task may have stalled.
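The freshness check itself is simple. This sketch assumes the heartbeat file's first line is a Unix timestamp, which is a format choice for illustration, not an OpenClaw convention:

```python
import time

STALE_AFTER_SECONDS = 10 * 60  # matches the 10-minute rule above

def heartbeat_is_stale(last_beat_epoch, now=None):
    """True if the last heartbeat is older than the stale window."""
    now = time.time() if now is None else now
    return (now - last_beat_epoch) > STALE_AFTER_SECONDS

def check_heartbeat_file(path, now=None):
    """Read a heartbeat file whose first line is a Unix timestamp."""
    with open(path, encoding="utf-8") as fh:
        last_beat = float(fh.readline().strip())
    return heartbeat_is_stale(last_beat, now)
```

The monitoring cron job calls the check and alerts only when it returns True, so a healthy long task stays quiet.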

Queue cleanup and long-term archiving

A queue that grows indefinitely becomes harder to work with over time. Old DONE entries accumulate, the file gets large, and the queue processor takes longer to scan for the next PENDING task. Regular cleanup keeps the queue lean and responsive.

    Create a weekly queue maintenance task that runs every Sunday at 11pm America/New_York. It should: move all DONE entries more than 7 days old to an archive file at workspace/queue-archive.md, move all FAILED entries more than 14 days old to the same archive, and write a brief summary of what was archived to the activity log. The active queue file should only contain recent completed tasks and all PENDING, RETRY, and recent FAILED entries.

The archive file preserves the full history for audit and compliance purposes without slowing down the active queue. If you ever need to investigate what happened three weeks ago, the archive has the record. For day-to-day operation, the active queue stays manageable.

Recognizing a queue that needs a reset

Sometimes the cleanest solution is to stop the queue processor, resolve all outstanding issues, and start fresh rather than triaging a large backlog of failed and stale tasks. Signs that a reset may be the right move:

  • More than 20% of tasks are in FAILED state
  • The oldest PENDING task has been waiting for more than a week
  • RETRY tasks are cycling without ever succeeding and no one has investigated the root cause
  • The activity log shows the queue processor ran zero tasks in the last 24 hours despite having PENDING tasks

A reset does not mean starting from scratch on your tasks. It means pausing the queue processor, reviewing the current state methodically, fixing or removing each failed task, confirming the processor schedule is working correctly, and then resuming with a clean slate. Twenty minutes of deliberate cleanup beats three more days of an accumulating failure backlog that grows faster than you can address it.

    My queue has accumulated failures and I want to do a proper reset. Walk me through it: first show me the full current state, then help me triage each FAILED and RETRY task one by one (fix it, skip it, or reset it), then verify the queue processor is scheduled correctly, then confirm the first PENDING task runs successfully before I let the full queue run unattended again.

Common questions

My agent sent me a failure notification but the message was empty. Why?

The failure notification instruction was probably written to send “the error message” without specifying how to capture it. When a task fails, the error is in the tool response or exception that the agent received. The notification instruction needs to explicitly say “include the exact error text from the failed tool call” rather than a generic reference to “the error.” Ask your agent to update the failure notification template to capture and include the specific error text from whatever tool or API call failed.

My queue has 50 tasks. One failed and now the whole queue is stuck. How do I unblock it?

This is the missing “continue to next task on failure” instruction. Until you add it, you have to intervene manually: either fix the failed task and set it back to PENDING, or mark it SKIPPED so the queue can move past it. For immediate unblocking: mark the failed task SKIPPED, let the queue run through the remaining tasks, then investigate the failure separately afterward. Then add the “never stop the queue for a single failure” instruction to prevent the same situation on the next run.

How do I know how long a failed task was running before it failed?

Add start and end timestamps to your activity log entries. Each task log line should include a “started” entry when processing begins and a “completed” or “failed” entry when it finishes. The difference between the two timestamps is the runtime. For any task that is consistently failing after a long runtime, timeout is the likely cause. A task running for 8 minutes before hitting a timeout is a different diagnosis from one that fails in 3 seconds.
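Computing the runtime from the two log entries is a one-liner once the timestamps are parsed; this assumes ISO-8601 stamps as in the activity-log format described earlier:

```python
from datetime import datetime

def task_runtime_seconds(started_iso, finished_iso):
    """Seconds between a task's 'started' and 'completed'/'failed' entries."""
    started = datetime.fromisoformat(started_iso)
    finished = datetime.fromisoformat(finished_iso)
    return (finished - started).total_seconds()
```

A runtime in the hundreds of seconds before failure points toward a timeout; one under a few seconds points toward an immediate error like bad input or missing credentials.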

Can I set different retry limits for different tasks?

Yes. Add a max_attempts column to your task list schema. Tasks without a value in that column use the default retry limit from your queue processor instructions. Tasks with a specific value use that limit instead. A one-time critical task might have max_attempts set to 5. A routine cleanup task might have max_attempts set to 1 because a single failure is sufficient reason to flag it for manual review rather than burning retry attempts automatically.
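The lookup is a simple fallback to the default. This sketch assumes the max_attempts column is absent (or empty) when a task should use the queue-wide limit:

```python
DEFAULT_MAX_ATTEMPTS = 3

def retry_limit(task):
    """Per-task max_attempts wins; otherwise fall back to the default."""
    limit = task.get("max_attempts")
    return limit if limit is not None else DEFAULT_MAX_ATTEMPTS
```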

What is the difference between a task failing and a task timing out?

A failure is an explicit error: the tool call returned an error, the API returned a non-success status code, or the agent determined it could not complete the task. A timeout is different: the task was still running when the session’s time limit was reached, or the agent did not respond within the expected window. Timeouts should be handled by the failure logic the same way as errors, but they need explicit recognition in the failure instructions. Without it, a timed-out task may not get logged correctly because no explicit error was thrown.

My failure alert fires even for tasks that eventually succeed after one retry. How do I suppress that?

The notification should fire when the retry limit is exhausted, not on the first failure. If it is firing on the first attempt, the failure handling instruction probably says “notify on failure” rather than “notify when attempts reaches the retry limit.” Update the instruction to distinguish between a first failure (log and retry) and a retry-limit failure (log, mark FAILED, and notify). A task that fails once and succeeds on retry is not something you need to be woken up for.

Can I see a summary of all failures across the last week, not just overnight?

Yes, if your activity log has been running for a week. Ask your agent to read the full log and filter for failed entries with timestamps in the last seven days. Group them by task name so you can see whether any task is a recurring failure versus a one-off. Recurring failures on the same task are the ones worth investigating; they usually indicate a configuration problem that has not been fixed rather than a transient error.

What happens if the notification itself fails?

If the Telegram or Discord send fails when the agent tries to deliver a failure alert, you get a meta-failure: the failure was not communicated. The safest guard against this is redundancy: configure both Telegram and Discord as notification channels and send to both for critical failures. If one channel is down, the other delivers the alert. For non-critical tasks, a single channel is usually sufficient, but for anything running overnight that you absolutely need to know about, both channels together is cheap insurance.

How do I prevent a single task failure from blocking the whole queue?

Add one line to your queue processor instructions: “If a task fails or hits its retry limit, log it and move immediately to the next PENDING task. Never pause the queue because one task failed.” Without this explicit instruction, the agent has no basis for continuing after a failure and stops. With it, failures are handled as events to log and skip rather than events that halt everything. This single addition is the most important change you can make to an overnight queue. Everything else (retry logic, alerts, activity logs) is layered on top of this foundation.

My task failed but the status column still shows PENDING. Why did the status not update?

The status update is part of the task processing instructions, and it only runs if the agent got far enough to write the update. If a failure happened before the status write, the task stays PENDING. This is a common gap in hand-written queue processors. The fix is to write the status to RUNNING as the very first step when picking up a task, before any actual work begins. Then if a failure happens mid-task, the status shows RUNNING (not PENDING) and you know the task was picked up but did not complete. On the next queue run, RUNNING tasks more than N minutes old can be treated as stale and reset.
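A sketch of the status-first pattern plus the stale-RUNNING sweep, assuming epoch timestamps and a picked_up_at field (both names are illustrative):

```python
STALE_RUNNING_SECONDS = 30 * 60  # "N minutes" from above; tune to your tasks

def pick_up(task, now):
    """Write RUNNING *before* doing any work, so a mid-task crash
    still leaves evidence that the task was started."""
    task["status"] = "RUNNING"
    task["picked_up_at"] = now

def reset_stale_running(tasks, now):
    """On each queue run, return stale RUNNING tasks to PENDING."""
    for task in tasks:
        if (task.get("status") == "RUNNING"
                and now - task.get("picked_up_at", now) > STALE_RUNNING_SECONDS):
            task["status"] = "PENDING"
```

Together the two functions close the gap: a crash mid-task leaves a RUNNING marker rather than a misleading PENDING, and the sweep eventually recycles it.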

I have a task that always fails on the first attempt but succeeds on the second. Should I investigate or just increase the retry limit?

Investigate. A task that reliably fails once before succeeding is not working correctly; it is getting lucky on the second attempt. Common causes: a service that needs a warm-up call before it responds correctly, a timing dependency where the task is starting before a prerequisite is ready, or a rate limit that clears in the time between the first and second attempt. Finding the root cause and fixing it is far better than papering over it with automatic retries. A fragile task that “works” via retry is one outage away from failing all three attempts simultaneously.

Can I get a weekly summary of all failures instead of individual alerts for each one?

Yes, and for lower-priority tasks this is often the right approach to avoid notification fatigue. Create two notification tiers in your failure handling instructions: critical tasks send an immediate individual alert, standard tasks write to a weekly summary file. A cron job on Sunday evening reads the weekly summary file and sends it as a single Telegram message. The weekly summary covers the full batch of standard task failures in one message rather than sending individual alerts throughout the week.

What should I do if I wake up to 40 failed tasks after an API outage?

Bulk reset rather than individual triage. If the failure reason is clearly the same across all tasks (an API that was down, a credential that expired and has since been renewed), reset all FAILED tasks to PENDING in one operation after fixing the underlying issue. Then run the queue manually for a few tasks before letting it run automatically to confirm the fix worked. The key question before bulk-resetting: are all these failures genuinely from the same root cause? If yes, bulk reset and move on. If any tasks failed for a different reason, triage those separately before resetting the rest of the batch.


Queue Commander

Full queue system with failure handling, retry logic, and morning reports

The complete queue schema, processor prompt with built-in failure handling, activity log setup, morning report cron job, and the retry-with-delay pattern. Drop it in and your overnight automation reports to you instead of going dark.

Get Queue Commander for $67 →

Keep Reading:

  • Cron Commander: My OpenClaw cron job ran twice or never ran at all. The three schedule types, their failure modes, and how to verify a job is scheduled correctly before relying on it.
  • Cron Commander: How to know what your agent actually did while you were away. Building an activity log that shows exactly what ran, when, and what it produced.
  • Queue Commander: Task B ran before Task A finished and everything broke. How to add dependency tracking so tasks run in the right order every time.