Notes on a stale provider cooldown in OpenClaw

TL;DR

My OpenClaw gateway went silent for three days after a usage spike, even though I could still chat with the same provider through its web interface. The API was serving requests, but OpenClaw had stored a “next reset in 6 days” message as a literal blockedUntil timestamp and refused to try the profile again. Without a fallback model configured, the probe-during-cooldown code never fires, so the profile stays blocked. I cleared the field in auth-state.json and restarted the gateway. I’ve also added an hourly watchdog, though I haven’t seen it fire yet.

Motivation

Family chat went quiet and daily cron jobs stopped posting, though the gateway was up. The provider’s weekly cap in openclaw models status showed something like Week 44% left ⏱5d 17h, so there was credit available. I could open the provider’s web app in a browser and chat normally. But every cron run in the gateway log showed decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes).

Last time this happened, the heartbeat was firing every 30 minutes. This time the heartbeat was off (the trajectory log had zero events for three days), so this wasn’t a runaway client. The block was local state.

Tracing where the belief is stored

A direct capability call worked:

openclaw infer model run --prompt "say hello in one word"
# Hello

So the API was fine. The belief had to live somewhere local. It was in auth-state.json:

{
  "usageStats": {
    "<provider>:<account>": {
      "blockedUntil": 1780846982712,
      "blockedReason": "subscription_limit",
      "blockedSource": "wham",
      "errorCount": 1,
      "failureCounts": { "rate_limit": 1 },
      "lastFailureAt": 1780401970719
    }
  }
}
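
For reference, the millisecond timestamps in that entry can be converted to UTC with the standard library. This is a minimal sketch; `ms_to_utc` is a helper name chosen here, not part of OpenClaw.

```python
from datetime import datetime, timezone

def ms_to_utc(ms: int) -> datetime:
    """Convert a millisecond Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

# Values copied from the auth-state.json excerpt above.
blocked_until = ms_to_utc(1780846982712)
last_failure = ms_to_utc(1780401970719)

print(blocked_until.isoformat(timespec="seconds"))  # 2026-06-07T15:43:02+00:00
print(last_failure.isoformat(timespec="seconds"))   # 2026-06-02T12:06:10+00:00
```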

That blockedUntil decodes to four days in the future. It was set during the original spike, when the upstream returned “You’ve reached your subscription usage limit. Next reset in 6 days, Jun 7 at 3:43 PM UTC”. OpenClaw stored that reset time verbatim as the block expiry. But the provider’s weekly cap is a rolling window: it recovers gradually as the oldest usage ages out of the window, not in one step at the “next reset” time. The API was serving small requests against the partially recovered cap days before that date.
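
To make the rolling-window point concrete, here is a toy model. The 7-day window, the cap of 100 units, and the usage events are all made-up numbers for illustration; I don’t know how the provider actually accounts for usage.

```python
DAY_MS = 24 * 3600 * 1000
WINDOW_MS = 7 * DAY_MS  # assumed window length

def used_in_window(events, now_ms, window_ms=WINDOW_MS):
    """Sum the usage of events that are still inside the rolling window."""
    return sum(amount for ts, amount in events if now_ms - ts < window_ms)

# Hypothetical usage: (timestamp, units) spent on day 0, day 1 and day 6.
events = [(0 * DAY_MS, 40), (1 * DAY_MS, 30), (6 * DAY_MS, 30)]
CAP = 100  # hypothetical weekly cap

for day in (6, 7, 8, 13):
    remaining = CAP - used_in_window(events, day * DAY_MS)
    print(f"day {day}: {remaining} units left")
# day 6: 0 units left
# day 7: 40 units left
# day 8: 70 units left
# day 13: 100 units left
```

In this model the cap is exhausted on day 6, but 40% of it is back one day later. A block that lasts until the cap is completely fresh overshoots by days.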

Why nothing tried to recover

OpenClaw does have a probe-during-cooldown path. In model-fallback-DRgKirrj.js:

function shouldProbePrimaryDuringCooldown(params) {
  if (!params.isPrimary || !params.hasFallbackCandidates) return false;
  ...
}

If a fallback model is configured, OpenClaw periodically retries the primary during cooldown, detects the recovery, and switches back. With fallbacks: [], the short-circuit on hasFallbackCandidates means no probe ever runs. The profile stays blocked until blockedUntil arrives, even if the API recovered the day after the failure.
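
The effect of that short-circuit can be put into numbers with a simplified timing model. This is not OpenClaw’s code: `usable_again_at`, the fixed probe interval and the example times are assumptions made for illustration, and the real function has further checks that are elided above.

```python
import math

HOUR_MS = 3600 * 1000
DAY_MS = 24 * HOUR_MS

def usable_again_at(blocked_until_ms, recovered_at_ms, cooldown_start_ms,
                    has_fallback_candidates, probe_interval_ms):
    """Return when the primary is next used, under a simplified model."""
    if not has_fallback_candidates:
        # No probe ever runs, so only the stored timestamp ends the block.
        return blocked_until_ms
    # Model probes as firing at cooldown_start + k * interval (k = 1, 2, ...);
    # the first probe at or after the real recovery succeeds.
    elapsed = max(recovered_at_ms - cooldown_start_ms, 0)
    k = max(math.ceil(elapsed / probe_interval_ms), 1)
    return min(cooldown_start_ms + k * probe_interval_ms, blocked_until_ms)

# Hypothetical case: blocked for 6 days, API actually usable after 1 day.
common = dict(blocked_until_ms=6 * DAY_MS, recovered_at_ms=1 * DAY_MS,
              cooldown_start_ms=0, probe_interval_ms=HOUR_MS)
print(usable_again_at(has_fallback_candidates=True, **common) // HOUR_MS)   # 24
print(usable_again_at(has_fallback_candidates=False, **common) // HOUR_MS)  # 144
```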

There’s an open upstream issue (openclaw/openclaw#54278, filed in March) describing exactly this pattern. It proposes a separate quota_wait state with periodic probing regardless of fallback configuration. Not implemented yet.
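
A rough sketch of how I read that proposal: a wait state that probes on a timer and never consults the fallback list. The class, field names and interval below are hypothetical and are not taken from the issue or from OpenClaw.

```python
from dataclasses import dataclass
from typing import Callable

HOUR_MS = 3600 * 1000
DAY_MS = 24 * HOUR_MS

@dataclass
class QuotaWait:
    blocked_until_ms: int
    last_probe_ms: int
    probe_interval_ms: int  # hypothetical setting

def step(state: QuotaWait, now_ms: int, probe: Callable[[], bool]) -> str:
    """Run one scheduling tick and return 'ready' or 'waiting'."""
    if now_ms >= state.blocked_until_ms:
        return "ready"      # the stored expiry has passed
    if now_ms - state.last_probe_ms < state.probe_interval_ms:
        return "waiting"    # too soon to probe again
    state.last_probe_ms = now_ms
    return "ready" if probe() else "waiting"

state = QuotaWait(blocked_until_ms=6 * DAY_MS, last_probe_ms=0,
                  probe_interval_ms=HOUR_MS)
print(step(state, HOUR_MS // 2, lambda: True))   # waiting (probe not due yet)
print(step(state, HOUR_MS, lambda: False))       # waiting (probe ran and failed)
print(step(state, 2 * HOUR_MS, lambda: True))    # ready (probe ran and succeeded)
```

Here `probe` stands in for a small real request such as the infer call shown earlier.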

The fix

Two things, neither of which requires adding a fallback.

First, clear the stale block:

import json

p = "/path/to/.openclaw/agents/main/agent/auth-state.json"
with open(p) as f:
    state = json.load(f)

# Drop the block and its failure counters for the affected profile.
prof = state["usageStats"]["<provider>:<account>"]
for k in ["blockedUntil", "blockedReason", "blockedSource",
          "errorCount", "failureCounts"]:
    prof.pop(k, None)

with open(p, "w") as f:
    json.dump(state, f, indent=2)

Restart the gateway and the profile is callable again.
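
A quick way to confirm the file no longer carries a live block is to list every profile whose blockedUntil is still in the future. This is a sketch; `blocked_profiles` is a helper name chosen here, and the sample dictionaries only mirror the excerpt shown earlier.

```python
def blocked_profiles(state: dict, now_ms: int) -> dict:
    """Return {profile_id: blockedUntil} for profiles still blocked at now_ms."""
    result = {}
    for profile_id, stats in (state.get("usageStats") or {}).items():
        blocked_until = stats.get("blockedUntil")
        if isinstance(blocked_until, (int, float)) and blocked_until > now_ms:
            result[profile_id] = blocked_until
    return result

# Arbitrary "now" between the recorded failure and the stored expiry.
now_ms = 1780500000000
before = {"usageStats": {"<provider>:<account>": {
    "blockedUntil": 1780846982712, "blockedReason": "subscription_limit"}}}
after = {"usageStats": {"<provider>:<account>": {"lastFailureAt": 1780401970719}}}

print(blocked_profiles(before, now_ms))  # {'<provider>:<account>': 1780846982712}
print(blocked_profiles(after, now_ms))   # {}
```

Against the real file, load it with json.load and pass int(time.time() * 1000) as now_ms.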

Second, an hourly cron watchdog so I don’t have to do that manually next time:

#!/usr/bin/env python3
import json, os, shutil, time

AUTH_STATE = "/path/to/.openclaw/agents/main/agent/auth-state.json"
PROVIDER_PREFIX = "<provider>:"
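FAR_FUTURE_MS = 12 * 3600 * 1000    # only clear blocks that end more than 12 h from now
FAILURE_GRACE_MS = 6 * 3600 * 1000  # and only when the last failure is at least 6 h old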

now_ms = int(time.time() * 1000)
with open(AUTH_STATE) as f:
    state = json.load(f)

cleared = []
for profile_id, stats in (state.get("usageStats") or {}).items():
    if not profile_id.startswith(PROVIDER_PREFIX): continue
    if stats.get("blockedReason") != "subscription_limit": continue
    blocked_until = stats.get("blockedUntil")
    if not blocked_until or blocked_until <= now_ms + FAR_FUTURE_MS: continue
    if now_ms - stats.get("lastFailureAt", 0) < FAILURE_GRACE_MS: continue
    for k in ["blockedUntil", "blockedReason", "blockedSource",
              "errorCount", "failureCounts"]:
        stats.pop(k, None)
    cleared.append(profile_id)

if cleared:
    # Back up the unmodified file on disk, then swap in the new state atomically.
    shutil.copy(AUTH_STATE, f"{AUTH_STATE}.bak.{int(time.time())}")
    tmp = AUTH_STATE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp, AUTH_STATE)
    print(f"cleared: {cleared}")

Cron:

17 * * * * /path/to/clear-stale-block.py >> /tmp/openclaw/clear-stale-block.log 2>&1

The script only clears blocks where:

- the reason is subscription_limit, not auth or other failure types
- blockedUntil is set more than 12 hours in the future (so a normal short cap cooldown stays intact)
- the last failure was more than 6 hours ago (giving short-window caps time to refresh on their own)
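
The same three conditions can be pulled out into a pure function and exercised with a few cases. This is a sketch that mirrors the watchdog’s checks; `is_stale_block` and the sample entries are made up for illustration, and "auth" is only a stand-in for some other reason value.

```python
HOUR_MS = 3600 * 1000
FAR_FUTURE_MS = 12 * HOUR_MS
FAILURE_GRACE_MS = 6 * HOUR_MS

def is_stale_block(stats: dict, now_ms: int) -> bool:
    """True if the watchdog's three conditions all hold for this entry."""
    if stats.get("blockedReason") != "subscription_limit":
        return False
    blocked_until = stats.get("blockedUntil")
    if not blocked_until or blocked_until <= now_ms + FAR_FUTURE_MS:
        return False
    return now_ms - stats.get("lastFailureAt", 0) >= FAILURE_GRACE_MS

now = 1_000 * HOUR_MS  # arbitrary reference time
cases = {
    "stale": {"blockedReason": "subscription_limit",
              "blockedUntil": now + 96 * HOUR_MS, "lastFailureAt": now - 24 * HOUR_MS},
    "short cooldown": {"blockedReason": "subscription_limit",
                       "blockedUntil": now + 2 * HOUR_MS, "lastFailureAt": now - 24 * HOUR_MS},
    "recent failure": {"blockedReason": "subscription_limit",
                       "blockedUntil": now + 96 * HOUR_MS, "lastFailureAt": now - 1 * HOUR_MS},
    "other reason": {"blockedReason": "auth",
                     "blockedUntil": now + 96 * HOUR_MS, "lastFailureAt": now - 24 * HOUR_MS},
}
for name, stats in cases.items():
    print(name, is_stale_block(stats, now))
# stale True
# short cooldown False
# recent failure False
# other reason False
```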

Why I missed it for three days

The OpenClaw weekly stat in models status shows real numbers from the API, so it looks healthy. The web chat works because that’s a different rate-limit surface within the same subscription. The gateway is up, the heartbeat is off, the crons are scheduled. From the outside everything looked fine, and the only signals were:

- gateway logs show skip_candidate decisions on every cron tick
- models status shows [cooldown Xd] against the profile id
- auth-state.json has a blockedUntil set far in the future
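
The first of those signals is easy to count mechanically. This sketch matches only the two substrings quoted from the gateway log earlier; the sample lines are placeholders built from that fragment, and real entries carry more fields.

```python
from typing import Iterable

def count_cooldown_skips(lines: Iterable[str]) -> int:
    """Count log lines that record a candidate skipped because of a cooldown."""
    return sum(
        1 for line in lines
        if "decision=skip_candidate" in line and "is in cooldown" in line
    )

# Placeholder input, not real log output.
sample = [
    "decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes)",
    "(some unrelated log line)",
    "decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes)",
]
print(count_cooldown_skips(sample))  # 2
```

An open file object can be passed directly in place of `sample`, since the function only iterates over lines.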

Now, if the gateway goes quiet for more than a day with the heartbeat disabled, this is the first thing I check.