TL;DR
My OpenClaw gateway went silent for three days after a usage spike, even though I could still chat with the same provider through its web interface. The API was serving requests, but OpenClaw had stored a “next reset in 6 days” message as a literal blockedUntil timestamp and refused to try the profile again. Without a fallback model configured, the probe-during-cooldown code never fires, so the profile stays blocked. I cleared the block fields in auth-state.json and restarted the gateway. I’ve also added an hourly watchdog; it hasn’t fired yet.
Motivation
Family chat went quiet and daily cron jobs stopped posting, though the gateway was up. The provider’s weekly cap in openclaw models status showed something like Week 44% left ⏱5d 17h, so there was credit available. I could open the provider’s web app in a browser and chat normally. But every cron run in the gateway log showed decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes).
Last time this happened the heartbeat was firing every 30 minutes. The heartbeat was off this time (the trajectory log had zero events for three days), so this wasn’t a runaway client. The block was local state.
Tracing where the belief is stored
A direct capability call worked:
openclaw infer model run --prompt "say hello in one word"
# Hello
So the API was fine. The belief had to live somewhere local. It was in auth-state.json:
{
  "usageStats": {
    "<provider>:<account>": {
      "blockedUntil": 1780846982712,
      "blockedReason": "subscription_limit",
      "blockedSource": "wham",
      "errorCount": 1,
      "failureCounts": { "rate_limit": 1 },
      "lastFailureAt": 1780401970719
    }
  }
}
That blockedUntil decodes to four days in the future. It was set when the upstream returned You've reached your subscription usage limit. Next reset in 6 days, Jun 7 at 3:43 PM UTC during the original spike. OpenClaw stored the timestamp verbatim. The provider’s weekly cap is a rolling window. The cap recovers as the oldest usage ages out of the window, not in one step at the “next reset” time. The API served small requests against the partially-recovered cap days before that date.
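To sanity-check a stored value like this, it helps to decode the epoch-millisecond fields into UTC. A minimal sketch using the two values from the JSON above; the helper name `ms_to_utc` is mine and not part of OpenClaw:

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    # Adding a timedelta avoids float rounding on the millisecond part.
    return EPOCH + timedelta(milliseconds=ms)

blocked_until = ms_to_utc(1780846982712)
last_failure = ms_to_utc(1780401970719)

print(blocked_until.isoformat(timespec="seconds"))  # 2026-06-07T15:43:02+00:00
print(last_failure.isoformat(timespec="seconds"))   # 2026-06-02T12:06:10+00:00
print(blocked_until - last_failure)                 # gap between the failure and the stored reset
```

The decoded blockedUntil lines up with the "Jun 7 at 3:43 PM UTC" in the upstream message, which is a quick way to confirm the reset time was copied straight into local state.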
Why nothing tried to recover
OpenClaw does have a probe-during-cooldown path. In model-fallback-DRgKirrj.js:
function shouldProbePrimaryDuringCooldown(params) {
  if (!params.isPrimary || !params.hasFallbackCandidates) return false;
  ...
}
If a fallback model is configured, OpenClaw periodically retries the primary during cooldown, detects the recovery, and switches back. With fallbacks: [], the short-circuit on hasFallbackCandidates means no probe ever runs. The profile stays blocked until blockedUntil arrives, even if the API recovered the day after the failure.
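To make the short-circuit concrete, here is a toy Python model of that first guard. It is a sketch of the behaviour described above, not OpenClaw's code: the `ProbeParams` class and `has_fallback_candidates` helper are hypothetical names, the elided remainder of the real function is not modelled, and I am assuming the flag simply reflects whether the fallbacks list is non-empty.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ProbeParams:
    # Hypothetical stand-in for the params object in the quoted JS.
    is_primary: bool
    has_fallback_candidates: bool

def has_fallback_candidates(fallbacks: List[str]) -> bool:
    # ASSUMPTION: the flag is true exactly when some fallback is configured.
    return len(fallbacks) > 0

def should_probe_primary_during_cooldown(params: ProbeParams) -> bool:
    # Mirrors only the first guard quoted from the JS source.
    if not params.is_primary or not params.has_fallback_candidates:
        return False
    # ASSUMPTION: the elided later checks are treated as passing.
    return True

no_fallbacks = ProbeParams(is_primary=True,
                           has_fallback_candidates=has_fallback_candidates([]))
one_fallback = ProbeParams(is_primary=True,
                           has_fallback_candidates=has_fallback_candidates(["other-model"]))

print(should_probe_primary_during_cooldown(no_fallbacks))  # False
print(should_probe_primary_during_cooldown(one_fallback))  # True
```

With an empty fallbacks list the function returns before any time-based check could run, so how long the profile has been in cooldown never matters.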
There’s an open upstream issue (openclaw/openclaw#54278, filed in March) describing exactly this pattern. It proposes a separate quota_wait state with periodic probing regardless of fallback configuration. Not implemented yet.
The fix
Two things, neither of which requires adding a fallback.
First, clear the stale block:
import json

p = "/path/to/.openclaw/agents/main/agent/auth-state.json"
with open(p) as f:
    state = json.load(f)

# Drop only the block-related fields; leave the rest of the profile stats alone.
prof = state["usageStats"]["<provider>:<account>"]
for k in ["blockedUntil", "blockedReason", "blockedSource",
          "errorCount", "failureCounts"]:
    prof.pop(k, None)

with open(p, "w") as f:
    json.dump(state, f, indent=2)
Restart the gateway and the profile is callable again.
Second, an hourly cron watchdog so I don’t have to do that manually next time:
#!/usr/bin/env python3
import json, os, shutil, time

AUTH_STATE = "/path/to/.openclaw/agents/main/agent/auth-state.json"
PROVIDER_PREFIX = "<provider>:"
FAR_FUTURE_MS = 12 * 3600 * 1000    # only touch blocks ending more than 12h from now
FAILURE_GRACE_MS = 6 * 3600 * 1000  # and only if the last failure is at least 6h old

now_ms = int(time.time() * 1000)
with open(AUTH_STATE) as f:
    state = json.load(f)

cleared = []
for profile_id, stats in (state.get("usageStats") or {}).items():
    if not profile_id.startswith(PROVIDER_PREFIX):
        continue
    if stats.get("blockedReason") != "subscription_limit":
        continue
    blocked_until = stats.get("blockedUntil")
    if not blocked_until or blocked_until <= now_ms + FAR_FUTURE_MS:
        continue
    if now_ms - stats.get("lastFailureAt", 0) < FAILURE_GRACE_MS:
        continue
    for k in ["blockedUntil", "blockedReason", "blockedSource",
              "errorCount", "failureCounts"]:
        stats.pop(k, None)
    cleared.append(profile_id)

if cleared:
    # Back up the original, then write a temp file and swap it in atomically.
    shutil.copy(AUTH_STATE, f"{AUTH_STATE}.bak.{int(time.time())}")
    tmp = AUTH_STATE + ".tmp"
    with open(tmp, "w") as f:  # close (and flush) before the rename
        json.dump(state, f, indent=2)
    os.replace(tmp, AUTH_STATE)
    print(f"cleared: {cleared}")
Cron:
17 * * * * /path/to/clear-stale-block.py >> /tmp/openclaw/clear-stale-block.log 2>&1
The script only clears blocks where:
- the reason is subscription_limit, not auth or other failure types
- blockedUntil is set more than 12 hours in the future (so a normal short cap cooldown stays intact)
- the last failure was more than 6 hours ago (giving short-window caps time to refresh on their own)
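Those three conditions can be pulled out into a pure function, which makes it easy to check them against sample entries without touching the real file. A sketch under the same thresholds as the watchdog; the name `is_stale_block` is mine, and the sample entry reuses the values from the auth-state.json excerpt earlier:

```python
FAR_FUTURE_MS = 12 * 3600 * 1000
FAILURE_GRACE_MS = 6 * 3600 * 1000
HOUR_MS = 3600 * 1000

def is_stale_block(stats: dict, now_ms: int) -> bool:
    """True if this profile's block meets all three clearing conditions."""
    if stats.get("blockedReason") != "subscription_limit":
        return False  # leave auth and other failure types alone
    blocked_until = stats.get("blockedUntil")
    if not blocked_until or blocked_until <= now_ms + FAR_FUTURE_MS:
        return False  # short cooldowns expire on their own
    if now_ms - stats.get("lastFailureAt", 0) < FAILURE_GRACE_MS:
        return False  # too soon after the failure to second-guess it
    return True

entry = {
    "blockedUntil": 1780846982712,
    "blockedReason": "subscription_limit",
    "lastFailureAt": 1780401970719,
}
failure = entry["lastFailureAt"]

print(is_stale_block(entry, failure + 1 * HOUR_MS))   # False: inside the 6h grace window
print(is_stale_block(entry, failure + 24 * HOUR_MS))  # True: a day later, block still days away
print(is_stale_block({**entry, "blockedReason": "auth"}, failure + 24 * HOUR_MS))  # False
```

Taking `now_ms` as a parameter rather than reading the clock inside the function is what makes the cases above reproducible.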
Why I missed it for three days
The OpenClaw weekly stat in models status shows real numbers from the API, so it looks healthy. The web chat works because that’s a different rate-limit surface within the same subscription. The gateway is up, the heartbeat is off, the crons are scheduled. From the outside everything looked fine, and the only signals were:
- gateway logs show skip_candidate decisions on every cron tick
- models status shows [cooldown Xd] against the profile id
- auth-state.json has a blockedUntil set far in the future
If I see the gateway quiet for more than a day after disabling the heartbeat, this is the first thing I check now.
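The first of those signals is easy to check mechanically. A rough sketch that counts cooldown skips in a batch of log lines; it matches only the two substrings quoted earlier in this post, the function name is mine, and the sample lines are synthetic rather than real gateway output:

```python
from typing import Iterable

def count_cooldown_skips(lines: Iterable[str]) -> int:
    """Count log lines that record a skip caused by a provider cooldown."""
    return sum(
        1
        for line in lines
        if "decision=skip_candidate" in line and "is in cooldown" in line
    )

# Synthetic sample: two skip lines in the shape quoted above, plus one
# unrelated line (hypothetical) that should not be counted.
sample = [
    "decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes)",
    "some unrelated gateway message",
    "decision=skip_candidate ... Provider <id> is in cooldown (suspending lanes)",
]

print(count_cooldown_skips(sample))  # 2
```

In practice I would feed it an open log file, for example `count_cooldown_skips(open(path))`, where `path` is wherever your gateway log lives. A non-zero count on every cron tick while the API answers direct calls points at local state rather than the provider.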
Further reading
- OpenClaw repository
- openclaw/openclaw#54278, an open feature request describing the underlying quota_wait vs reauth_required confusion.
- My post on reducing OpenClaw token usage is what got the heartbeat under control in the first place. Without that, I’d be clearing this block repeatedly.