Official Verified

agent-health-diagnostics

Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/agenthyjack/agent-health-diagnostics

Agent Health Diagnostics

Scripts available in the Collective Skills repo

Overview

When an OpenClaw agent misbehaves (spamming messages, going dark, burning API credits, or looping on dead channels), this skill provides the diagnostic playbook. It covers the 4 most common failure modes, with exact commands to diagnose and fix each one.

Battle-tested across a 6-agent deployment spanning 3 hosts (Windows + Linux + Proxmox).

When to Use This Skill

Use when you observe any of these symptoms:

  • Agent sends repeated heartbeat/status messages to Telegram, Discord, etc.
  • Agent goes silent and stops responding to messages even though the gateway shows "active"
  • Logs show 429 Too many tokens or rate_limit errors
  • Channel connection loops: auto-restart attempt 1/10, 2/10, etc.
  • Memory search errors: input length exceeds context length

The 4 Failure Modes

1. Heartbeat Spam

Symptom: Agent sends repeated messages every N minutes.

Root cause: Heartbeat interval set too low (10m = 144 messages/day), combined with a verbose prompt that always generates output instead of returning HEARTBEAT_OK.

Quick fix:

# Check interval
grep -A5 heartbeat ~/.openclaw/openclaw.json

# Fix: set to 30m minimum, simplify prompt to checklist + HEARTBEAT_OK default
# Then restart gateway
openclaw gateway restart

Prevention: Never set heartbeat below 20 minutes. Heartbeat prompts should CHECK things, not CREATE things.
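
The arithmetic behind the 144 messages/day figure is easy to repeat for any interval. A minimal sketch, assuming the worst case where every heartbeat produces a message; `heartbeats_per_day` is a hypothetical helper name, not part of the OpenClaw CLI:

```shell
# Hypothetical helper (not an OpenClaw command): worst-case messages per day
# for a heartbeat interval given in minutes, assuming every heartbeat
# produces output instead of HEARTBEAT_OK.
heartbeats_per_day() {
  local interval_min="$1"
  echo $(( 24 * 60 / $interval_min ))
}

heartbeats_per_day 10   # 144
heartbeats_per_day 30   # 48
heartbeats_per_day 60   # 24
```

Moving from a 10-minute to a 30-minute interval cuts the worst case from 144 to 48 messages per day, before any prompt simplification.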

2. API Rate Limit Cascade

Symptom: All models fail, agent goes dark.

Root cause: Heartbeat + N crons = (N+1) API calls per interval. When that load exceeds the provider's TPM (tokens per minute) limit, all fallback models are exhausted simultaneously.

Quick fix:

# Check for rate limits
journalctl -u <service> --since '1h ago' | grep '429\|rate_limit'

# Count your crons (each burns tokens)
openclaw cron list

# Fix: reduce heartbeat to 30-60m, disable non-essential crons, stagger schedules

Prevention: Calculate token budget before adding crons. Each run ≈ 2K-10K tokens. Route heartbeats to cheap/local models.
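
The budget calculation above can be sketched as a rough worst-case check. It assumes the heartbeat and all N crons fire within the same minute. The function name and the 40000 TPM figure are illustrative assumptions, so substitute your provider's actual limit.

```shell
# Hypothetical helper: worst-case tokens consumed when the heartbeat and
# all crons fire in the same minute, compared against a provider TPM limit.
#   $1 = number of crons (N)
#   $2 = estimated tokens per run (roughly 2K-10K per the note above)
#   $3 = provider tokens-per-minute limit (assumed value, check your plan)
check_token_budget() {
  local crons="$1" tokens_per_run="$2" tpm_limit="$3"
  local calls=$(( crons + 1 ))               # heartbeat + N crons
  local burst=$(( calls * tokens_per_run ))  # tokens if all fire together
  if [ "$burst" -gt "$tpm_limit" ]; then
    echo "OVER: $calls calls x $tokens_per_run = $burst tokens > $tpm_limit TPM"
    return 1
  fi
  echo "OK: $calls calls x $tokens_per_run = $burst tokens <= $tpm_limit TPM"
}

check_token_budget 3 10000 40000           # 4 calls, 40000 tokens: within limit
check_token_budget 5 10000 40000 || true   # 6 calls, 60000 tokens: over limit
```

Staggering schedules lowers the real burst below this estimate, which is why the check is deliberately pessimistic.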

3. Channel Death Loop

Symptom: Logs show repeated auto-restart attempt N/10 for IRC/Discord/etc.

Root cause: The target server is unreachable, so the health monitor restarts the channel, the connection fails again, and the cycle repeats. Each restart may trigger model calls, burning API tokens.

Quick fix:

# Check for loops
journalctl -u <service> --since '1h ago' | grep 'auto-restart\|timed out'

# Test connectivity
nc -zv -w 5 <target-ip> <target-port>

# Fix: disable the broken channel in openclaw.json
# channels.<name>.enabled = false
openclaw gateway restart

Prevention: Test connectivity BEFORE enabling channels. Disable channels you can't reach.
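
To tell a death loop from a one-off reconnect, it can help to see how far the restart counter has climbed. A minimal sketch that reads log lines on stdin and prints the highest attempt number seen. It assumes the log text matches the `auto-restart attempt N/10` pattern quoted above. `max_restart_attempt` is a hypothetical helper name, and the sample log lines are made up for illustration.

```shell
# Hypothetical helper: print the highest "auto-restart attempt N/M" counter
# found on stdin, or 0 if no such line exists.
max_restart_attempt() {
  grep -oE 'auto-restart attempt [0-9]+/[0-9]+' \
    | awk '{ split($3, a, "/"); if (a[1] + 0 > max) max = a[1] + 0 } END { print max + 0 }'
}

# Illustrative input only; real log lines may carry extra fields.
printf '%s\n' \
  'irc: auto-restart attempt 1/10' \
  'irc: auto-restart attempt 2/10' \
  'irc: timed out' \
  | max_restart_attempt   # prints 2

# Against the service journal (same placeholder as above):
#   journalctl -u <service> --since '1h ago' | max_restart_attempt
```

A counter that keeps climbing toward 10 within the hour suggests the channel should be disabled rather than left to retry.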

Metadata

Stars: 3917
Views: 0
Updated: 2026-04-08

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-agenthyjack-agent-health-diagnostics": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags

#diagnostics #monitoring #health #heartbeat #troubleshooting #multi-agent #ops
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.