agent-health-diagnostics
Diagnose and fix the 4 most common OpenClaw agent failures — heartbeat spam, API rate limit cascades, channel death loops, and memory/embedding errors. Battle-tested across a 6-agent multi-host deployment.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/agenthyjack/agent-health-diagnostics
Agent Health Diagnostics
Scripts available in the Collective Skills repo
Overview
When an OpenClaw agent misbehaves — spamming messages, going dark, burning API credits, or looping on dead channels — this skill provides the diagnostic playbook. Covers the 4 most common failure modes with exact commands to diagnose and fix each one.
Battle-tested across a 6-agent deployment spanning 3 hosts (Windows + Linux + Proxmox).
When to Use This Skill
Use when you observe any of these symptoms:
- Agent sending repeated heartbeat/status messages to Telegram/Discord/etc.
- Agent goes silent despite gateway showing "active"
- Logs show `429 Too many tokens` or `rate_limit` errors
- Channel connection loops: `auto-restart attempt 1/10`, `2/10`, etc.
- Memory search errors: `input length exceeds context length`
- Gateway says "active" but agent doesn't respond to messages
The 4 Failure Modes
1. Heartbeat Spam
Symptom: Agent sends repeated messages every N minutes.
Root cause: Heartbeat interval too low (10m = 144 messages/day) + verbose prompt that always generates output instead of HEARTBEAT_OK.
Quick fix:
# Check interval
grep -A5 heartbeat ~/.openclaw/openclaw.json
# Fix: set to 30m minimum, simplify prompt to checklist + HEARTBEAT_OK default
# Then restart gateway
openclaw gateway restart
Prevention: Never set heartbeat below 20 minutes. Heartbeat prompts should CHECK things, not CREATE things.
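The 144 messages/day figure above follows from simple arithmetic. A minimal sketch, assuming the worst case in which every heartbeat produces a visible message; the function name `messages_per_day` is hypothetical and not part of OpenClaw:

```shell
#!/usr/bin/env bash
# Worst-case heartbeat messages per day for an interval given in minutes.
# Assumes every heartbeat emits a message (i.e. the prompt never returns HEARTBEAT_OK).
messages_per_day() {
  local interval_min="$1"
  echo $(( 24 * 60 / interval_min ))
}

# Compare a few candidate intervals.
for interval in 10 20 30 60; do
  echo "${interval}m -> $(messages_per_day "$interval") messages/day"
done
```

At 10m this reproduces the 144 messages/day quoted in the root cause; at 30m the worst case drops to 48.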
2. API Rate Limit Cascade
Symptom: All models fail, agent goes dark.
Root cause: Heartbeat + N crons = (N+1) API calls per interval. Exceeds provider TPM limit → all fallbacks exhausted simultaneously.
Quick fix:
# Check for rate limits
journalctl -u <service> --since '1h ago' | grep '429\|rate_limit'
# Count your crons (each burns tokens)
openclaw cron list
# Fix: reduce heartbeat to 30-60m, disable non-essential crons, stagger schedules
Prevention: Calculate token budget before adding crons. Each run ≈ 2K-10K tokens. Route heartbeats to cheap/local models.
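The token budget check can be sketched as a back-of-the-envelope calculation. This is an illustration only: it assumes the heartbeat and all crons fire in the same minute, uses the upper 10K tokens-per-run estimate from above, and the 40K tokens-per-minute limit is a made-up figure, so substitute your provider's actual limit. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Worst-case tokens consumed when the heartbeat and all crons fire together:
# (N + 1) runs, each costing tokens_per_run.
burst_tokens() {
  local cron_count="$1" tokens_per_run="$2"
  echo $(( (cron_count + 1) * tokens_per_run ))
}

# Exit status 0 if the burst fits under the tokens-per-minute (TPM) limit.
fits_tpm_budget() {
  local cron_count="$1" tokens_per_run="$2" tpm_limit="$3"
  [ "$(burst_tokens "$cron_count" "$tokens_per_run")" -le "$tpm_limit" ]
}

# Example: 5 crons, 10000 tokens per run, hypothetical 40000 TPM limit.
if fits_tpm_budget 5 10000 40000; then
  echo "within budget"
else
  echo "over budget: $(burst_tokens 5 10000) tokens in one burst"
fi
```

With these example numbers the burst is 60000 tokens, which exceeds the assumed limit. Staggering schedules lowers the number of runs that land in the same minute, which is why it appears in the fix above.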
3. Channel Death Loop
Symptom: Logs show repeated auto-restart attempt N/10 for IRC/Discord/etc.
Root cause: Target server unreachable → health monitor restarts → fails again → loop. Each restart may trigger model calls, burning API tokens.
Quick fix:
# Check for loops
journalctl -u <service> --since '1h ago' | grep 'auto-restart\|timed out'
# Test connectivity
nc -zv -w 5 <target-ip> <target-port>
# Fix: disable the broken channel in openclaw.json
# channels.<name>.enabled = false
openclaw gateway restart
Prevention: Test connectivity BEFORE enabling channels. Disable channels you can't reach.
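To judge how far a loop has progressed, it can help to count the restart attempts in a log excerpt. A minimal sketch: the `auto-restart attempt N/10` pattern comes from the symptom above, while the function name and the sample log lines are hypothetical and only illustrate the shape of the input.

```shell
#!/usr/bin/env bash
# Count lines on stdin that contain an "auto-restart attempt N/M" marker.
# grep -c exits non-zero when nothing matches; "|| true" keeps the exit status clean
# while still printing 0.
count_restart_attempts() {
  grep -c 'auto-restart attempt [0-9][0-9]*/[0-9][0-9]*' || true
}

# Illustrative input only; real log lines will carry different prefixes.
sample_log='channel irc: auto-restart attempt 1/10
channel irc: auto-restart attempt 2/10
gateway: heartbeat ok'

printf '%s\n' "$sample_log" | count_restart_attempts   # prints 2
```

In practice, pipe the output of the journalctl command shown above into `count_restart_attempts` instead of the sample text. A count that keeps climbing between checks suggests the channel is still looping and should be disabled.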
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-agenthyjack-agent-health-diagnostics": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Related Skills
calling-agent-squad
Activate a multi-agent team (the Squad) to manage complex projects, business tasks, or development workflows. The squad includes a Manager, Architect, Coder, Reviewer, and Observer. Use when the user wants to "call a squad", "start a project", or "deploy squad" with specialized roles and quality control loops.
incident-postmortem-assistant
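Organizes incident clues into a postmortem draft, distinguishing root causes, triggers, amplifiers, impact, and remediation actions. Use for incident, postmortem, and SRE workflows; do not use for assigning individual blame or altering the timeline.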
securityvitals
Security vitals checker for OpenClaw. Scans your installation, scores your setup, and shows you exactly what to fix. First scan in seconds.
cron-job-guardian
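Checks cron or timer configurations for frequency, idempotency, retry, logging, and concurrency risks. Use for cron, timer, and ops workflows; do not use for directly starting or stopping production jobs or as a replacement for real monitoring.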
harmonia
Check PyTorch, Transformers, and CUDA compatibility. Detect GPU, driver mismatches, and version conflicts in ML environments. Use when the user sets up ML/AI tools, installs torch or transformers, hits dependency errors, or asks about compatible versions.