
Use Case: DevOps & Monitoring Agent

health checks · log parsing · GitHub issues · service restarts

What you'll build

An always-on agent that checks your services on a schedule, parses logs for errors, opens a GitHub issue when something breaks, and can restart services via SSH, with Telegram alerts so you know before users do.

Security note: The bash-runner skill can execute arbitrary shell commands. Always configure an allowlist to limit which commands the agent can run. Never run this agent with root privileges.

Skills you need

| Skill | What it unlocks | Required? |
| --- | --- | --- |
| bash-runner | Run shell commands: curl health endpoints, check disk, parse logs with grep/awk | Core |
| file-reader | Read log files directly without a shell; useful for parsing large logs | Core |
| github-issues | Open, label, and comment on GitHub issues automatically on detected failures | Core |
| http-request | Hit HTTP health endpoints and check status codes and response bodies | Recommended |
| ssh-runner | Run commands on remote servers via SSH, for restarts and remote log reads | Optional |
| uptime-checker | Monitor URLs at intervals and alert on downtime or latency spikes | Optional |

SOUL.md template

SOUL.md: DevOps agent
# DevOps Agent

You monitor services, detect failures, and escalate or remediate automatically.

## Identity
- Name: OpsBot
- Role: On-call ops agent

## Health check process
1. Hit each health endpoint listed below
2. If status != 200 or response time > 2s, alert immediately
3. Check the error rate in logs; if it is above 1% over the last 5 minutes, alert
4. If same service has been down for 3+ consecutive checks: open a GitHub issue

## Escalation levels
- WARN: Telegram message only
- ERROR: Telegram + open GitHub issue (label: "incident", "automated")
- CRITICAL: Telegram + GitHub issue + attempt auto-restart (if in allowed list)

## Services to monitor
- API: https://api.myapp.com/health (expect: {"status":"ok"})
- Dashboard: https://app.myapp.com/ping (expect: 200)
- Worker: check process via: ps aux | grep worker.js

## Auto-restart allowed list
- pm2 restart api-server
- systemctl restart nginx

## Behaviour
- Do not restart the same service more than once per 15 minutes
- Always include the error message and timestamp in alerts
- GitHub issue title format: "[Incident] {service} down at {time}"
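The health check rules in the template can also be read as a small decision procedure. The Python sketch below is one possible reading of those rules, not a description of how OpenClaw evaluates SOUL.md. The names `CheckResult`, `ServiceState`, `record_check`, and `issue_title` are hypothetical, the timestamp format in the issue title is an assumption, and the CRITICAL level is left out because the template does not say when it applies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MAX_RESPONSE_SECONDS = 2.0   # "response time > 2s" in the template
MAX_ERROR_RATE = 0.01        # ">1% in last 5 min" in the template
ISSUE_AFTER_FAILURES = 3     # "3+ consecutive checks" in the template


@dataclass
class CheckResult:
    status: int
    elapsed_seconds: float


@dataclass
class ServiceState:
    name: str
    consecutive_failures: int = 0


def is_healthy(result: CheckResult) -> bool:
    """A check passes only with HTTP 200 and a response within the time limit."""
    return result.status == 200 and result.elapsed_seconds <= MAX_RESPONSE_SECONDS


def error_rate_too_high(error_lines: int, total_lines: int) -> bool:
    """True when more than 1% of the log lines in the window are errors."""
    if total_lines == 0:
        return False
    return error_lines / total_lines > MAX_ERROR_RATE


def record_check(state: ServiceState, result: CheckResult) -> str:
    """Update the failure streak and return "OK", "WARN", or "ERROR"."""
    if is_healthy(result):
        state.consecutive_failures = 0
        return "OK"
    state.consecutive_failures += 1
    if state.consecutive_failures >= ISSUE_AFTER_FAILURES:
        return "ERROR"   # Telegram + GitHub issue
    return "WARN"        # Telegram message only


def issue_title(service: str, when: datetime) -> str:
    """Format the title as "[Incident] {service} down at {time}" (time format assumed)."""
    return f"[Incident] {service} down at {when:%Y-%m-%d %H:%M} UTC"
```

Writing the thresholds as named constants makes it easier to see which numbers you may want to tune in your own SOUL.md.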

Config + command allowlist

Always restrict which shell commands the agent can run using the commandAllowlist config. This prevents accidental or malicious commands from executing.

openclaw.json: DevOps agent with allowlist
{
  "skills": [
    "official-bash-runner",
    "official-file-reader",
    "official-github-issues",
    "official-http-request"
  ],
  "model": "claude-sonnet-4-5",
  "soulPath": "./SOUL.md",
  "commandAllowlist": [
    "curl *",
    "ps aux",
    "tail -n * /var/log/*",
    "grep * /var/log/*",
    "pm2 restart *",
    "systemctl restart nginx",
    "df -h",
    "free -m"
  ],
  "env": {
    "GITHUB_TOKEN": "ghp_your_token"
  },
  "channel": {
    "type": "telegram",
    "token": "YOUR_BOT_TOKEN",
    "chatId": "YOUR_CHAT_ID"
  }
}
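To get a feel for how broad each allowlist pattern is, it can help to test candidate commands against the list offline. The Python sketch below assumes the patterns behave like shell-style globs and uses the standard `fnmatch` module; OpenClaw's actual matching rules may differ, and `is_allowed` is a hypothetical helper, not part of OpenClaw.

```python
from fnmatch import fnmatchcase

# Same patterns as the commandAllowlist in openclaw.json.
COMMAND_ALLOWLIST = [
    "curl *",
    "ps aux",
    "tail -n * /var/log/*",
    "grep * /var/log/*",
    "pm2 restart *",
    "systemctl restart nginx",
    "df -h",
    "free -m",
]


def is_allowed(command: str, allowlist: list[str] = COMMAND_ALLOWLIST) -> bool:
    """Return True if the command matches any pattern (glob semantics assumed)."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    return any(fnmatchcase(normalized, pattern) for pattern in allowlist)


if __name__ == "__main__":
    for candidate in [
        "systemctl restart nginx",
        "systemctl restart postgresql",
        "pm2 restart api-server",
        "ps aux | grep worker.js",
        "curl https://api.myapp.com/health; rm -rf /tmp/x",
    ]:
        print(f"{candidate!r}: {is_allowed(candidate)}")
```

Under this glob reading, `pm2 restart *` permits restarting any pm2 process, a trailing `*` also matches text after a `;`, and the piped worker check `ps aux | grep worker.js` does not match the exact `ps aux` entry. It is worth checking how your OpenClaw version treats pipes and command separators, and narrowing wildcards where you can.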

Health check schedule

Add to openclaw.json: health checks every 5 minutes
{
  "crons": [
    {
      "name": "Health check",
      "schedule": "*/5 * * * *",
      "task": "Run the health check process defined in SOUL.md. Check all services. Alert on failures."
    },
    {
      "name": "Daily log review",
      "schedule": "0 7 * * *",
      "task": "Read /var/log/app/error.log from the last 24 hours. Count errors by type, identify the top 3 most frequent errors, post a summary to Telegram."
    },
    {
      "name": "Disk space check",
      "schedule": "0 9 * * 1",
      "task": "Run df -h and free -m. If any disk partition is above 80% used, post a warning to Telegram with the partition path and usage percentage."
    }
  ]
}
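The `schedule` strings look like standard five-field cron expressions (minute, hour, day of month, month, day of week). Assuming that is how OpenClaw reads them, the Python sketch below shows when each one fires. It is a deliberately small matcher that only supports `*`, `*/n`, and plain numbers, which covers the three schedules in this guide; `field_matches` and `cron_matches` are hypothetical helpers.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field. Supports '*', '*/n', and comma-separated numbers only."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(schedule: str, when: datetime) -> bool:
    """Return True if a five-field cron expression fires at the given minute."""
    minute, hour, day, month, weekday = schedule.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day, when.day)
        and field_matches(month, when.month)
        and field_matches(weekday, cron_weekday)
    )


if __name__ == "__main__":
    monday_nine = datetime(2024, 1, 1, 9, 0)  # 1 January 2024 was a Monday
    for name, schedule in [
        ("Health check", "*/5 * * * *"),
        ("Daily log review", "0 7 * * *"),
        ("Disk space check", "0 9 * * 1"),
    ]:
        print(name, cron_matches(schedule, monday_nine))
```

Read this way, the health check runs every five minutes, the log review runs daily at 07:00, and the disk space check runs at 09:00 on Mondays only.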

SSH remote commands

To manage remote servers, add the ssh-runner skill and configure SSH key access. The agent will use your existing SSH config (~/.ssh/config).

Add SSH runner to config
{
  "skills": [
    "official-bash-runner",
    "official-ssh-runner",
    "official-github-issues",
    "official-http-request"
  ],
  "env": {
    "SSH_KEY_PATH": "~/.ssh/id_ed25519"
  }
}
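Before handing remote access to the agent, it can be useful to confirm by hand that key-based, non-interactive SSH works for the account you plan to use. The Python sketch below builds the kind of `ssh` invocation a remote log read boils down to. It does not show how ssh-runner works internally; the host `prod-server-1`, the user `deploy`, and the helper names are illustrative values only.

```python
import os
import shlex
import subprocess


def tail_command(log_path: str, lines: int) -> str:
    """Build the remote command string, quoting the path for the remote shell."""
    return f"tail -n {int(lines)} {shlex.quote(log_path)}"


def build_ssh_command(host: str, user: str, remote_command: str,
                      key_path: str = "~/.ssh/id_ed25519") -> list[str]:
    """Return the argv list for a non-interactive ssh call."""
    return [
        "ssh",
        "-i", os.path.expanduser(key_path),  # expand ~ ourselves; no shell is involved
        "-o", "BatchMode=yes",               # fail instead of prompting for a password
        "-o", "ConnectTimeout=10",
        f"{user}@{host}",
        remote_command,
    ]


def run_remote(host: str, user: str, remote_command: str) -> tuple[int, str]:
    """Run the command over SSH and return (exit code, stdout)."""
    result = subprocess.run(
        build_ssh_command(host, user, remote_command),
        capture_output=True, text=True, timeout=60, check=False,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Example values; replace with your own host, user, and log path.
    command = build_ssh_command(
        "prod-server-1", "deploy", tail_command("/var/log/app/error.log", 50)
    )
    print(shlex.join(command))
```

If the printed command fails when you run it manually, the agent is unlikely to fare better, so fix key access first.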

Example prompt for remote operations:

"SSH into prod-server-1 (user: deploy) and check the last 50 lines of /var/log/app/error.log. If there are any ERROR lines from the last 10 minutes, restart the api-server process with pm2 restart api-server and alert me."

Common issues

⚠ Agent runs a command not on the allowlist

OpenClaw will block the command and return an error. The agent will report it in the response. Add the specific command pattern to commandAllowlist if it's legitimate.

⚠ GitHub issue created multiple times for the same incident

Add deduplication logic to your SOUL.md: "Before opening a GitHub issue, search open issues for the service name. If a recent issue already exists, comment on it instead of opening a new one."
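The deduplication rule amounts to a lookup over the open issues before a new one is created. The Python sketch below illustrates that lookup on plain dictionaries shaped like GitHub issue objects (`number`, `title`, `created_at`). The function name `find_existing_incident` and the 24-hour definition of "recent" are assumptions; the agent itself follows the SOUL.md instruction rather than this code.

```python
from datetime import datetime, timedelta, timezone


def find_existing_incident(open_issues, service, now, max_age=timedelta(hours=24)):
    """Return the number of a recent open issue mentioning the service, else None."""
    needle = service.lower()
    for issue in open_issues:
        # GitHub timestamps look like "2024-01-02T03:04:05Z".
        created = datetime.fromisoformat(issue["created_at"].replace("Z", "+00:00"))
        if needle in issue["title"].lower() and now - created <= max_age:
            return issue["number"]
    return None


if __name__ == "__main__":
    issues = [
        {"number": 41, "title": "[Incident] Dashboard down at 2024-01-01 08:00",
         "created_at": "2024-01-01T08:00:00Z"},
        {"number": 42, "title": "[Incident] API down at 2024-01-02 03:00",
         "created_at": "2024-01-02T03:00:00Z"},
    ]
    now = datetime(2024, 1, 2, 4, 0, tzinfo=timezone.utc)
    existing = find_existing_incident(issues, "API", now)
    print("comment on issue" if existing else "open new issue", existing)
```

A match means the agent should comment on the existing issue; `None` means it is safe to open a new one.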

⚠ Health check cron runs but no alerts arrive

Test the Telegram channel manually first: ask the agent "Send a test message to Telegram". If that works, verify the health endpoint URLs in your SOUL.md are correct and reachable from the server.
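If the agent's test message does not arrive either, you can rule out a bad bot token or chat ID by calling the Telegram Bot API directly, outside OpenClaw. The Python sketch below uses the Bot API `sendMessage` method with only the standard library; the helper names and the message text are illustrative, and the token and chat ID are the placeholders from the config.

```python
import json
import urllib.parse
import urllib.request


def build_send_message_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a POST request for the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    return urllib.request.Request(url, data=data, method="POST")


def send_test_message(bot_token: str, chat_id: str) -> bool:
    """Send a test message and return the "ok" flag from Telegram's JSON reply."""
    request = build_send_message_request(bot_token, chat_id, "OpsBot test message")
    # urlopen raises urllib.error.HTTPError on rejected requests (for example, a bad token).
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("ok", False))


if __name__ == "__main__":
    # Replace the placeholders with the values from openclaw.json before running.
    print(send_test_message("YOUR_BOT_TOKEN", "YOUR_CHAT_ID"))
```

If this direct call succeeds but the agent's alerts still do not arrive, the problem is more likely in the channel config or the health check itself than in Telegram.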

⚠ Auto-restart loops: service keeps going down

The SOUL.md rule "Do not restart the same service more than once per 15 minutes" limits how often a restart is attempted, which prevents tight loops. If the service keeps failing after a restart, add a rule telling the agent to open a CRITICAL GitHub issue and stop retrying.
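The 15-minute rule is a per-service cooldown. The Python sketch below shows the bookkeeping it implies; `RestartGuard` is a hypothetical helper for illustration, since the agent applies the rule from SOUL.md rather than from code like this.

```python
from datetime import datetime, timedelta

RESTART_COOLDOWN = timedelta(minutes=15)  # from the SOUL.md behaviour rule


class RestartGuard:
    """Tracks the last restart per service and enforces a cooldown between restarts."""

    def __init__(self, cooldown: timedelta = RESTART_COOLDOWN):
        self.cooldown = cooldown
        self._last_restart: dict[str, datetime] = {}

    def may_restart(self, service: str, now: datetime) -> bool:
        """Return True and record the restart if the cooldown has elapsed."""
        last = self._last_restart.get(service)
        if last is not None and now - last < self.cooldown:
            return False  # too soon; escalate instead of restarting again
        self._last_restart[service] = now
        return True


if __name__ == "__main__":
    guard = RestartGuard()
    start = datetime(2024, 1, 1, 12, 0)
    print(guard.may_restart("api-server", start))                          # first restart
    print(guard.may_restart("api-server", start + timedelta(minutes=5)))   # inside cooldown
    print(guard.may_restart("api-server", start + timedelta(minutes=15)))  # cooldown elapsed
```

A denied restart is the natural point to escalate to a human instead of trying again.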
