Use Case: DevOps & Monitoring Agent

health checks · log parsing · GitHub issues · service restarts

What you'll build

An always-on agent that checks your services on a schedule, parses logs for errors, opens a GitHub issue when something breaks, and can restart services via SSH — with Telegram alerts so you know before users do.

Security note: The bash-runner skill can execute arbitrary shell commands. Always configure an allowlist to limit which commands the agent can run. Never run this agent with root privileges.

Skills you need

Skill	What it unlocks	Required?
bash-runner	Run shell commands: curl health endpoints, check disk, parse logs with grep/awk	Core
file-reader	Read log files directly without shell — useful for large log parsing	Core
github-issues	Open, label, and comment on GitHub issues automatically on detected failures	Core
http-request	Hit HTTP health endpoints and check status codes and response bodies	Recommended
ssh-runner	Run commands on remote servers via SSH — for restarts and remote log reads	Optional
uptime-checker	Monitor URLs at intervals and alert on downtime or latency spikes	Optional

SOUL.md template

SOUL.md — DevOps agent

# DevOps Agent

You monitor services, detect failures, and escalate or remediate automatically.

## Identity
- Name: OpsBot
- Role: On-call ops agent

## Health check process
1. Hit each health endpoint listed below
2. If status != 200 or response time > 2s: alert immediately
3. Check the error rate in logs: if it exceeds 1% over the last 5 minutes, alert
4. If same service has been down for 3+ consecutive checks: open a GitHub issue

## Escalation levels
- WARN: Telegram message only
- ERROR: Telegram + open GitHub issue (label: "incident", "automated")
- CRITICAL: Telegram + GitHub issue + attempt auto-restart (if in allowed list)

## Services to monitor
- API: https://api.myapp.com/health (expect: {"status":"ok"})
- Dashboard: https://app.myapp.com/ping (expect: 200)
- Worker: check process via: ps aux | grep '[w]orker.js' (the [w] keeps grep from matching its own process)

## Auto-restart allowed list
- pm2 restart api-server
- systemctl restart nginx

## Behaviour
- Do not restart the same service more than once per 15 minutes
- Always include the error message and timestamp in alerts
- GitHub issue title format: "[Incident] {service} down at {time}"
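
The health check rules in the template can be sketched in code to make the thresholds concrete. This is a minimal illustration, not how OpenClaw evaluates SOUL.md (the agent interprets the text itself). The names `CheckResult`, `FailureTracker`, and `issue_title` are hypothetical; the 200 status, 2 second limit, 3 consecutive failures, and title format come from the template above.

```python
from dataclasses import dataclass, field

MAX_RESPONSE_SECONDS = 2.0   # step 2 of the health check process
FAILURES_BEFORE_ISSUE = 3    # step 4 of the health check process


@dataclass
class CheckResult:
    service: str
    status_code: int
    response_seconds: float


@dataclass
class FailureTracker:
    """Counts consecutive failed checks per service (hypothetical helper)."""
    consecutive: dict = field(default_factory=dict)

    def record(self, result: CheckResult) -> str:
        """Return 'ok', 'alert', or 'open_issue' for one check result."""
        healthy = (result.status_code == 200
                   and result.response_seconds <= MAX_RESPONSE_SECONDS)
        if healthy:
            # A healthy check resets the streak for this service.
            self.consecutive[result.service] = 0
            return "ok"
        count = self.consecutive.get(result.service, 0) + 1
        self.consecutive[result.service] = count
        return "open_issue" if count >= FAILURES_BEFORE_ISSUE else "alert"


def issue_title(service: str, time: str) -> str:
    # Title format taken from the Behaviour section of the template.
    return f"[Incident] {service} down at {time}"
```

With a 5 minute schedule, three consecutive failures means a service has been unhealthy for roughly 10 to 15 minutes before an issue is opened.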

Config + command allowlist

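Always restrict which shell commands the agent can run using the commandAllowlist config. This keeps accidental or malicious commands from executing, but only as far as the patterns are narrow: a broad pattern such as curl * still permits any curl invocation, so prefer the most specific pattern that covers your use.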

openclaw.json — DevOps agent with allowlist

{
  "skills": [
    "official-bash-runner",
    "official-file-reader",
    "official-github-issues",
    "official-http-request"
  ],
  "model": "claude-sonnet-4-5",
  "soulPath": "./SOUL.md",
  "commandAllowlist": [
    "curl *",
    "ps aux",
    "tail -n * /var/log/*",
    "grep * /var/log/*",
    "pm2 restart api-server",
    "systemctl restart nginx",
    "df -h",
    "free -m"
  ],
  "env": {
    "GITHUB_TOKEN": "ghp_your_token"
  },
  "channel": {
    "type": "telegram",
    "token": "YOUR_BOT_TOKEN",
    "chatId": "YOUR_CHAT_ID"
  }
}
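
To get a feel for how wide a wildcard pattern is, the sketch below checks commands against a few of the patterns from the config using Python's `fnmatch` globbing. How OpenClaw actually matches commandAllowlist entries is not described in this guide, so whole-string glob matching is an assumption here, and `is_allowed` is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# A subset of the patterns from the config above.
ALLOWLIST = [
    "curl *",
    "tail -n * /var/log/*",
    "systemctl restart nginx",
    "df -h",
]


def is_allowed(command: str, allowlist=ALLOWLIST) -> bool:
    """True if the whole command string matches at least one glob pattern.

    Assumption: plain glob matching on the full string. The real matcher
    may tokenize the command or reject shell metacharacters.
    """
    return any(fnmatchcase(command, pattern) for pattern in allowlist)


if __name__ == "__main__":
    for cmd in [
        "systemctl restart nginx",
        "systemctl restart postgresql",
        "tail -n 50 /var/log/app/error.log",
        "curl https://api.myapp.com/health; rm -rf /tmp/x",
    ]:
        print(f"{cmd!r}: {is_allowed(cmd)}")
```

Under this assumption the last command is accepted, because `*` also matches `;` and everything after it. That is the reason to keep patterns as specific as possible and to test how your installed version treats chained commands before relying on a wildcard.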

Health check schedule

Add to openclaw.json: a health check every 5 minutes, a daily log review at 07:00, and a disk space check on Mondays at 09:00

{
  "crons": [
    {
      "name": "Health check",
      "schedule": "*/5 * * * *",
      "task": "Run the health check process defined in SOUL.md. Check all services. Alert on failures."
    },
    {
      "name": "Daily log review",
      "schedule": "0 7 * * *",
      "task": "Read the entries from the last 24 hours in /var/log/app/error.log. Count errors by type, identify the top 3 most frequent errors, and post a summary to Telegram."
    },
    {
      "name": "Disk space check",
      "schedule": "0 9 * * 1",
      "task": "Run df -h and free -m. If any disk partition is above 80% used, post a warning to Telegram with the partition path and usage percentage."
    }
  ]
}
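
If the three schedule strings are unfamiliar, the sketch below shows how a five-field cron expression (minute, hour, day of month, month, day of week) lines up with a timestamp. It is a deliberately small matcher, not the scheduler OpenClaw uses: it only understands `*`, `*/n`, and single numbers, and `field_matches` and `cron_matches` are hypothetical names.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """Match one cron field. Supports '*', '*/n', and a single number.

    The '*/n' branch is only right for fields that start at 0
    (minute, hour, day of week), which covers the schedules above.
    """
    if field == "*":
        return True
    if field.startswith("*/"):
        return value % int(field[2:]) == 0
    return value == int(field)


def cron_matches(expr: str, when: datetime) -> bool:
    """True if the cron expression would fire at the given minute."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = when.isoweekday() % 7  # cron: 0 = Sunday, 1 = Monday
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(day, when.day)
            and field_matches(month, when.month)
            and field_matches(weekday, cron_weekday))


if __name__ == "__main__":
    monday_9am = datetime(2024, 1, 1, 9, 0)  # 1 January 2024 was a Monday
    for expr in ["*/5 * * * *", "0 7 * * *", "0 9 * * 1"]:
        print(expr, cron_matches(expr, monday_9am))
```

Times are evaluated in whatever timezone the scheduler runs in, so confirm the server's timezone if the 07:00 summary arrives at an unexpected hour.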

SSH remote commands

To manage remote servers, add the ssh-runner skill and configure SSH key access. The agent will use your existing SSH config (~/.ssh/config).

Add SSH runner to config

{
  "skills": [
    "official-bash-runner",
    "official-file-reader",
    "official-ssh-runner",
    "official-github-issues",
    "official-http-request"
  ],
  "env": {
    "SSH_KEY_PATH": "~/.ssh/id_ed25519"
  }
}

Example prompt for remote operations:

"SSH into prod-server-1 (user: deploy) and check the last 50 lines of /var/log/app/error.log. If there are any ERROR lines from the last 10 minutes, restart the api-server process with pm2 restart api-server and alert me."
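
The "ERROR lines from the last 10 minutes" condition in that prompt can be sketched as a small filter. The log format is an assumption: each line is taken to start with an ISO 8601 timestamp, then a level, then the message (for example `2024-05-01T11:55:00 ERROR db timeout`). Adjust the parsing to your real format. `recent_errors` and `top_error_messages` are hypothetical helpers; the second one mirrors the "top 3 most frequent errors" step of the daily log review.

```python
from collections import Counter
from datetime import datetime, timedelta


def recent_errors(lines, now, window=timedelta(minutes=10)):
    """Return (timestamp, message) pairs for ERROR lines inside the window."""
    found = []
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3 or parts[1] != "ERROR":
            continue
        try:
            stamp = datetime.fromisoformat(parts[0])
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if now - window <= stamp <= now:
            found.append((stamp, parts[2]))
    return found


def top_error_messages(errors, n=3):
    """Count identical messages and return the n most frequent."""
    return Counter(message for _, message in errors).most_common(n)


if __name__ == "__main__":
    sample = [
        "2024-05-01T11:40:00 ERROR db timeout",
        "2024-05-01T11:55:00 ERROR db timeout",
        "2024-05-01T11:58:00 INFO worker started",
    ]
    hits = recent_errors(sample, now=datetime(2024, 5, 1, 12, 0))
    print(hits)
    print(top_error_messages(hits))
```

Only the 11:55 line falls inside the 10 minute window in this sample, so a restart would be triggered by one error, which may be more sensitive than you want in production.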

Common issues

⚠ Agent runs a command not on the allowlist

OpenClaw will block the command and return an error. The agent will report it in the response. Add the specific command pattern to commandAllowlist if it's legitimate.

⚠ GitHub issue created multiple times for the same incident

Add deduplication logic to your SOUL.md: "Before opening a GitHub issue, search open issues for the service name. If a recent issue already exists, comment on it instead of opening a new one."
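
The deduplication rule amounts to a lookup before create. A minimal sketch, assuming the open issues have already been fetched as a list of dictionaries with `number` and `title` keys (the shape is an assumption, and `find_open_incident` and `plan_issue_action` are hypothetical names). The "recent" part of the rule is left out here for brevity.

```python
def find_open_incident(open_issues, service):
    """Return the first open issue whose title mentions the service, else None."""
    needle = service.lower()
    for issue in open_issues:
        if needle in issue["title"].lower():
            return issue
    return None


def plan_issue_action(open_issues, service, time):
    """Decide whether to comment on an existing issue or create a new one."""
    existing = find_open_incident(open_issues, service)
    if existing is not None:
        return ("comment", existing["number"])
    # Title format from the Behaviour section of SOUL.md.
    return ("create", f"[Incident] {service} down at {time}")


if __name__ == "__main__":
    issues = [{"number": 12, "title": "[Incident] API down at 2024-05-01 11:45"}]
    print(plan_issue_action(issues, "API", "2024-05-01 12:00"))
    print(plan_issue_action(issues, "Dashboard", "2024-05-01 12:00"))
```

Substring matching on the service name is crude: a service called "API" would also match an issue about "API gateway". Filtering by the "incident" label first narrows the search.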

⚠ Health check cron runs but no alerts arrive

Test the Telegram channel manually first: ask the agent "Send a test message to Telegram". If that works, verify the health endpoint URLs in your SOUL.md are correct and reachable from the server.
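
If the agent's own test message does not arrive, it can help to bypass the agent and call the Telegram Bot API `sendMessage` method directly, which separates a bad token or chat ID from an agent problem. The sketch below only builds the request; `build_test_message` is a hypothetical helper, and the token and chat ID are the placeholders from the config.

```python
import json
import urllib.request


def build_test_message(token: str, chat_id: str, text: str = "OpsBot test message"):
    """Build (but do not send) a Telegram Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    body = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it for real (needs network access and valid credentials):
# request = build_test_message("YOUR_BOT_TOKEN", "YOUR_CHAT_ID")
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

If the direct call works and the agent's alerts still do not arrive, the channel block in openclaw.json is the next place to look.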

⚠ Auto-restart loops: service keeps going down

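The SOUL.md rule "Do not restart the same service more than once per 15 minutes" limits restart loops, but the template does not say what to do when a restart fails to fix the problem. Add a rule such as: "If a service is still down after an automatic restart, open a GitHub issue at CRITICAL level and stop restarting it until a human responds."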
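
The 15 minute rule is a per-service cooldown. The sketch below shows the bookkeeping it implies. It is an illustration only: the agent follows the rule from SOUL.md text, and whether it remembers restart times between scheduled runs is not covered in this guide. `RestartGuard` is a hypothetical name.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=15)  # from the Behaviour section of SOUL.md


class RestartGuard:
    """Tracks the last restart per service and enforces the cooldown."""

    def __init__(self):
        self.last_restart = {}

    def may_restart(self, service: str, now: datetime) -> bool:
        """Return True and record the restart if the cooldown has passed."""
        last = self.last_restart.get(service)
        if last is not None and now - last < COOLDOWN:
            return False
        self.last_restart[service] = now
        return True


if __name__ == "__main__":
    guard = RestartGuard()
    start = datetime(2024, 5, 1, 12, 0)
    print(guard.may_restart("api-server", start))
    print(guard.may_restart("api-server", start + timedelta(minutes=10)))
    print(guard.may_restart("api-server", start + timedelta(minutes=15)))
```

Each service has its own timer, so a blocked restart of api-server does not stop nginx from being restarted in the same window.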
