Use Case: DevOps & Monitoring Agent
health checks · log parsing · GitHub issues · service restarts
What you'll build
An always-on agent that checks your services on a schedule, parses logs for errors, opens a GitHub issue when something breaks, and can restart services via SSH, with Telegram alerts so you know before users do.
Security note: The bash-runner skill can execute arbitrary shell commands. Always configure an allowlist to limit which commands the agent can run. Never run this agent with root privileges.
Skills you need
| Skill | What it unlocks | Required? |
|---|---|---|
| bash-runner | Run shell commands: curl health endpoints, check disk, parse logs with grep/awk | Core |
| file-reader | Read log files directly without a shell; useful for parsing large logs | Core |
| github-issues | Open, label, and comment on GitHub issues automatically on detected failures | Core |
| http-request | Hit HTTP health endpoints and check status codes and response bodies | Recommended |
| ssh-runner | Run commands on remote servers via SSH, for restarts and remote log reads | Optional |
| uptime-checker | Monitor URLs at intervals and alert on downtime or latency spikes | Optional |
SOUL.md template
# DevOps Agent
You monitor services, detect failures, and escalate or remediate automatically.
## Identity
- Name: OpsBot
- Role: On-call ops agent
## Health check process
1. Hit each health endpoint listed below
2. If status != 200 or response time > 2s: alert immediately
3. Check error rate in logs: if >1% in last 5 min: alert
4. If same service has been down for 3+ consecutive checks: open a GitHub issue
## Escalation levels
- WARN: Telegram message only
- ERROR: Telegram + open GitHub issue (label: "incident", "automated")
- CRITICAL: Telegram + GitHub issue + attempt auto-restart (if in allowed list)
## Services to monitor
- API: https://api.myapp.com/health (expect: {"status":"ok"})
- Dashboard: https://app.myapp.com/ping (expect: 200)
- Worker: check process via: ps aux | grep worker.js
## Auto-restart allowed list
- pm2 restart api-server
- systemctl restart nginx
## Behaviour
- Do not restart the same service more than once per 15 minutes
- Always include the error message and timestamp in alerts
- GitHub issue title format: "[Incident] {service} down at {time}"
Config + command allowlist
Always restrict which shell commands the agent can run using the commandAllowlist config. This prevents accidental or malicious commands from executing.
{
"skills": [
"official-bash-runner",
"official-file-reader",
"official-github-issues",
"official-http-request"
],
"model": "claude-sonnet-4-5",
"soulPath": "./SOUL.md",
"commandAllowlist": [
"curl *",
"ps aux",
"tail -n * /var/log/*",
"grep * /var/log/*",
"pm2 restart *",
"systemctl restart nginx",
"df -h",
"free -m"
],
"env": {
"GITHUB_TOKEN": "ghp_your_token"
},
"channel": {
"type": "telegram",
"token": "YOUR_BOT_TOKEN",
"chatId": "YOUR_CHAT_ID"
}
}
Health check schedule
{
"crons": [
{
"name": "Health check",
"schedule": "*/5 * * * *",
"task": "Run the health check process defined in SOUL.md. Check all services. Alert on failures."
},
{
"name": "Daily log review",
"schedule": "0 7 * * *",
"task": "Read /var/log/app/error.log from the last 24 hours. Count errors by type, identify the top 3 most frequent errors, post a summary to Telegram."
},
{
"name": "Disk space check",
"schedule": "0 9 * * 1",
"task": "Run df -h and free -m. If any disk partition is above 80% used, post a warning to Telegram with the partition path and usage percentage."
}
]
}
SSH remote commands
To manage remote servers, add the ssh-runner skill and configure SSH key access. The agent will use your existing SSH config (~/.ssh/config).
{
"skills": [
"official-bash-runner",
"official-ssh-runner",
"official-github-issues",
"official-http-request"
],
"env": {
"SSH_KEY_PATH": "~/.ssh/id_ed25519"
}
}
Example prompt for remote operations:
"SSH into prod-server-1 (user: deploy) and check the last 50 lines of /var/log/app/error.log. If there are any ERROR lines from the last 10 minutes, restart the api-server process with pm2 restart api-server and alert me."
Common issues
Agent runs a command not on the allowlist
OpenClaw will block the command and return an error. The agent will report it in the response. Add the specific command pattern to commandAllowlist if it's legitimate.
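Before adding a pattern, it can help to check what it would actually admit. This guide does not say how OpenClaw matches allowlist patterns, so the sketch below assumes shell-style globbing where `*` matches any run of characters, using Python's `fnmatch`. The `is_allowed` helper is hypothetical and only illustrates that assumption.

```python
from fnmatch import fnmatchcase

# Patterns copied from the commandAllowlist in this guide.
ALLOWLIST = [
    "curl *",
    "ps aux",
    "tail -n * /var/log/*",
    "grep * /var/log/*",
    "pm2 restart *",
    "systemctl restart nginx",
    "df -h",
    "free -m",
]


def is_allowed(command: str, allowlist=ALLOWLIST) -> bool:
    """Return True if the command matches any pattern (ASSUMED glob semantics)."""
    normalized = " ".join(command.split())  # collapse repeated whitespace
    return any(fnmatchcase(normalized, pattern) for pattern in allowlist)
```

Under this assumption, two things stand out. A broad pattern such as `curl *` would also match `curl https://api.myapp.com/health; rm -rf /`, so narrow patterns are safer than wildcards. And the piped worker check from the SOUL.md (`ps aux | grep worker.js`) would not match the exact pattern `ps aux`, so it may need its own entry.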
GitHub issue created multiple times for the same incident
Add deduplication logic to your SOUL.md: "Before opening a GitHub issue, search open issues for the service name. If a recent issue already exists, comment on it instead of opening a new one."
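The decision this rule asks the agent to make can be sketched as a small function. The sketch assumes issues arrive as simple dictionaries with `title` and `created_at` fields and treats "recent" as 24 hours. Both are illustrative assumptions, not part of the github-issues skill.

```python
from datetime import datetime, timedelta, timezone


def find_existing_incident(open_issues, service, now, max_age=timedelta(hours=24)):
    """Return the newest open incident issue for `service` within max_age, or None."""
    # Matches the title format from the SOUL.md: "[Incident] {service} down at {time}"
    prefix = f"[Incident] {service} down"
    candidates = [
        issue for issue in open_issues
        if issue["title"].startswith(prefix)
        and now - issue["created_at"] <= max_age
    ]
    return max(candidates, key=lambda issue: issue["created_at"], default=None)


# Hypothetical, simplified issue data for illustration only.
now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
open_issues = [
    {"number": 41, "title": "[Incident] API down at 2025-01-01 09:00",
     "created_at": now - timedelta(hours=3)},
    {"number": 17, "title": "[Incident] Worker down at 2024-12-20 08:00",
     "created_at": now - timedelta(days=12)},
]
existing = find_existing_incident(open_issues, "API", now)
action = "comment" if existing else "open"  # comment on #41 instead of opening a new issue
```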
Health check cron runs but no alerts arrive
Test the Telegram channel manually first: ask the agent "Send a test message to Telegram". If that works, verify the health endpoint URLs in your SOUL.md are correct and reachable from the server.
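When debugging missing alerts, it can also help to apply the SOUL.md thresholds by hand to a status code and response time you measured yourself (for example with curl). The sketch below encodes those rules as plain functions. The function names are hypothetical, and the agent itself follows the SOUL.md text rather than this code.

```python
from typing import Optional, Sequence

MAX_RESPONSE_SECONDS = 2.0  # "response time > 2s" from the SOUL.md
FAILURES_BEFORE_ISSUE = 3   # "down for 3+ consecutive checks"


def failure_reason(status_code: int, elapsed_seconds: float) -> Optional[str]:
    """Return why a check should alert, or None if it passed."""
    if status_code != 200:
        return f"unexpected status {status_code}"
    if elapsed_seconds > MAX_RESPONSE_SECONDS:
        return f"slow response: {elapsed_seconds:.2f}s"
    return None


def should_open_issue(recent_results: Sequence[bool]) -> bool:
    """recent_results holds True for each passed check, oldest first."""
    tail = list(recent_results)[-FAILURES_BEFORE_ISSUE:]
    return len(tail) == FAILURES_BEFORE_ISSUE and not any(tail)
```

If a manual measurement gives a non-None reason but no alert arrived, the problem is more likely in the channel or the schedule than in the endpoint.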
Auto-restart loops: service keeps going down
The SOUL.md rule "Do not restart the same service more than once per 15 minutes" prevents loops. If the service keeps failing after a restart, the agent should open a CRITICAL GitHub issue and stop trying.
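The cooldown behaviour can be illustrated with a small in-memory guard. This is a minimal sketch of the rule, assuming restart times are tracked per service. `RestartGuard` is a hypothetical name, not an OpenClaw feature.

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(minutes=15)  # "not more than once per 15 minutes"


class RestartGuard:
    """Tracks the last restart per service and enforces the cooldown."""

    def __init__(self, cooldown: timedelta = COOLDOWN):
        self.cooldown = cooldown
        self._last_restart = {}  # service name -> datetime of last restart

    def decide(self, service: str, now: datetime) -> str:
        """Return "restart" if allowed, otherwise "escalate"."""
        last = self._last_restart.get(service)
        if last is not None and now - last < self.cooldown:
            # Restarted too recently: open an issue and stop retrying.
            return "escalate"
        self._last_restart[service] = now
        return "restart"
```

A second failure five minutes after a restart yields "escalate", while a failure of a different service, or of the same service after the cooldown has elapsed, yields "restart".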