itil-ops
ITIL-aligned incident, problem, and change management for AI agents. Use when: detecting service crashes, analyzing recurring failures, tracking incidents to resolution, performing root cause analysis, managing change requests, running health audits, or building operational review pipelines. Implements ITIL 4 practices adapted for autonomous agent operations: Incident Management, Problem Management, Change Management, Event Management, and Continual Improvement. Works with systemd, cron, journalctl, and coordination task boards.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/chefboyrdave21/itil-ops
ITIL Ops — IT Service Management for AI Agents
Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.
Core Concepts
Severity Levels
| Level | Meaning | Response | Example |
|---|---|---|---|
| P1 | Critical — service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM |
| P2 | High — degraded service | Alert within 1h | Service restarts, auth failures |
| P3 | Medium — non-critical issue | Next review cycle | Cron timeouts, broken files |
| P4 | Low — cosmetic/minor | Track, fix when convenient | Log warnings, config drift |
Incident vs Problem vs Change
- Incident: Something broke. Restore service ASAP. (reactive)
- Problem: Pattern of incidents. Find and fix root cause. (proactive)
- Change: Planned modification. Assess risk before executing. (controlled)
Incident Management
Detection Sources
Scan these in order of criticality:
- Service crashes — journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
- Cron failures — consecutive error count > 2 in job state files
- Health endpoints — HTTP health checks returning non-200
- Resource pressure — disk > 80%, RAM > 80%, swap active
- Data integrity — schema validation failures, broken files, load errors
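The service-crash check above can be sketched as a scan over captured journal text. This is a minimal illustration, not the skill's own script: the function name and the exact regular expressions are assumptions based on the signals listed, and should be adjusted to match what your services actually log.

```python
import re
from collections import Counter

# Assumed signatures for the crash signals named in the list above.
# Real log wording varies by service and systemd version.
CRASH_PATTERNS = {
    "watchdog_timeout": re.compile(r"watchdog timeout", re.IGNORECASE),
    "sigabrt": re.compile(r"SIGABRT|status=6/ABRT"),
    "sigsegv": re.compile(r"SIGSEGV|status=11/SEGV"),
    "core_dump": re.compile(r"core dump|dumped core|code=dumped", re.IGNORECASE),
}


def count_crash_signals(journal_text: str) -> Counter:
    """Count log lines matching each signature (one hit per line per signature)."""
    counts = Counter()
    for line in journal_text.splitlines():
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

The input text could come from running the journalctl command shown above through `subprocess.run(..., capture_output=True, text=True)` and passing its `stdout` to the function.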
Detection Script
Run scripts/itil-review.sh to scan all sources. It outputs:
- ITIL_CLEAR if nothing found (reply HEARTBEAT_OK)
- Formatted report with incidents and problems if issues detected
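A caller might map the script's output to a reply along these lines. This is a sketch only: `handle_review_output` is a hypothetical helper, and it assumes the script prints exactly ITIL_CLEAR (plus optional whitespace) when nothing is found.

```python
def handle_review_output(output: str) -> str:
    """Return the heartbeat reply on a clean scan, otherwise pass the report through."""
    text = output.strip()
    if text == "ITIL_CLEAR":
        return "HEARTBEAT_OK"
    return text
```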
Incident Lifecycle
DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
↓
(3+ occurrences)
↓
ESCALATE TO PROBLEM
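One way to model this lifecycle is a small state machine. The sketch below is illustrative: the class and function names are assumptions, and because the diagram does not make clear which state the escalation branches from, the escalation check here is kept independent of the current state.

```python
from enum import Enum


class State(Enum):
    DETECTED = "DETECTED"
    CLASSIFIED = "CLASSIFIED"
    DIAGNOSED = "DIAGNOSED"
    RESOLVED = "RESOLVED"
    CLOSED = "CLOSED"


# Each state may only advance to the next one in the lifecycle.
NEXT_STATE = {
    State.DETECTED: State.CLASSIFIED,
    State.CLASSIFIED: State.DIAGNOSED,
    State.DIAGNOSED: State.RESOLVED,
    State.RESOLVED: State.CLOSED,
}

ESCALATION_THRESHOLD = 3  # "3+ occurrences" from the diagram


def advance(state: State) -> State:
    """Move an incident one step forward; CLOSED is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is a terminal state")
    return NEXT_STATE[state]


def should_escalate_to_problem(occurrences: int) -> bool:
    """True once the same incident has been seen 3 or more times."""
    return occurrences >= ESCALATION_THRESHOLD
```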
Auto-Classification Rules
# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected
# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures
# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%
# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
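The rules above could be encoded roughly as follows. This is a sketch under stated assumptions: the `Signals` fields are hypothetical names for the measurements the rules mention, the most severe matching level wins, and "log noise" and service criticality are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    """Hypothetical snapshot of one review window; field names are illustrative."""
    crash_count_12h: int = 0
    disk_pct: float = 0.0
    ram_pct: float = 0.0
    data_loss: bool = False
    services_down: int = 0
    auth_failures: bool = False
    cron_consecutive_failures: int = 0
    broken_data_files: int = 0
    memory_load_errors_12h: int = 0
    config_warnings: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return the most severe matching level, or None if no rule fires."""
    if s.crash_count_12h >= 3 or s.disk_pct >= 90 or s.ram_pct >= 90 or s.data_loss:
        return "P1"
    if (1 <= s.crash_count_12h <= 2 or s.services_down >= 3
            or s.auth_failures or s.cron_consecutive_failures >= 5):
        return "P2"
    if (s.broken_data_files > 0 or s.memory_load_errors_12h > 10
            or 3 <= s.cron_consecutive_failures <= 4 or 80 <= s.disk_pct < 90):
        return "P3"
    # The rules list only the one-service-down case at P4; exactly two
    # services down is unspecified, so it is deliberately not handled here.
    if s.services_down == 1 or s.config_warnings > 0:
        return "P4"
    return None
```

For example, `classify(Signals(disk_pct=85))` returns `"P3"`, while `classify(Signals(disk_pct=92))` returns `"P1"`.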
Creating Incident Tickets
When incidents are found, create coordination tasks:
Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
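A ticket in this shape could be assembled with a small helper. The sketch below is an assumption-laden illustration: `build_incident_ticket` is a hypothetical name, the ISO 8601 timestamp format is a choice made here rather than something the template specifies, and how the ticket is submitted to a task board is left out.

```python
from datetime import datetime, timezone

SEVERITIES = {"P1", "P2", "P3", "P4"}
CATEGORIES = {"service", "cron", "memory", "disk", "security"}


def build_incident_ticket(description, severity, category, detail, impact,
                          action, detected=None):
    """Return (title, body) following the [ITIL-INC] template."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    # Assumed default: stamp with the current UTC time when none is given.
    detected = detected or datetime.now(timezone.utc)
    title = f"[ITIL-INC] {description}"
    body = "\n".join([
        f"- Severity: {severity}",
        f"- Category: {category}",
        f"- Detected: {detected.isoformat()}",
        f"- Detail: {detail}",
        f"- Impact: {impact}",
        f"- Action: {action}",
    ])
    return title, body
```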
Problem Management
Pattern Detection
An incident becomes a problem when:
- Same error occurs 3+ times in 24h
- Same incident type recurs across 2+ review cycles
- Multiple related incidents share a common root cause
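The first two criteria lend themselves to a mechanical check; the third (shared root cause) needs judgment and is not attempted here. In this sketch the function names and input shapes are assumptions, and "in 24h" is read as any sliding 24-hour window rather than only the most recent one.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def errors_recurring_in_window(events, threshold=3, window=timedelta(hours=24)):
    """events: iterable of (error_key, timestamp) pairs.

    Return the keys seen `threshold` or more times within any `window`.
    """
    by_key = defaultdict(list)
    for key, ts in events:
        by_key[key].append(ts)
    problems = set()
    for key, stamps in by_key.items():
        stamps.sort()
        # Check each run of `threshold` consecutive timestamps.
        for i in range(len(stamps) - threshold + 1):
            if stamps[i + threshold - 1] - stamps[i] <= window:
                problems.add(key)
                break
    return problems


def types_recurring_across_cycles(cycles, min_cycles=2):
    """cycles: list of collections of incident types, one per review cycle."""
    seen = defaultdict(int)
    for types in cycles:
        for t in set(types):
            seen[t] += 1
    return {t for t, n in seen.items() if n >= min_cycles}
```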
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-chefboyrdave21-itil-ops": {
      "enabled": true,
      "auto_update": true
    }
  }
}