Official Verified

itil-ops

ITIL-aligned incident, problem, and change management for AI agents. Use when: detecting service crashes, analyzing recurring failures, tracking incidents to resolution, performing root cause analysis, managing change requests, running health audits, or building operational review pipelines. Implements ITIL 4 practices adapted for autonomous agent operations: Incident Management, Problem Management, Change Management, Event Management, and Continual Improvement. Works with systemd, cron, journalctl, and coordination task boards.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/chefboyrdave21/itil-ops
ITIL Ops — IT Service Management for AI Agents

Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.

Core Concepts

Severity Levels

Level | Meaning | Response | Example
P1 | Critical: service down, data at risk | Immediate alert + auto-remediate | Crash loop, disk full, OOM
P2 | High: degraded service | Alert within 1h | Service restarts, auth failures
P3 | Medium: non-critical issue | Next review cycle | Cron timeouts, broken files
P4 | Low: cosmetic/minor | Track, fix when convenient | Log warnings, config drift

Incident vs Problem vs Change

  • Incident: Something broke. Restore service ASAP. (reactive)
  • Problem: Pattern of incidents. Find and fix root cause. (proactive)
  • Change: Planned modification. Assess risk before executing. (controlled)

Incident Management

Detection Sources

Scan these in order of criticality:

  1. Service crashes: scan journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
  2. Cron failures — consecutive error count > 2 in job state files
  3. Health endpoints — HTTP health checks returning non-200
  4. Resource pressure — disk > 80%, RAM > 80%, swap active
  5. Data integrity — schema validation failures, broken files, load errors
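
As a rough illustration of the first detection source, the sketch below counts crash indicators in journal text. The regular expressions and the names count_crash_signals and read_journal are assumptions made for this example, not part of the skill; log wording differs between services, so the patterns would need adapting.

```python
import re
import subprocess

# Patterns for the crash indicators named in the list above.
# The exact wording is an assumption; adjust it to your services' logs.
CRASH_PATTERNS = {
    "watchdog_timeout": re.compile(r"watchdog timeout", re.IGNORECASE),
    "sigabrt": re.compile(r"\bSIGABRT\b|status=6/ABRT"),
    "sigsegv": re.compile(r"\bSIGSEGV\b|status=11/SEGV"),
    "core_dump": re.compile(r"core[- ]dump|dumped core", re.IGNORECASE),
}


def count_crash_signals(journal_text: str) -> dict[str, int]:
    """Count journal lines that match each crash indicator."""
    counts = {name: 0 for name in CRASH_PATTERNS}
    for line in journal_text.splitlines():
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def read_journal(service: str, since: str = "12 hours ago") -> str:
    """Fetch recent user-unit logs using the journalctl invocation above."""
    result = subprocess.run(
        ["journalctl", "--user", "-u", service, "--since", since, "--no-pager"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

Keeping the counting separate from the journalctl call lets the matcher be exercised against saved log text without a running systemd user session.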

Detection Script

Run scripts/itil-review.sh to scan all sources. It outputs:

  • ITIL_CLEAR if nothing is found (in that case, reply HEARTBEAT_OK)
  • A formatted report listing incidents and problems if issues are detected
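
A minimal sketch of how an agent might act on that output. It assumes the script prints ITIL_CLEAR as its entire standard output when nothing is found; the function names reply_for_review and run_review are hypothetical.

```python
import subprocess

CLEAR_MARKER = "ITIL_CLEAR"
HEARTBEAT_REPLY = "HEARTBEAT_OK"


def reply_for_review(output: str) -> str:
    """Map the review script's output to the agent's reply."""
    text = output.strip()
    if text == CLEAR_MARKER:
        return HEARTBEAT_REPLY
    # Anything else is treated as the formatted report and passed through.
    return text


def run_review(script_path: str = "scripts/itil-review.sh") -> str:
    """Run the review script and return the reply to send."""
    result = subprocess.run(
        ["bash", script_path], capture_output=True, text=True, check=False
    )
    return reply_for_review(result.stdout)
```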

Incident Lifecycle

DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
                                      ↓
                              (3+ occurrences)
                                      ↓
                              ESCALATE TO PROBLEM
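
The lifecycle above can be modelled as a small state machine. This is an illustrative sketch, not the skill's implementation: the Incident class, its field names, and the choice to check escalation independently of the current state are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    DETECTED = 1
    CLASSIFIED = 2
    DIAGNOSED = 3
    RESOLVED = 4
    CLOSED = 5


# From the diagram: 3+ occurrences escalate the incident to a problem.
ESCALATION_THRESHOLD = 3


@dataclass
class Incident:
    title: str
    state: State = State.DETECTED
    occurrences: int = 1

    def advance(self) -> State:
        """Move to the next lifecycle state; CLOSED is terminal."""
        if self.state is State.CLOSED:
            raise ValueError("incident is already closed")
        self.state = State(self.state.value + 1)
        return self.state

    def needs_problem_escalation(self) -> bool:
        """True once the incident has recurred often enough to be a problem."""
        return self.occurrences >= ESCALATION_THRESHOLD
```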

Auto-Classification Rules

# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected

# P2 — High
- Service crashed 1-2 times
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures

# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%

# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
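
One way to encode these rules is a function that checks severities from highest to lowest and returns the first match. The Signals fields below are hypothetical names for the measurements the rules refer to. Two points are assumptions of this sketch because the rules leave them open: one or two services down is treated as P4, and log noise is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signals:
    crashes_12h: int = 0
    disk_pct: float = 0.0
    ram_pct: float = 0.0
    data_loss: bool = False
    services_down: int = 0
    auth_failures: bool = False
    cron_consecutive_failures: int = 0
    broken_data_files: int = 0
    memory_load_errors_12h: int = 0
    config_warnings: int = 0


def classify(s: Signals) -> Optional[str]:
    """Return the highest matching severity, or None if nothing matches."""
    if s.crashes_12h >= 3 or s.disk_pct >= 90 or s.ram_pct >= 90 or s.data_loss:
        return "P1"
    if (
        s.crashes_12h >= 1
        or s.services_down >= 3
        or s.auth_failures
        or s.cron_consecutive_failures >= 5
    ):
        return "P2"
    if (
        s.broken_data_files > 0
        or s.memory_load_errors_12h > 10
        or s.cron_consecutive_failures >= 3
        or s.disk_pct >= 80
    ):
        return "P3"
    # Assumption: 1-2 services down falls here; the rules only name 1 and 3+.
    if s.services_down >= 1 or s.config_warnings > 0:
        return "P4"
    return None
```

Checking P1 first means a host with both a crash loop and a config warning is reported once, at the higher severity.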

Creating Incident Tickets

When incidents are found, create coordination tasks:

Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
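
The template can be filled programmatically. The sketch below renders a title and body in the layout above; the IncidentTicket class is hypothetical, and the ISO 8601 timestamp format is an assumption, since the template only says <timestamp>.

```python
from dataclasses import dataclass
from datetime import datetime

SEVERITIES = {"P1", "P2", "P3", "P4"}
CATEGORIES = {"service", "cron", "memory", "disk", "security"}


@dataclass
class IncidentTicket:
    description: str
    severity: str
    category: str
    detected: datetime
    detail: str
    impact: str
    action: str

    def render(self) -> tuple[str, str]:
        """Return (title, body) following the ticket template."""
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        title = f"[ITIL-INC] {self.description}"
        body = "\n".join([
            f"- Severity: {self.severity}",
            f"- Category: {self.category}",
            f"- Detected: {self.detected.isoformat()}",  # ISO 8601 is assumed
            f"- Detail: {self.detail}",
            f"- Impact: {self.impact}",
            f"- Action: {self.action}",
        ])
        return title, body
```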

Problem Management

Pattern Detection

An incident becomes a problem when:

  • Same error occurs 3+ times in 24h
  • Same incident type recurs across 2+ review cycles
  • Multiple related incidents share a common root cause
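
The first two criteria are mechanical and can be sketched in code; the third (a shared root cause) needs diagnosis and is left out. The function names and input shapes below are assumptions, and the 24h window is treated as inclusive.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def repeated_errors(events, threshold=3, window=timedelta(hours=24)):
    """Return error keys seen `threshold` or more times within one window.

    `events` is an iterable of (timestamp, error_key) pairs.
    """
    by_key = defaultdict(list)
    for ts, key in events:
        by_key[key].append(ts)

    flagged = set()
    for key, stamps in by_key.items():
        stamps.sort()
        # Slide over each run of `threshold` consecutive timestamps.
        for i in range(len(stamps) - threshold + 1):
            if stamps[i + threshold - 1] - stamps[i] <= window:
                flagged.add(key)
                break
    return flagged


def recurring_types(cycles, min_cycles=2):
    """Return incident types that appear in `min_cycles` or more review cycles.

    `cycles` is a list with one collection of incident types per cycle.
    """
    counts = defaultdict(int)
    for types in cycles:
        for incident_type in set(types):
            counts[incident_type] += 1
    return {t for t, n in counts.items() if n >= min_cycles}
```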

Metadata

Stars: 3875
Views: 1
Updated: 2026-04-07

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-chefboyrdave21-itil-ops": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.