ITIL Ops — IT Service Management for AI Agents

Structured incident, problem, and change management adapted from ITIL 4 for autonomous agent operations.

Core Concepts

Severity Levels

Level	Meaning	Response	Example
P1	Critical — service down, data at risk	Immediate alert + auto-remediate	Crash loop, disk full, OOM
P2	High — degraded service	Alert within 1h	Service restarts, auth failures
P3	Medium — non-critical issue	Next review cycle	Cron timeouts, broken files
P4	Low — cosmetic/minor	Track, fix when convenient	Log warnings, config drift

Incident vs Problem vs Change

Incident: Something broke. Restore service ASAP. (reactive)
Problem: Pattern of incidents. Find and fix root cause. (proactive)
Change: Planned modification. Assess risk before executing. (controlled)

Incident Management

Detection Sources

Scan these in order of criticality:

Service crashes — journalctl --user -u SERVICE --since "12 hours ago" for watchdog timeouts, SIGABRT, SIGSEGV, core dumps
Cron failures — consecutive error count > 2 in job state files
Health endpoints — HTTP health checks returning non-200
Resource pressure — disk > 80%, RAM > 80%, swap active
Data integrity — schema validation failures, broken files, load errors
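
The first source can be sketched as a small filter over journal output. This is a minimal sketch, not the project's actual detection logic: the function name `count_crash_signals` is hypothetical, and the matched phrases are an assumption about how crash indicators appear in the log text.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count log lines that mention a crash indicator.
# Reads log text on stdin, so it works with journalctl output or a saved file.
count_crash_signals() {
  # grep -c prints the number of matching lines. It exits 1 when that
  # number is 0, so "|| true" keeps the function usable under "set -e".
  grep -E -i -c 'watchdog timeout|SIGABRT|SIGSEGV|core dump|dumped core' || true
}

# Usage (SERVICE is a placeholder, as in the list above):
# journalctl --user -u SERVICE --since "12 hours ago" | count_crash_signals
```

Reading from stdin rather than calling journalctl inside the function keeps the filter easy to try against a saved log.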

Detection Script

Run scripts/itil-review.sh to scan all sources. It outputs one of:

ITIL_CLEAR if nothing is found (reply HEARTBEAT_OK)
A formatted report listing incidents and problems if issues are detected
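
The mapping from script output to reply might look like the sketch below. It assumes the script prints exactly ITIL_CLEAR, and nothing else, on stdout when no issues are found; `review_reply` is a hypothetical name, not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: turn the review script's output into a reply.
# Reads the script output on stdin.
review_reply() {
  local output
  output=$(cat)
  if [ "$output" = "ITIL_CLEAR" ]; then
    echo "HEARTBEAT_OK"
  else
    # Anything else is treated as the formatted report and passed through.
    printf '%s\n' "$output"
  fi
}

# Usage: scripts/itil-review.sh | review_reply
```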

Incident Lifecycle

DETECTED → CLASSIFIED (P1-P4) → DIAGNOSED → RESOLVED → CLOSED
                                      ↓
                              (3+ occurrences)
                                      ↓
                              ESCALATE TO PROBLEM
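
The lifecycle above can be read as a linear state machine with one side exit. The following is only an illustration under that reading; `next_state` and `should_escalate` are hypothetical names, and treating the occurrence count as a plain integer argument is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the state that follows the given one.
# Returns non-zero for CLOSED or any unknown state.
next_state() {
  case "$1" in
    DETECTED)   echo CLASSIFIED ;;
    CLASSIFIED) echo DIAGNOSED ;;
    DIAGNOSED)  echo RESOLVED ;;
    RESOLVED)   echo CLOSED ;;
    *)          return 1 ;;
  esac
}

# Hypothetical helper: succeed when an incident has occurred 3 or more
# times and should be escalated to a problem.
should_escalate() {
  [ "$1" -ge 3 ]
}

# Usage:
# next_state DIAGNOSED          # prints RESOLVED
# should_escalate 3 && echo "ESCALATE TO PROBLEM"
```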

Auto-Classification Rules

# P1 — Critical
- Service crash count >= 3 in 12h (crash loop)
- Disk usage >= 90%
- RAM usage >= 90%
- Data loss detected

# P2 — High
- Service crashed 1-2 times in 12h
- 3+ services down simultaneously
- Auth/token failures affecting operations
- Cron job with 5+ consecutive failures

# P3 — Medium
- Broken data files (schema violations)
- Memory load errors > 10 in 12h
- Cron job with 3-4 consecutive failures
- Disk usage 80-89%

# P4 — Low
- 1 service down (non-critical)
- Config warnings
- Log noise
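
The numeric rules above translate directly into threshold checks. The sketch below covers only the disk, crash-count, and cron rules; the function names are hypothetical, and the NONE result for values below every threshold is an assumption, since the rules do not name that case.

```shell
#!/usr/bin/env bash
# Hypothetical classifiers for the numeric rules. Each takes one integer
# and prints a severity, or NONE (an assumed label) below all thresholds.

# Disk usage in percent: >= 90 is P1, 80-89 is P3.
classify_disk() {
  local pct=$1
  if   [ "$pct" -ge 90 ]; then echo P1
  elif [ "$pct" -ge 80 ]; then echo P3
  else echo NONE
  fi
}

# Service crash count in the 12h window: >= 3 is P1, 1-2 is P2.
classify_crashes() {
  local n=$1
  if   [ "$n" -ge 3 ]; then echo P1
  elif [ "$n" -ge 1 ]; then echo P2
  else echo NONE
  fi
}

# Consecutive cron failures: 5+ is P2, 3-4 is P3.
classify_cron() {
  local n=$1
  if   [ "$n" -ge 5 ]; then echo P2
  elif [ "$n" -ge 3 ]; then echo P3
  else echo NONE
  fi
}
```

Checking the highest severity first means each value maps to exactly one level.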

Creating Incident Tickets

When incidents are found, create coordination tasks using this template:

Title: [ITIL-INC] <brief description>
Body:
- Severity: P1/P2/P3/P4
- Category: service|cron|memory|disk|security
- Detected: <timestamp>
- Detail: <what happened>
- Impact: <what's affected>
- Action: <what to do>
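
Filling in the template can be sketched as a formatting function. `format_incident` and its argument order are hypothetical; how the resulting text is submitted as a coordination task is outside this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: render an incident ticket in the template above.
# Arguments: summary severity category detected detail impact action
format_incident() {
  local summary=$1 severity=$2 category=$3 detected=$4
  local detail=$5 impact=$6 action=$7
  printf 'Title: [ITIL-INC] %s\n' "$summary"
  printf 'Body:\n'
  # "--" stops option parsing, since these formats start with a dash.
  printf -- '- Severity: %s\n' "$severity"
  printf -- '- Category: %s\n' "$category"
  printf -- '- Detected: %s\n' "$detected"
  printf -- '- Detail: %s\n' "$detail"
  printf -- '- Impact: %s\n' "$impact"
  printf -- '- Action: %s\n' "$action"
}

# Usage (all values are illustrative):
# format_incident "Disk full on host" P1 disk "2024-01-01T00:00:00Z" \
#   "Disk at 95%" "Writes failing" "Free space"
```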

Problem Management

Pattern Detection

An incident becomes a problem when any of the following holds:

The same error occurs 3+ times in 24h
The same incident type recurs across 2+ review cycles
Multiple related incidents share a common root cause
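
The first condition, the same error 3+ times in 24h, can be sketched as a count over timestamped records. The input format here is an assumption: one incident per line, with an epoch timestamp in seconds and an error key separated by a tab. `find_problems` is a hypothetical name. The other two conditions need review-cycle history and root-cause data, which this sketch does not model.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print error keys seen 3 or more times in the 24h
# before NOW. Input on stdin: "<epoch_seconds><TAB><error_key>" per line
# (an assumed format). Argument: NOW as epoch seconds.
find_problems() {
  local now=$1
  awk -F '\t' -v now="$now" '
    # Count only records inside the 24h (86400 s) window.
    now - $1 <= 86400 { count[$2]++ }
    END { for (k in count) if (count[k] >= 3) print k }
  ' | sort
}

# Usage (incidents.tsv is an illustrative file name):
# find_problems "$(date +%s)" < incidents.tsv
```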
