afrexai-observability-engine

Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/1kalin/afrexai-observability-engine
Observability & Reliability Engineering

Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.


Quick Health Check (/16)

Score your current observability posture:

| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |

12-16: Production-grade. Focus on optimization.
8-11: Foundation exists. Fill the gaps systematically.
4-7: Significant risk. Prioritize alerting + incident response.
0-3: Flying blind. Start with Phase 1 immediately.
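
The scoring above can be automated with a few lines of Python. This is a minimal sketch, not part of the skill itself: the signal keys and the `score_posture` function are hypothetical names chosen for illustration, while the point values and band thresholds come from the table and ranges above.

```python
from typing import Dict, Tuple

# Hypothetical keys, one per row of the health check table.
SIGNALS = (
    "structured_logging", "metrics_collection", "distributed_tracing",
    "alerting", "incident_response", "slos_defined",
    "on_call_rotation", "cost_management",
)


def score_posture(ratings: Dict[str, int]) -> Tuple[int, str]:
    """Sum the per-signal ratings (0, 1, or 2) and map the total to a band."""
    missing = set(SIGNALS) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    for name in SIGNALS:
        if ratings[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")

    total = sum(ratings[name] for name in SIGNALS)
    if total >= 12:
        band = "Production-grade"
    elif total >= 8:
        band = "Foundation exists"
    elif total >= 4:
        band = "Significant risk"
    else:
        band = "Flying blind"
    return total, band
```

For example, a team that rates every signal as Weak (1) gets `score_posture({name: 1 for name in SIGNALS})`, which returns `(8, "Foundation exists")`.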


Phase 1: Structured Logging

Log Architecture

Application → Structured JSON → Log Router → Storage → Query Engine
                                    ↓
                              Alert Pipeline

Required Fields (Every Log Line)

| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
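
One way to emit these fields on every line is a custom formatter for Python's standard `logging` module. The sketch below is an illustration under assumptions, not the skill's own implementation: `JsonFormatter` is a hypothetical class name, and it assumes per-request values (`trace_id`, `span_id`, `duration_ms`) are passed through the `extra` argument of the logging call. Python's `WARNING` and `CRITICAL` levels are mapped to `warn` and `fatal` to match the enum in the table.

```python
import json
import logging
from datetime import datetime, timezone

# Map Python level names onto the level enum used in the table.
LEVEL_NAMES = {"WARNING": "warn", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the required fields."""

    def __init__(self, service: str, version: str, environment: str):
        super().__init__()
        self.static = {
            "service": service,
            "version": version,
            "environment": environment,
        }

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            # ISO-8601 UTC with millisecond precision and a trailing Z.
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVEL_NAMES.get(record.levelname, record.levelname.lower()),
            **self.static,
            "message": record.getMessage(),
            # Per-request fields arrive via `extra=`; absent ones become null.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-api", "v2.3.1", "production"))
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.info("Payment processed successfully", extra={"trace_id": "abc123def456", "span_id": "span_789", "duration_ms": 142})` then writes a single JSON line containing all nine fields.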

Contextual Fields (Add Per Domain)

# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42  # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..." # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true

Log Level Decision Tree

Metadata

Author: @1kalin
Stars: 2387
Views: 0
Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-1kalin-afrexai-observability-engine": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags

#observability #monitoring #logging #tracing #alerting #sre #incident-response #slo #metrics #devops #reliability #on-call #post-mortem #dashboards

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.