afrexai-observability-engine
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/1kalin/afrexai-observability-engine
Observability & Reliability Engineering
Complete system for building observable, reliable services — from structured logging to incident response to SLO-driven development.
Quick Health Check (/16)
Score your current observability posture:
| Signal | Healthy (2) | Weak (1) | Missing (0) |
|---|---|---|---|
| Structured logging | JSON logs with trace_id correlation | Logs exist but unstructured | Console.log / print statements |
| Metrics collection | RED/USE metrics with dashboards | Some metrics, no dashboards | No metrics |
| Distributed tracing | Full request path with sampling | Partial traces, key services only | No tracing |
| Alerting | SLO-based alerts with runbooks | Threshold alerts, some runbooks | No alerts or all-noise |
| Incident response | Defined process with roles + post-mortems | Ad-hoc response, some docs | "Whoever notices fixes it" |
| SLOs defined | SLOs with error budgets tracked weekly | Informal availability targets | No reliability targets |
| On-call rotation | Structured rotation with escalation | Informal "call someone" | No on-call |
| Cost management | Observability budget tracked monthly | Some awareness of costs | No idea what you spend |
12-16: Production-grade. Focus on optimization.
8-11: Foundation exists. Fill the gaps systematically.
4-7: Significant risk. Prioritize alerting + incident response.
0-3: Flying blind. Start with Phase 1 immediately.
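The scoring above is simple arithmetic: eight signals, each worth 0 to 2 points, summed and mapped to a band. A minimal sketch of that calculation follows; the signal keys and the function names `health_score` and `tier` are hypothetical and chosen only for illustration.

```python
from typing import Dict

# The eight signals from the table above; each is scored 0, 1, or 2.
# These key names are illustrative, not part of any tool.
SIGNALS = (
    "structured_logging", "metrics_collection", "distributed_tracing",
    "alerting", "incident_response", "slos_defined",
    "on_call_rotation", "cost_management",
)


def health_score(scores: Dict[str, int]) -> int:
    """Sum the per-signal scores after checking that all eight are present."""
    missing = set(SIGNALS) - set(scores)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in SIGNALS:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be 0, 1, or 2")
    return sum(scores[name] for name in SIGNALS)


def tier(total: int) -> str:
    """Map a total out of 16 to the bands listed above."""
    if not 0 <= total <= 16:
        raise ValueError("total must be between 0 and 16")
    if total >= 12:
        return "Production-grade"
    if total >= 8:
        return "Foundation exists"
    if total >= 4:
        return "Significant risk"
    return "Flying blind"


# Example: healthy logging and metrics, weak everywhere else.
example = {name: 1 for name in SIGNALS}
example["structured_logging"] = 2
example["metrics_collection"] = 2
example_total = health_score(example)
example_tier = tier(example_total)
```

Re-running the same scoring each quarter gives a rough trend line without needing any tooling.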
Phase 1: Structured Logging
Log Architecture
Application → Structured JSON → Log Router → Storage → Query Engine
↓
Alert Pipeline
Required Fields (Every Log Line)
| Field | Type | Purpose | Example |
|---|---|---|---|
| timestamp | ISO-8601 UTC | When | 2026-02-22T18:30:00.123Z |
| level | enum | Severity | info, warn, error, fatal |
| service | string | Which service | payment-api |
| version | string | Which deploy | v2.3.1 |
| environment | string | Which env | production |
| message | string | What happened | Payment processed successfully |
| trace_id | string | Request correlation | abc123def456 |
| span_id | string | Operation within trace | span_789 |
| duration_ms | number | How long | 142 |
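One way to guarantee these fields on every line is a formatter that builds the JSON object itself. The sketch below uses Python's standard `logging` module; the class name `JsonFormatter`, the helper `build_logger`, the module-level constants, and the mapping from Python level names to the enum above are all assumptions made for illustration.

```python
import json
import logging
from datetime import datetime, timezone

# Assumed deployment metadata; in practice this would come from config or env vars.
SERVICE = "payment-api"
VERSION = "v2.3.1"
ENVIRONMENT = "production"

# Map Python's level names onto the level enum from the table (an assumption).
LEVELS = {"DEBUG": "debug", "INFO": "info", "WARNING": "warn",
          "ERROR": "error", "CRITICAL": "fatal"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object carrying the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": LEVELS.get(record.levelname, record.levelname.lower()),
            "service": SERVICE,
            "version": VERSION,
            "environment": ENVIRONMENT,
            "message": record.getMessage(),
            # Correlation fields arrive via `extra=`; None if the caller omitted them.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "duration_ms": getattr(record, "duration_ms", None),
        }
        return json.dumps(entry)


def build_logger(name: str = SERVICE) -> logging.Logger:
    """Return a logger that writes one JSON object per line to stderr."""
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger()
logger.info(
    "Payment processed successfully",
    extra={"trace_id": "abc123def456", "span_id": "span_789", "duration_ms": 142},
)
```

Emitting `null` for a missing `trace_id` rather than dropping the key keeps the schema stable, which tends to make downstream queries simpler.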
Contextual Fields (Add Per Domain)
# HTTP request context
http:
  method: POST
  path: /api/v1/orders
  status: 201
  client_ip: 203.0.113.42  # Anonymize in logs if needed
  user_agent: "Mozilla/5.0..."
  request_id: "req_abc123"

# Business context
business:
  user_id: "usr_456"
  tenant_id: "tenant_789"
  order_id: "ord_012"
  action: "checkout"
  amount_cents: 4999
  currency: "USD"

# Error context
error:
  type: "PaymentDeclinedError"
  message: "Card declined: insufficient funds"
  code: "CARD_DECLINED"
  stack: "..."  # Only in non-production or DEBUG level
  retry_count: 2
  retryable: true
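Two comments in the fields above imply logic: anonymizing `client_ip` and including `stack` only outside production or at DEBUG level. The following is a minimal sketch of both. The helper names `anonymize_ip` and `error_context` are hypothetical, and truncating to a /24 (IPv4) or /48 (IPv6) prefix is one common anonymization choice, not something the source prescribes.

```python
import ipaddress
import traceback
from typing import Any, Dict


def anonymize_ip(ip: str) -> str:
    """Zero the host bits: keep a /24 for IPv4 and a /48 for IPv6 (assumed policy)."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def error_context(exc: BaseException, *, code: str, retry_count: int,
                  retryable: bool, environment: str,
                  debug: bool = False) -> Dict[str, Any]:
    """Build the `error` object; attach the stack only off-production or in debug."""
    ctx: Dict[str, Any] = {
        "type": type(exc).__name__,
        "message": str(exc),
        "code": code,
        "retry_count": retry_count,
        "retryable": retryable,
    }
    if environment != "production" or debug:
        ctx["stack"] = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
    return ctx


class PaymentDeclinedError(Exception):
    """Illustrative domain error matching the example above."""


try:
    raise PaymentDeclinedError("Card declined: insufficient funds")
except PaymentDeclinedError as err:
    prod_ctx = error_context(err, code="CARD_DECLINED", retry_count=2,
                             retryable=True, environment="production")
    staging_ctx = error_context(err, code="CARD_DECLINED", retry_count=2,
                                retryable=True, environment="staging")

masked_ip = anonymize_ip("203.0.113.42")
```

Keeping the stack out of production logs limits both volume and the chance of leaking internals, while the `type` and `code` fields still allow grouping by error class.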
Log Level Decision Tree
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-1kalin-afrexai-observability-engine": {
      "enabled": true,
      "auto_update": true
    }
  }
}