afrexai-observability-engine
Complete observability & reliability engineering system. Use when designing monitoring, implementing structured logging, setting up distributed tracing, building alerting systems, creating SLO/SLI frameworks, running incident response, conducting post-mortems, or auditing system reliability. Covers all three pillars (logs/metrics/traces), alert design, dashboard architecture, on-call operations, chaos engineering, and cost optimization.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/1kalin/afrexai-observability-engine
What This Skill Does
The afrexai-observability-engine is a comprehensive framework designed to establish production-grade observability and reliability engineering within your software stack. It serves as an architectural advisor and implementation assistant for the three pillars of observability: logging, metrics, and tracing. By moving away from unstructured print statements toward standardized, context-rich JSON logging, this skill helps you debug complex distributed systems, reduce mean-time-to-recovery (MTTR), and establish proactive alerting systems.
Beyond basic monitoring, this skill provides templates for SLO/SLI frameworks, incident response protocols, and chaos engineering experiments. It helps you move from a reactive posture, where you only learn a system is down after a customer complains, to a proactive one where you monitor error budgets, track system health via RED (rate, errors, duration) and USE (utilization, saturation, errors) metrics, and optimize your cloud observability spend.
Installation
To integrate this skill into your OpenClaw environment, execute the following command in your terminal:
clawhub install openclaw/skills/skills/1kalin/afrexai-observability-engine
Use Cases
- Production Audits: Use the 'Quick Health Check' table to evaluate your current stack maturity and identify immediate gaps in your reliability roadmap.
- Log Standardization: Standardize your logging output by implementing the mandatory field schema (timestamp, level, service, trace_id, etc.) to enable cross-service request correlation.
- Incident Management: Design automated incident response processes, including role definition and post-mortem templates that drive continuous learning.
- SLO Implementation: Define Service Level Objectives that align technical reliability targets with business outcomes, ensuring your development cycle is protected by measured error budgets.
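The mandatory log fields named under Log Standardization (timestamp, level, service, trace_id) could be emitted along these lines. This is a minimal sketch using only Python's standard library, not the skill's own implementation: the class name JsonFormatter, the helper build_logger, and the extra "message" field are assumptions made for illustration, and the remaining fields the skill covers under "etc." are not guessed at here.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # assumed: one service name per process

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            # record.created is a Unix timestamp; emit it as ISO 8601 in UTC.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id is expected via `extra=`; None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
            # "message" is an assumption; it is not in the field list quoted above.
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]  # replace handlers so repeated calls do not duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as build_logger("checkout").info("payment authorized", extra={"trace_id": "abc123"}) then produces one JSON line per event, and every service that logs the same trace_id can be joined on it during an investigation.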
Example Prompts
- "Analyze my current observability setup: I have basic metrics and logs but no structured tracing. How do I bridge the gap for a microservices architecture?"
- "Draft a post-mortem template for a P0 database incident that focuses on blameless root-cause analysis and actionable follow-up tasks."
- "Help me design an SLO for my checkout service. What should the SLI be, and how do I calculate the error budget based on a 99.9% availability target?"
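For the last prompt, the error-budget arithmetic behind a 99.9% availability target can be sketched as follows. The function names and the 30-day window are assumptions chosen for illustration; the skill itself may use a different window or a different SLI definition.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a time-based availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of downtime.
    print(round(error_budget_minutes(0.999), 1))
    # 250 failures out of 1,000,000 requests spends about a quarter of the budget.
    print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

The first function treats availability as time-based; the second treats it as request-based, where the SLI is the share of successful requests. Which one fits the checkout service depends on how its SLI is defined.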
Tips & Limitations
- Tip: Always start by standardizing your logging structure before attempting to build complex dashboards; logs are the foundation upon which your metrics and traces will eventually rely.
- Tip: When configuring alerts, prioritize noise reduction by implementing SLO-based alerting rather than raw threshold alerts.
- Limitation: This skill acts as an architectural guide. It provides the frameworks, schema, and best practices, but actual implementation of log shippers (like Fluentd or Logstash) or metric backends (like Prometheus or Datadog) still requires manual configuration in your infrastructure.
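The tip about SLO-based alerting can be illustrated with a burn-rate check. This is a sketch under stated assumptions rather than the skill's own rule set: the function names are hypothetical, and the default threshold of 14.4 corresponds to spending 2% of a 30-day error budget within one hour, a value commonly used in multi-window burn-rate alerting.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_window_error_ratio: float,
    short_window_error_ratio: float,
    slo_target: float,
    threshold: float = 14.4,  # assumed: 2% of a 30-day budget burned in 1 hour
) -> bool:
    """Page only when both windows burn budget faster than the threshold.

    The long window (for example 1 hour) shows the burn is significant;
    the short window (for example 5 minutes) shows it is still happening.
    """
    return (
        burn_rate(long_window_error_ratio, slo_target) >= threshold
        and burn_rate(short_window_error_ratio, slo_target) >= threshold
    )
```

Compared with a raw threshold such as "error rate above 1%", this ties the alert to how quickly the error budget is being consumed, and requiring both windows to agree suppresses pages for spikes that have already recovered.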
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-1kalin-afrexai-observability-engine": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags
Flags: data-collection
Related Skills
doctorbot-ci-validator
Stop failing in production. Validate your GitHub Actions, GitLab CI & Keep workflows offline with surgical precision. Born from Keep bounty research, perfected for agents.
incident-postmortem-assistant
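Organizes incident clues into a post-mortem draft, distinguishing root causes, triggers, amplifiers, impact, and remediation actions; use for incident, postmortem, and SRE workflows; do not use for assigning blame to individuals or tampering with the timeline.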
securityvitals
Security vitals checker for OpenClaw. Scans your installation, scores your setup, and shows you exactly what to fix. First scan in seconds.
sealvera
Tamper-evident audit trail for AI agent decisions. Use when logging LLM decisions, setting up AI compliance, auditing agents for EU AI Act, HIPAA, GDPR or SOC 2, or when a user asks about AI decision audit trails, explainability, or SealVera.
afrexai-startup-metrics-engine
Complete startup metrics command center — from raw data to investor-ready dashboards. Covers every stage (pre-seed to Series B+), every model (SaaS, marketplace, consumer, hardware), with diagnostic frameworks, benchmark databases, and board-ready reporting.