ClawKit Reliability Toolkit
Official Verified developer tools Safety 4/5

sre

SRE expert for incident response, production troubleshooting, root cause analysis, post-mortems, and runbooks. Use for outages, performance issues, or SEV incidents.

Why use this skill?

Improve production stability with the SRE skill for OpenClaw, which provides expert-level guidance for incident response, root cause analysis, and post-mortem generation.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/anton-abyzov/sw-sre

What This Skill Does

The SRE skill turns your OpenClaw agent into a Site Reliability Engineering (SRE) expert for production environments. It provides structured support for incident response, real-time performance troubleshooting, root cause analysis (RCA), and formal post-mortem reports. Whether you are handling a critical SEV-1 outage or investigating subtle latency spikes in a distributed microservices architecture, the agent guides you through industry-standard methodologies to help you maintain system stability.

Installation

To integrate the SRE expert into your OpenClaw environment, run the following command in your terminal:

clawhub install openclaw/skills/skills/anton-abyzov/sw-sre

Make sure you have the latest version of OpenClaw installed so that it remains compatible with the skill repository.

Use Cases

  • Incident Response: Immediate guidance during system outages or performance degradation, providing step-by-step triage protocols.
  • Root Cause Analysis: Parsing logs and metrics to identify the underlying source of failure, moving beyond superficial symptoms to architectural weaknesses.
  • Post-Mortem Documentation: Drafting professional, comprehensive incident reports that fulfill organizational compliance and learning requirements.
  • Runbook Generation: Creating automated or manual operation guides to ensure repeatable resolution for recurring system issues.

Example Prompts

  1. "We are seeing a 500-series error spike on the payment gateway following the latest deployment; please help me triage the logs to find the root cause."
  2. "Draft a post-mortem document for the database outage we experienced yesterday, focusing on the mitigation timeline and preventive measures for connection pooling."
  3. "Create a standard operating procedure (SOP) runbook for clearing stale cache clusters in our Redis instance during peak traffic hours."

Tips & Limitations

For large datasets or extensive, multi-layered incidents, the SRE agent generates its output incrementally. If your report exceeds 1000 lines, the agent pauses and asks which phase (Triage, RCA, Mitigation, or Prevention) you would like to proceed with next. This keeps complex generation tasks stable. Always verify the agent's output against your production environment's specific topology; supplying context about your private infrastructure remains your responsibility.
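The pause-and-ask behavior described above can be pictured with a small Python sketch. This is not the skill's actual implementation; the function name next_step, the use of an estimated line count, and the choice to hide phases that are already completed are all assumptions made for illustration. Only the four phase names and the 1000-line threshold come from the text.

```python
# Illustrative sketch only: the skill's real logic is not documented here.
PHASES = ["Triage", "RCA", "Mitigation", "Prevention"]  # phase names from the text
LINE_LIMIT = 1000  # threshold from the text


def next_step(estimated_lines: int, completed: list[str]) -> str:
    """Decide what to do next for a report of the given estimated size.

    `estimated_lines` and `completed` are hypothetical inputs; a real agent
    may track progress differently.
    """
    if estimated_lines <= LINE_LIMIT:
        # Small reports are produced in a single pass.
        return "generate full report"
    # Assumption: phases already generated are not offered again.
    remaining = [phase for phase in PHASES if phase not in completed]
    if not remaining:
        return "report complete"
    return "ask user to choose: " + ", ".join(remaining)


if __name__ == "__main__":
    print(next_step(400, []))
    print(next_step(1500, ["Triage"]))
```

In practice this means a long post-mortem arrives in several turns, so it helps to decide in advance which phase matters most for the incident at hand.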

Metadata

Stars: 1100
Views: 0
Updated: 2026-02-17

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-anton-abyzov-sw-sre": {
      "enabled": true,
      "auto_update": true
    }
  }
}
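If clawhub.json already contains other plugin entries, pasting by hand risks overwriting them. The following Python sketch shows one way to merge the entry above into an existing configuration. It assumes the file is plain JSON with a top-level "plugins" object, as in the snippet; the helper names enable_plugin and update_config_file and the placeholder key "some-other-plugin" are hypothetical, not part of any ClawKit or OpenClaw tooling.

```python
import json
from pathlib import Path

# Plugin key and settings copied from the configuration snippet above.
PLUGIN_ID = "official-anton-abyzov-sw-sre"
PLUGIN_SETTINGS = {"enabled": True, "auto_update": True}


def enable_plugin(config: dict, plugin_id: str = PLUGIN_ID) -> dict:
    """Return a copy of `config` with the plugin entry added under "plugins"."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))  # keep any existing entries
    plugins[plugin_id] = dict(PLUGIN_SETTINGS)
    merged["plugins"] = plugins
    return merged


def update_config_file(path: Path) -> None:
    """Add the plugin entry to the JSON file at `path`, creating it if missing."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    path.write_text(json.dumps(enable_plugin(config), indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    # In-memory demonstration; "some-other-plugin" is a made-up existing entry.
    existing = {"plugins": {"some-other-plugin": {"enabled": False}}}
    print(json.dumps(enable_plugin(existing), indent=2))
```

To apply this to a real file, call update_config_file(Path("clawhub.json")) from the directory that holds your configuration; where that file lives in your setup is an assumption you should check first.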

Tags (AI)

#sre #incident-response #devops #monitoring #reliability
Safety Score: 4/5

Flags: file-read, code-execution