Official Verified developer tools Safety 4/5

sre

SRE expert for incident response, production troubleshooting, root cause analysis, post-mortems, and runbooks. Use for outages, performance issues, or SEV incidents.

Why use this skill?

Improve production stability with the SRE skill for OpenClaw, which provides expert-level guidance for incident response, root cause analysis, and post-mortem generation.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/anton-abyzov/sw-sre

What This Skill Does

The SRE skill equips your OpenClaw agent with Site Reliability Engineering expertise. Built for the pressures of production environments, it provides structured support for incident response, real-time performance troubleshooting, root cause analysis (RCA), and formal post-mortem reports. Whether you are handling a critical SEV-1 outage or investigating subtle latency spikes in a distributed microservices architecture, the agent guides you through industry-standard methodologies to help keep systems stable.

Installation

To integrate the SRE expert into your OpenClaw environment, execute the following command in your terminal:

clawhub install openclaw/skills/skills/anton-abyzov/sw-sre

Ensure you have the latest version of OpenClaw installed to maintain compatibility with the repository features.

Use Cases

Incident Response: Immediate guidance during system outages or performance degradation, providing step-by-step triage protocols.
Root Cause Analysis: Parsing logs and metrics to identify the underlying source of failure, moving beyond superficial symptoms to architectural weaknesses.
Post-Mortem Documentation: Drafting professional, comprehensive incident reports that fulfill organizational compliance and learning requirements.
Runbook Generation: Creating automated or manual operation guides to ensure repeatable resolution for recurring system issues.

Example Prompts

"We are seeing a 500-series error spike on the payment gateway following the latest deployment; please help me triage the logs to find the root cause."
"Draft a post-mortem document for the database outage we experienced yesterday, focusing on the mitigation timeline and preventive measures for connection pooling."
"Create a standard operating procedure (SOP) runbook for clearing stale cache clusters in our Redis instance during peak traffic hours."

Tips & Limitations

For large datasets or multi-layered incidents, the SRE agent generates its output incrementally. If a report exceeds 1000 lines, the agent pauses and asks which phase (Triage, RCA, Mitigation, or Prevention) you would like to continue with next. This keeps long, complex generation tasks stable. Always verify the agent's output against your production environment's actual topology; supplying accurate context about your private infrastructure remains your responsibility.
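The pause-and-ask behavior described above can be pictured with a small sketch. This is not the skill's actual implementation: the `ReportSession` class, its method names, and the rule of checking the limit after each phase are all assumptions made for illustration. Only the four phase names and the 1000-line threshold come from the description.

```python
from dataclasses import dataclass, field

# Phase names and the line limit come from the skill description.
# Everything else in this sketch is hypothetical.
PHASES = ["Triage", "RCA", "Mitigation", "Prevention"]
LINE_LIMIT = 1000


@dataclass
class ReportSession:
    """Tracks emitted report lines and the phases still to be generated."""
    limit: int = LINE_LIMIT
    emitted: list = field(default_factory=list)
    pending: list = field(default_factory=lambda: list(PHASES))

    def add_phase(self, phase, lines):
        """Append one phase's lines; return True if the session should pause and ask."""
        if phase not in self.pending:
            raise ValueError(f"unknown or already generated phase: {phase}")
        self.pending.remove(phase)
        self.emitted.extend(lines)
        # Pause only when the report is over the limit and work remains.
        return len(self.emitted) > self.limit and bool(self.pending)

    def prompt(self):
        """Question put to the user when the session pauses."""
        return "Which phase next? " + ", ".join(self.pending)


# Small demonstration with a low limit so the pause is easy to see.
session = ReportSession(limit=5)
session.add_phase("Triage", ["line"] * 3)            # under the limit, no pause
if session.add_phase("RCA", ["line"] * 4):           # 7 lines > 5, so pause
    print(session.prompt())  # Which phase next? Mitigation, Prevention
```

Letting the user choose the next phase, rather than continuing automatically, keeps each generation step short and lets you skip phases you do not need.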

Metadata

Author: @anton-abyzov

Stars: 1100

Updated: 2026-02-17

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-anton-abyzov-sw-sre": {
      "enabled": true,
      "auto_update": true
    }
  }
}
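If your clawhub.json already contains other settings, pasting the snippet over the file would discard them. Below is a minimal sketch of merging the entry instead. The function names are hypothetical, and it assumes clawhub.json is plain JSON with a top-level "plugins" object, as the snippet above suggests. The file's location is not stated here, so the path is left to the caller.

```python
import copy
import json
from pathlib import Path

# Plugin key and settings are taken from the configuration snippet above.
PLUGIN_ID = "official-anton-abyzov-sw-sre"
PLUGIN_SETTINGS = {"enabled": True, "auto_update": True}


def merge_plugin(config):
    """Return a copy of config with the SRE plugin entry added or updated."""
    merged = copy.deepcopy(config)
    plugins = merged.setdefault("plugins", {})
    plugins[PLUGIN_ID] = dict(PLUGIN_SETTINGS)
    return merged


def enable_in_file(config_path):
    """Merge the plugin entry into the JSON file at config_path (hypothetical helper)."""
    path = Path(config_path)
    # Start from an empty config if the file does not exist yet.
    config = json.loads(path.read_text()) if path.exists() else {}
    merged = merge_plugin(config)
    path.write_text(json.dumps(merged, indent=2) + "\n")
    return merged
```

Calling `enable_in_file("clawhub.json")` from the directory that holds the file would leave any existing plugin entries and other top-level keys unchanged.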

Tags (AI)

#sre #incident-response #devops #monitoring #reliability

Safety Score: 4/5

Flags: file-read, code-execution
