sre-practices
Deep SRE workflow: SLOs/SLIs, error budgets, alerting, toil reduction, incident readiness, capacity, and balancing reliability with delivery. Use when improving production culture, defining service reliability targets, or reducing on-call pain.
Why use this skill?
Master SRE workflows, SLO/SLI definition, and error budget management. Reduce toil, improve system reliability, and balance feature velocity using expert-guided SRE frameworks.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/codekungfu/sre-practices
What This Skill Does
The sre-practices skill provides a framework for engineering reliability into production services. It is designed to move teams from reactive firefighting toward a proactive, data-driven SRE culture. The skill organizes the SRE lifecycle into six stages: defining user-centric SLIs, setting realistic SLO targets, writing clear error budget policies, optimizing alerting, managing toil, and continuous improvement. With this structure, the agent helps stakeholders balance feature velocity against system stability, so that reliability investments are prioritized by business impact and actual user experience rather than by arbitrary technical goals.
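To make the first three stages concrete, here is a minimal sketch of how an SLI, an SLO target, and the resulting error budget relate arithmetically. It assumes a request-based SLI (good events divided by total events); the names `Slo`, `sli`, and `error_budget_remaining` are hypothetical and are not part of the skill itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """Hypothetical SLO record: a target ratio over a rolling window."""
    target: float      # e.g. 0.999 means 99.9% of events must be good
    window_days: int   # length of the rolling evaluation window


def sli(good_events: int, total_events: int) -> float:
    """Request-based SLI: fraction of events that met the success criteria."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def error_budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = (1.0 - slo.target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target, or no traffic, leaves no budget to spend.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    checkout = Slo(target=0.999, window_days=30)
    # 500 failed requests out of 1,000,000 against a budget of about 1,000.
    print(sli(999_500, 1_000_000))
    print(error_budget_remaining(checkout, 999_500, 1_000_000))
```

In this example roughly half of the budget is spent: the SLO permits about 1,000 bad requests in the window and 500 have occurred. A time-based SLI (good minutes over total minutes) would follow the same arithmetic with different units.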
Installation
To integrate this skill, use the following command in your terminal:
clawhub install openclaw/skills/skills/codekungfu/sre-practices
Use Cases
This skill is ideal for:
- Engineering Managers: Seeking to implement a data-backed reliability culture that aligns the product roadmap with infrastructure stability.
- SRE/DevOps Teams: Trying to reduce on-call fatigue by auditing existing alert noise and streamlining runbook processes.
- Product Teams: Looking to understand how error budgets make reliability a shared responsibility, so that fast deployment and system uptime can coexist without burning out the team.
- Systems Architects: Assessing the feasibility of reliability goals (like 'five nines') against third-party dependency limitations.
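The feasibility question in the last use case comes down to simple arithmetic. The sketch below is illustrative only: it assumes every dependency is a hard, serial dependency and that failures are independent, which real systems rarely satisfy exactly. The function names and the example availability figures are hypothetical.

```python
import math
from typing import Iterable

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Downtime permitted by an availability target over a period."""
    return (1.0 - availability) * period_minutes


def serial_availability(dependencies: Iterable[float]) -> float:
    """Upper bound on availability when every dependency must be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(dependencies)


def target_is_feasible(target: float, dependencies: Iterable[float]) -> bool:
    """True if the dependency chain alone does not rule out the target."""
    return serial_availability(dependencies) >= target


if __name__ == "__main__":
    five_nines = 0.99999
    # Hypothetical third-party figures: DNS, payment gateway, managed database.
    third_party = [0.9999, 0.9995, 0.999]
    print(allowed_downtime_minutes(five_nines))   # a little over 5 minutes a year
    print(serial_availability(third_party))       # below every individual figure
    print(target_is_feasible(five_nines, third_party))
```

Under these assumptions the chain cannot exceed roughly 99.84% availability, so a five-nines target for the service would only be reachable by removing those dependencies from the critical path, for example with caching, fallbacks, or redundancy.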
Example Prompts
- "We are seeing high alert fatigue on our checkout service. Can you help me audit our current alerts and align them with a new SLO strategy?"
- "Help me draft an error budget policy. I want to define clear thresholds for when we must freeze feature releases to focus on technical debt."
- "Our team is struggling with high toil on manual database migration tasks. Walk me through a strategy to identify and automate these tasks to reduce our toil budget."
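The first two prompts above, alert alignment and release-freeze thresholds, can both be expressed in terms of the error budget. The following is a rough sketch, not the skill's own logic: `burn_rate` uses the common definition of observed error rate divided by the budgeted error rate, and the thresholds in `policy_action` (freeze at 0% remaining, reliability focus below 25%) are made-up placeholders that a real policy would negotiate with stakeholders.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being consumed.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window;
    higher values exhaust it proportionally sooner.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    return error_rate / budget


def budget_spent_fraction(rate: float, alert_window_hours: float,
                          slo_window_hours: float) -> float:
    """Share of the total budget consumed if `rate` holds for the alert window."""
    return rate * alert_window_hours / slo_window_hours


def policy_action(budget_remaining: float) -> str:
    """Map remaining budget (1.0 = untouched) to a hypothetical policy step."""
    if budget_remaining <= 0.0:
        return "freeze"                  # halt feature releases, fix reliability
    if budget_remaining < 0.25:
        return "prioritize-reliability"  # placeholder threshold, not a standard
    return "ship"


if __name__ == "__main__":
    rate = burn_rate(error_rate=0.0144, slo_target=0.999)
    print(rate)                                      # about 14.4
    print(budget_spent_fraction(rate, 1, 30 * 24))   # about 2% of a 30-day budget
    print(policy_action(0.10))
```

Paging on burn rate rather than on raw error counts ties each alert to user-visible budget consumption, which is one way to cut alert noise while keeping the freeze decision mechanical.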
Tips & Limitations
- Cultural Dependency: This skill is most effective when organizational leadership is committed to the shared-ownership model of reliability. It will not work if SRE is treated solely as a 'policing' function.
- Start Small: Do not attempt to instrument every microservice at once. Focus on your most critical user journeys first, measuring their golden signals (latency, traffic, errors, and saturation), to build momentum.
- Iterate on Thresholds: SLOs are not static. Use the continuous improvement stage to review your targets every quarter, as user needs and system architectures evolve.
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-codekungfu-sre-practices": {
      "enabled": true,
      "auto_update": true
    }
  }
}