Official Verified developer tools Safety 4/5

kafka-observability

Kafka monitoring and observability expert for Prometheus, Grafana, and JMX metrics. Use when setting up Kafka monitoring, configuring alerting rules, or building performance dashboards.

Why use this skill?

Master Kafka monitoring with automated JMX exporter setup, pre-built Grafana dashboards, and Prometheus configuration for cluster health, performance tracking, and consumer lag.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/anton-abyzov/sw-kafka-observability

What This Skill Does

The kafka-observability skill acts as a specialized assistant for monitoring Apache Kafka clusters. It provides a structured approach to setting up the JMX Exporter, configuring Prometheus scraping, and deploying production-ready Grafana dashboards. By automating the integration of JVM-level performance data with Kafka-specific throughput metrics, it enables teams to move from reactive troubleshooting to proactive cluster management. The skill encapsulates best-practice configurations for heap, garbage collection, and thread monitoring alongside critical broker-level health metrics like replica lag and controller status.

Installation

To integrate this skill into your environment, use the OpenClaw CLI or interface:

clawhub install openclaw/skills/skills/anton-abyzov/sw-kafka-observability

Once installed, verify that the monitoring component path plugins/specweave-kafka/monitoring/ exists. For self-hosted deployments, set KAFKA_OPTS in your Kafka service environment as specified in the plugin documentation so that the JMX Prometheus agent attaches to the broker JVM. If using Kubernetes, mount the provided kafka-jmx-exporter.yml config file into your JMX exporter sidecar container.
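As a rough illustration of the KAFKA_OPTS step, the sketch below composes a value using the general `-javaagent:<jar>=<port>:<config>` form accepted by the JMX Prometheus Java agent. The jar location, the port 7071, and the config directory are assumptions made for this example; the plugin documentation remains the reference for the exact value.

```python
def build_kafka_opts(agent_jar: str, port: int, config_path: str,
                     existing_opts: str = "") -> str:
    """Compose a KAFKA_OPTS value that attaches the JMX Prometheus Java agent.

    All paths and the port are supplied by the caller; nothing here is taken
    from the plugin itself.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    # General agent syntax: -javaagent:<jar>=<port>:<config file>
    agent_flag = f"-javaagent:{agent_jar}={port}:{config_path}"
    # Preserve any options already present in KAFKA_OPTS.
    return f"{existing_opts} {agent_flag}".strip()


if __name__ == "__main__":
    # Hypothetical install locations, for illustration only.
    opts = build_kafka_opts(
        agent_jar="/opt/jmx/jmx_prometheus_javaagent.jar",
        port=7071,
        config_path="/opt/jmx/kafka-jmx-exporter.yml",
    )
    print(f"KAFKA_OPTS={opts}")
```

The printed line can be pasted into a systemd unit's Environment= setting or an environment file, adjusted to your actual paths.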

Use Cases

Proactive Cluster Health: Monitor controller elections and under-replicated partitions before they trigger production outages.
Performance Tuning: Track request latencies and produce/fetch rates to identify bottlenecks in your broker network.
Capacity Planning: Use historical consumer lag and throughput metrics to determine when to scale partitions or add new broker nodes.
Operational Troubleshooting: Quickly correlate JVM garbage collection pauses with periodic latency spikes in message processing.
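For the cluster-health use case, a minimal sketch of checking under-replicated partitions through the Prometheus HTTP API might look like the following. The metric name kafka_server_replicamanager_underreplicatedpartitions is an assumption: the actual name depends on the rules in your JMX exporter config. The Prometheus base URL is likewise a placeholder.

```python
import json
from typing import Dict
from urllib.parse import urlencode

# Assumed metric name; adjust to match your exporter's rename rules.
URP_QUERY = "sum by (instance) (kafka_server_replicamanager_underreplicatedpartitions)"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def brokers_with_urp(response_body: str) -> Dict[str, float]:
    """Return {instance: count} for brokers reporting under-replicated partitions.

    Expects the JSON body of an instant query whose result type is a vector.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    flagged: Dict[str, float] = {}
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "unknown")
        _timestamp, value = sample["value"]  # value arrives as a string
        count = float(value)
        if count > 0:
            flagged[instance] = count
    return flagged


if __name__ == "__main__":
    # Placeholder Prometheus address, for illustration only.
    print(instant_query_url("http://localhost:9090", URP_QUERY))
```

Fetching the URL with any HTTP client and passing the body to brokers_with_urp gives the brokers worth investigating; the same PromQL expression compared against zero could serve as the basis of an alerting rule.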

Example Prompts

"I need to monitor under-replicated partitions; can you show me the Prometheus query for the under-replicated partition count and how to alert on it?"
"Help me set up the Kafka-JMX exporter on our bare-metal cluster; what environment variables do I need to add to the systemd script?"
"My consumers are slowing down. How can I use the kafka-consumer-lag dashboard to identify if it is a consumer processing issue or a broker throughput bottleneck?"

Tips & Limitations

Memory Overhead: Be aware that attaching the Java agent adds a small memory overhead to your Kafka process; ensure your heap sizes are configured to account for this.
Dashboards: The provided Grafana dashboards expect the Prometheus datasource to be named 'Prometheus'. Adjust your data source configurations if your setup uses a different alias.
Alerting: The dashboards provide visualization only. You must separately define alerting rules in Prometheus and configure Alertmanager routing for PagerDuty or Slack notifications.

Metadata

Author: @anton-abyzov

Stars: 1054

Updated: 2026-02-16

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-anton-abyzov-sw-kafka-observability": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#kafka #prometheus #grafana #observability #devops

Safety Score: 4/5

Flags: file-read, file-write, code-execution
