Pocket-TTS

Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air.

Why use this skill?

Integrate lightweight, CPU-friendly voice cloning and streaming speech synthesis into your AI agents with Pocket-TTS. No GPU required.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill

What This Skill Does

Pocket-TTS is a lightweight text-to-speech engine optimized for CPU execution. Unlike resource-heavy models that require GPUs, this skill uses Kyutai's Pocket TTS model to generate audio faster than real time on a CPU. It supports voice cloning from short audio samples (3-10 seconds) and produces 24 kHz mono WAV output. Its core strength is low-latency streaming (~200 ms to the first chunk), which makes it suitable for interactive AI agents or real-time synthesis on hardware such as modern MacBooks or standard cloud instances.
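Since cloning expects a short reference clip, it can help to check a recording before handing it to the skill. The following is a minimal sketch using only Python's standard wave module; the helper names describe_wav and is_usable_voice_sample are hypothetical and not part of Pocket-TTS, and the 3-10 second window is simply the range quoted above.

```python
import wave

MIN_SECONDS = 3.0   # lower bound quoted above for cloning samples
MAX_SECONDS = 10.0  # upper bound quoted above for cloning samples


def describe_wav(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        channels = wav.getnchannels()
    return frames / rate, rate, channels


def is_usable_voice_sample(path):
    """True when the clip length falls inside the 3-10 second window."""
    duration, _rate, _channels = describe_wav(path)
    return MIN_SECONDS <= duration <= MAX_SECONDS
```

The same describe_wav helper can be pointed at a generated file to confirm that the output is 24 kHz mono, as described above.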

Installation

To integrate this skill into your environment, use the OpenClaw management command:

clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill

Ensure you have the underlying Python dependencies by installing the package via pip or uv:

pip install pocket-tts
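To confirm that the Python package is visible to the interpreter your agent runs under, a quick import check can be used. This is a sketch only: the import name pocket_tts is an assumption derived from the pip package name, and is_installed is a hypothetical helper.

```python
import importlib.util


def is_installed(module_name="pocket_tts"):
    """Return True if the named top-level module can be found on sys.path.

    The default "pocket_tts" is an assumed import name for the
    pocket-tts package; adjust it if the package exposes a different one.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    print("pocket_tts importable:", is_installed())
```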

Use Cases

Real-time AI Interaction: Add a voice layer to your OpenClaw agents without incurring GPU costs.
Content Creation: Batch generate voiceovers for local media projects or prototypes.
Voice Cloning Experiments: Quickly prototype custom brand voices using short, existing audio clips.
Streaming Services: Implement real-time text-to-audio feedback loops in a local server environment.
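For batch work, a rough time budget follows from the real-time factor: generation time is approximately audio duration divided by that factor. The sketch below assumes the ~6x figure quoted for an M4 MacBook Air; other machines will differ, so treat the default as a placeholder. The function names are hypothetical.

```python
def estimate_generation_seconds(audio_seconds, realtime_factor=6.0):
    """Approximate wall-clock seconds needed to synthesize a clip.

    realtime_factor=6.0 is the M4 MacBook Air figure quoted in the
    description; it is an assumption for any other hardware.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def estimate_batch_seconds(clip_lengths, realtime_factor=6.0):
    """Sum the per-clip estimates for a batch of voiceovers."""
    return sum(estimate_generation_seconds(s, realtime_factor) for s in clip_lengths)
```

Under that assumption, three voiceovers of 30, 60, and 90 seconds (180 seconds of audio) would take roughly 30 seconds to generate.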

Example Prompts

"Use the Pocket-TTS skill to generate an audio file saying 'Welcome to the system' using the voice profile stored at ./my_voice.wav."
"Stream the following text: 'The report is ready for review' using the default voice and save the output chunks to the current directory."
"Convert my local voice recording into a .safetensors embedding using Pocket-TTS to speed up future generation tasks."

Tips & Limitations

Optimization: For faster startup times, always pre-export your voice prompts into .safetensors format rather than raw .wav or .mp3 files.
Language Support: The model is English-only. Do not use it for multilingual synthesis, as results for other languages are unpredictable.
Quality Tuning: Adjust the temperature (0.5-1.0) and LSD decode steps (1-5) to trade generation speed against audio fidelity. Fewer decode steps favor raw speed, while more steps improve clarity.
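The tuning ranges above can be enforced before a request reaches the skill. This is a minimal sketch: GenerationSettings, its field names, and its default values are hypothetical and do not reflect Pocket-TTS's own parameter names or defaults; only the ranges come from the tip above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the two tuning knobs mentioned above."""

    temperature: float = 0.7  # illustrative default, not Pocket-TTS's
    decode_steps: int = 1     # illustrative default, not Pocket-TTS's

    def __post_init__(self):
        # Reject values outside the ranges quoted in the tips.
        if not 0.5 <= self.temperature <= 1.0:
            raise ValueError("temperature must be between 0.5 and 1.0")
        if not 1 <= self.decode_steps <= 5:
            raise ValueError("decode_steps must be between 1 and 5")


# A speed-oriented and a quality-oriented preset.
FAST = GenerationSettings(temperature=0.7, decode_steps=1)
CLEAR = GenerationSettings(temperature=0.7, decode_steps=5)
```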

Metadata

Author: @leonaaardob

Stars: 1656

Updated: 2026-02-28

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-leonaaardob-lb-pocket-tts-skill": {
      "enabled": true,
      "auto_update": true
    }
  }
}
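If clawhub.json already contains other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the entry into existing JSON text using Python's json module; enable_plugin is a hypothetical helper, and it assumes the file follows the "plugins" layout shown above.

```python
import json

PLUGIN_ID = "official-leonaaardob-lb-pocket-tts-skill"


def enable_plugin(config_text):
    """Return config JSON text with the Pocket-TTS plugin entry added.

    Existing plugin entries are preserved; empty input is treated as
    an empty configuration.
    """
    config = json.loads(config_text) if config_text.strip() else {}
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    return json.dumps(config, indent=2)
```

A typical use is to read the file's contents, pass them through enable_plugin, and write the result back.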
