Pocket-TTS
Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air.
Why use this skill?
Integrate lightweight, CPU-friendly voice cloning and streaming speech synthesis into your AI agents with Pocket-TTS. No GPU required.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill
What This Skill Does
Pocket-TTS is a lightweight text-to-speech engine optimized for CPU execution. Unlike resource-heavy models that require GPUs, this skill uses the Kyutai Pocket TTS model to generate audio faster than real time (about 6x on an M4 MacBook Air). It supports voice cloning from short audio samples (3-10 seconds) and produces 24 kHz mono WAV output. Its core strength is low-latency streaming (about 200 ms to the first chunk), which makes it well suited to interactive AI agents and real-time synthesis on hardware like modern MacBooks or standard cloud instances.
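As a rough illustration of consuming streamed audio, the sketch below writes chunks to a 24 kHz mono WAV file as they arrive, using only Python's standard wave module. It does not call Pocket-TTS: fake_chunks is a hypothetical stand-in for whatever chunk iterator the library exposes, and the 16-bit PCM sample width is an assumption.

```python
import math
import struct
import wave

SAMPLE_RATE = 24_000  # 24 kHz mono, as described above


def fake_chunks(seconds=1.0, chunk_ms=200, freq=440.0):
    """Hypothetical stand-in for a TTS stream: yields 16-bit PCM chunks of a sine tone."""
    total = int(SAMPLE_RATE * seconds)
    per_chunk = SAMPLE_RATE * chunk_ms // 1000
    for start in range(0, total, per_chunk):
        stop = min(start + per_chunk, total)
        samples = (
            int(0.3 * 32767 * math.sin(2 * math.pi * freq * n / SAMPLE_RATE))
            for n in range(start, stop)
        )
        yield b"".join(struct.pack("<h", s) for s in samples)


def write_stream_to_wav(chunks, path):
    """Write each chunk to a mono WAV file as it arrives; return the frame count."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # assumption: 16-bit PCM samples
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // 2
    return frames
```

Calling write_stream_to_wav(fake_chunks(), "out.wav") produces a one-second test tone. In real use the generator would be replaced by the library's own streaming output, converted to the same PCM layout.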
Installation
To integrate this skill into your environment, use the OpenClaw management command:
clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill
Ensure you have the underlying Python dependencies by installing the package via pip or uv:
pip install pocket-tts
Use Cases
- Real-time AI Interaction: Add a voice layer to your OpenClaw agents without incurring GPU costs.
- Content Creation: Batch generate voiceovers for local media projects or prototypes.
- Voice Cloning Experiments: Quickly prototype custom brand voices using short, existing audio clips.
- Streaming Services: Implement real-time text-to-audio feedback loops in a local server environment.
Example Prompts
- "Use the Pocket-TTS skill to generate an audio file saying 'Welcome to the system' using the voice profile stored at ./my_voice.wav."
- "Stream the following text: 'The report is ready for review' using the default voice and save the output chunks to the current directory."
- "Convert my local voice recording into a .safetensors embedding using Pocket-TTS to speed up future generation tasks."
Tips & Limitations
- Optimization: For faster startup times, pre-export your voice prompts into .safetensors format rather than loading raw .wav or .mp3 files.
- Language Support: Currently, this model is strictly English-only. Do not attempt to use it for multilingual synthesis as results may be unpredictable.
- Quality Tuning: Adjust the temperature (0.5-1.0) and LSD decode steps (1-5) to balance between generation speed and audio fidelity. Lowering decode steps is excellent for raw speed, while higher values improve clarity.
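The tuning ranges above can be captured in a small helper that rejects out-of-range values before a generation call. This is only a sketch: GenerationSettings, temperature, and lsd_decode_steps are illustrative names rather than confirmed Pocket-TTS parameters, and the defaults are arbitrary picks inside the stated ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container for the two tuning knobs described above."""

    temperature: float = 0.7   # arbitrary default inside 0.5-1.0
    lsd_decode_steps: int = 1  # arbitrary default inside 1-5

    def __post_init__(self):
        # Reject values outside the ranges given in the tip above.
        if not 0.5 <= self.temperature <= 1.0:
            raise ValueError(f"temperature {self.temperature} is outside 0.5-1.0")
        if not 1 <= self.lsd_decode_steps <= 5:
            raise ValueError(f"lsd_decode_steps {self.lsd_decode_steps} is outside 1-5")


fast = GenerationSettings(lsd_decode_steps=1)   # fewer steps: favors speed
clear = GenerationSettings(lsd_decode_steps=5)  # more steps: favors clarity
```

Keeping the two presets side by side makes the speed and clarity trade-off explicit when comparing outputs.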
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-leonaaardob-lb-pocket-tts-skill": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags
Flags: file-read, file-write
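If a clawhub.json file already holds other settings, the plugin entry given earlier can be merged in rather than pasted over the file. The sketch below assumes clawhub.json is ordinary JSON; enable_plugin is a hypothetical helper, not part of any ClawHub tooling.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-leonaaardob-lb-pocket-tts-skill"


def enable_plugin(config_path, plugin_id=PLUGIN_ID):
    """Add or update the plugin entry, keeping any other settings in the file."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling enable_plugin("clawhub.json") creates the file if it is missing and leaves unrelated plugin entries untouched.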
Related Skills
narrator-ai-cli
Create AI-narrated film/drama commentary videos via CLI. Two workflow paths (Original & Adapted narration), 100+ movies, 146 BGM tracks, 63 dubbing voices in 11 languages, 90+ narration templates. Use when creating narration videos, film commentary, short drama dubbing, or video production.
podcast-agent
Search articles on any topic, generate a two-host dialogue script, and synthesize podcast audio via TTS. Turn long reads into listenable content.
ym-mediatoolkit
Streaming video processing toolkit - compression, cover extraction, and audio conversion without downloading the full video.
video-producer
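One-click short video generation skill, v2.2. Calls video-director for shot planning, then generates AI assets, TTS voiceover, and the video render, and outputs a complete MP4.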
ressemble
Text-to-Speech and Speech-to-Text integration using Resemble AI HTTP API.