What This Skill Does

The Pocket TTS skill by sherajdev brings real-time text-to-speech to your local machine using Kyutai’s Pocket TTS model. Unlike cloud-based alternatives, it runs entirely offline, so your text never leaves the machine and there is no network round-trip latency. It is lightweight: it runs on two CPU cores and does not require a GPU. The model ships with eight built-in voices and supports voice cloning from a reference WAV file, so generated speech can imitate a specific speaker. The skill exposes a Python API and a CLI, which makes it suitable for interactive AI agents, audio for creative projects, and accessibility features in local applications.

Installation

To get started, first accept the license agreement for the Kyutai Pocket TTS model on Hugging Face. Install the skill with the OpenClaw CLI: clawhub install openclaw/skills/skills/sherajdev/pocket-tts. Alternatively, in a standard Python environment, use pip install pocket-tts, or run it without a permanent install via uvx pocket-tts. The model downloads its weights (about 100M parameters) automatically on first run, so make sure you have a stable internet connection for the initial setup.
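After installing, it can be useful to confirm that the command-line tool is reachable before wiring it into an agent. The sketch below is a minimal check using only the Python standard library. It assumes that pip install pocket-tts places an executable named pocket-tts on your PATH; that name is taken from the install command above and is not verified here, and the check does not apply if you only run the tool through uvx.

```python
import shutil
import sys
from typing import Optional


def find_cli(name: str = "pocket-tts") -> Optional[str]:
    """Return the full path of the executable if it is on PATH, else None.

    The default name "pocket-tts" is an assumption based on the pip
    package name; adjust it if your install exposes a different command.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        print("pocket-tts not found on PATH; install it first", file=sys.stderr)
        sys.exit(1)
    print(f"pocket-tts found at {path}")
```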

Use Cases

This skill is aimed at developers building local-only AI agents that need voice output. Because it runs on CPU, it is well suited to edge devices, laptops, and servers without a GPU. Use it to give natural-sounding voice feedback for automation tasks, to narrate local media projects, or to build personalized AI personas with its voice-cloning support.

Example Prompts

"Speak the following text using the alba voice: 'System status is optimal and all services are running.'"
"Generate an audio file named briefing.wav using the javert voice with a speed of 1.1x."
"Clone my voice from recording.wav and use it to say 'Hello, how can I assist you with your tasks today?'"

Tips & Limitations

The current model (v1) is optimized for English output. Speed can be adjusted between 0.5x and 2.0x, but values close to 1.0x usually yield the most natural inflection. Voice cloning requires a valid local WAV file; use a clear, high-quality recording for best results. Because inference runs locally, performance depends on your CPU; the model typically generates audio at 2-6x real-time speed.
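The limits above can be checked before any text is sent to the model. The following is a minimal sketch using only the Python standard library; it does not call Pocket TTS itself, and the helper names (clamp_speed, describe_reference_wav, synthesis_seconds) are hypothetical, introduced here for illustration. It clamps a requested speed into the 0.5x to 2.0x range, confirms that a cloning reference is a readable PCM WAV file, and estimates synthesis time from a real-time factor.

```python
import wave
from pathlib import Path

# Speed range as stated in the tips above.
MIN_SPEED, MAX_SPEED = 0.5, 2.0


def clamp_speed(speed: float) -> float:
    """Clamp a requested speed multiplier into the 0.5x-2.0x range."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def describe_reference_wav(path: str) -> dict:
    """Open a WAV file and report its basic properties.

    Raises FileNotFoundError if the file is missing, and wave.Error if
    the standard library cannot read it as a PCM WAV file.
    """
    if not Path(path).is_file():
        raise FileNotFoundError(path)
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def synthesis_seconds(audio_seconds: float, realtime_factor: float) -> float:
    """Estimate wall-clock time to synthesize audio at a real-time factor."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor
```

For example, at the stated 2-6x real-time range, synthesis_seconds(60, 2) and synthesis_seconds(60, 6) suggest that one minute of audio would take roughly 30 to 10 seconds to generate, depending on the CPU.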
