Official Verified media Safety 4/5

Pocket-TTS

Generate speech from text using Kyutai Pocket TTS - lightweight, CPU-friendly, streaming TTS with voice cloning. English only. ~6x real-time on M4 MacBook Air.

Why use this skill?

Integrate lightweight, CPU-friendly voice cloning and streaming speech synthesis into your AI agents with Pocket-TTS. No GPU required.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill

What This Skill Does

Pocket-TTS is a lightweight text-to-speech engine optimized for CPU execution. Unlike resource-heavy models that require GPUs, this skill builds on the efficient Kyutai architecture to start producing audio quickly. It supports voice cloning from short audio samples (3-10 seconds) and produces 24kHz mono WAV output. Its core strength is low-latency streaming (~200ms to the first chunk), which makes it well suited to interactive AI agents or real-time synthesis tasks on hardware such as modern MacBooks or standard cloud instances.
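As a rough illustration of the streaming output described above, the sketch below appends audio chunks to a 24 kHz mono WAV file as they arrive, using only the Python standard library. It does not call Pocket-TTS: `fake_chunks` is a hypothetical stand-in that yields a test tone, and the 16-bit sample width is an assumption, since the listing states only the sample rate and channel count.

```python
import math
import struct
import wave
from typing import Iterable, Iterator

SAMPLE_RATE = 24_000  # 24 kHz mono, as stated in the description
SAMPLE_WIDTH = 2      # assumption: 16-bit PCM (not stated in the listing)


def fake_chunks(seconds: float = 1.0, chunk_ms: int = 80) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS source: yields PCM chunks of a 440 Hz tone."""
    total = int(SAMPLE_RATE * seconds)
    per_chunk = SAMPLE_RATE * chunk_ms // 1000
    for start in range(0, total, per_chunk):
        n = min(per_chunk, total - start)
        samples = [
            int(0.3 * 32767 * math.sin(2 * math.pi * 440 * (start + i) / SAMPLE_RATE))
            for i in range(n)
        ]
        yield struct.pack(f"<{n}h", *samples)


def write_stream(path: str, chunks: Iterable[bytes]) -> float:
    """Write chunks to a WAV file as they arrive and return the duration in seconds."""
    frames = 0
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)
            frames += len(chunk) // SAMPLE_WIDTH
    return frames / SAMPLE_RATE


if __name__ == "__main__":
    print("seconds written:", write_stream("stream_demo.wav", fake_chunks(1.0)))
```

In a real integration, the generator would be replaced by whatever chunk source the skill exposes; the writer itself only depends on receiving raw PCM bytes.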

Installation

To integrate this skill into your environment, use the OpenClaw management command:

clawhub install openclaw/skills/skills/leonaaardob/lb-pocket-tts-skill

Make sure the underlying Python dependencies are present by installing the package via pip or uv:

pip install pocket-tts
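To confirm that the Python dependency is visible to the interpreter your agent uses, a small check like the following may help. It is a sketch under one assumption: that the `pocket-tts` package is importable as `pocket_tts`, which the listing does not state.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "pocket_tts" is an assumed import name for the pocket-tts package.
    print("pocket_tts available:", is_installed("pocket_tts"))
```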

Use Cases

  • Real-time AI Interaction: Add a voice layer to your OpenClaw agents without incurring GPU costs.
  • Content Creation: Batch generate voiceovers for local media projects or prototypes.
  • Voice Cloning Experiments: Quickly prototype custom brand voices using short, existing audio clips.
  • Streaming Services: Implement real-time text-to-audio feedback loops in a local server environment.

Example Prompts

  1. "Use the Pocket-TTS skill to generate an audio file saying 'Welcome to the system' using the voice profile stored at ./my_voice.wav."
  2. "Stream the following text: 'The report is ready for review' using the default voice and save the output chunks to the current directory."
  3. "Convert my local voice recording into a .safetensors embedding using Pocket-TTS to speed up future generation tasks."
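Before handing a recording such as ./my_voice.wav to the skill for cloning, it can be useful to check that the clip falls in the 3-10 second range mentioned above. The sketch below uses the standard `wave` module, so it only reads uncompressed PCM WAV files. The function names are illustrative, and treating the range as a hard limit is an assumption rather than documented behavior.

```python
import wave


def clip_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_voice_clip(path: str, lo: float = 3.0, hi: float = 10.0) -> bool:
    """Check the clip against the 3-10 second guideline for voice samples."""
    return lo <= clip_seconds(path) <= hi
```

For example, `is_usable_voice_clip("./my_voice.wav")` returns False for a one-second clip, which signals that a longer sample should be recorded first.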

Tips & Limitations

  • Optimization: For faster startup times, always pre-export your voice prompts into .safetensors format rather than raw .wav or .mp3 files.
  • Language Support: Currently, this model is strictly English-only. Do not attempt to use it for multilingual synthesis as results may be unpredictable.
  • Quality Tuning: Adjust the temperature (0.5-1.0) and LSD decode steps (1-5) to trade generation speed against audio fidelity. Fewer decode steps generate audio faster, while more steps improve clarity.
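The tuning ranges above can be enforced before a request reaches the model. The following is a minimal sketch: `GenerationSettings` and its field names are hypothetical and not part of Pocket-TTS; only the numeric ranges come from the tips above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical container that validates the ranges given in the tips."""

    temperature: float     # suggested range: 0.5-1.0
    lsd_decode_steps: int  # suggested range: 1-5

    def __post_init__(self) -> None:
        if not 0.5 <= self.temperature <= 1.0:
            raise ValueError(f"temperature {self.temperature} is outside 0.5-1.0")
        if not 1 <= self.lsd_decode_steps <= 5:
            raise ValueError(f"lsd_decode_steps {self.lsd_decode_steps} is outside 1-5")


# Fewer steps favor speed; more steps favor clarity.
fast = GenerationSettings(temperature=0.7, lsd_decode_steps=1)
clear = GenerationSettings(temperature=0.7, lsd_decode_steps=5)
```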

Metadata

Stars: 1656
Views: 1
Updated: 2026-02-28

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-leonaaardob-lb-pocket-tts-skill": {
      "enabled": true,
      "auto_update": true
    }
  }
}
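If clawhub.json already lists other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the entry programmatically instead. It assumes that clawhub.json sits in the current directory and is plain JSON with a top-level "plugins" object, as in the snippet above; the helper names are illustrative.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-leonaaardob-lb-pocket-tts-skill"


def enable_plugin(config: dict, plugin_id: str = PLUGIN_ID) -> dict:
    """Return a copy of config with the plugin entry added, keeping other plugins."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


def update_file(path: str = "clawhub.json") -> None:
    """Read the config if it exists, add the plugin entry, and write it back."""
    p = Path(path)
    config = json.loads(p.read_text(encoding="utf-8")) if p.exists() else {}
    p.write_text(json.dumps(enable_plugin(config), indent=2) + "\n", encoding="utf-8")
```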

Tags

#kyutai #text-to-speech #tts #cpu #streaming #voice-cloning
Safety Score: 4/5

Flags: file-read, file-write