What This Skill Does

The 'speak' skill integrates Kokoro local TTS into the OpenClaw ecosystem, enabling high-quality, fast text-to-speech synthesis directly on your local machine. Using the Kokoro-TTS engine, it converts text input (raw strings, text files, or documents such as EPUB and PDF) into natural-sounding audio. Because processing happens locally, it avoids the network latency and privacy concerns of cloud-based APIs. It supports multiple languages, including English (US and GB), Chinese, Japanese, French, and Italian, and provides voice blending, playback speed adjustment, and direct streaming that plays audio without writing files to disk.

Installation

To begin using this skill, ensure you have the required environment set up. First, install the skill via the OpenClaw registry:

    clawhub install openclaw/skills/skills/babysor/speak1

Next, install the Kokoro-TTS engine via uv:

    uv tool install kokoro-tts

Finally, ensure the necessary model weights are present in your working directory. Download the model files (kokoro-v1.0.onnx and voices-v1.0.bin) using the following commands:

    wget https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/kokoro-v1.0.onnx
    wget https://github.com/nazdridoy/kokoro-tts/releases/download/v1.0.0/voices-v1.0.bin
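
Before the first run, it can help to confirm that both model files are actually in the directory the agent will run from. The following is a minimal sketch of such a check; the helper name missing_model_files is hypothetical and not part of the skill or of kokoro-tts, and only the two file names come from the download step.

```python
from pathlib import Path

# File names taken from the download step of this skill.
MODEL_FILES = ("kokoro-v1.0.onnx", "voices-v1.0.bin")


def missing_model_files(directory="."):
    """Return the names of model files that are absent from `directory`.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    base = Path(directory)
    return [name for name in MODEL_FILES if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All model files are present.")
```

Passing a directory explicitly, for example missing_model_files("/path/to/models"), checks a location other than the current working directory.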

Use Cases

Accessibility: Convert reading materials, articles, or notes into audio files for users who prefer auditory learning or require assistance.
Content Creation: Generate voiceovers for videos or presentations by providing scripts directly to the skill.
Document Consumption: Listen to long-form content, such as e-books or research papers, while multitasking.
Interactive Feedback: Enhance OpenClaw agent responses by having the AI "speak" its findings rather than just displaying text.

Example Prompts

"Read this README file aloud using the af_sarah voice so I can listen to it while I walk."
"Convert my article.txt file into an audio recording, but make the pace 1.2x faster."
"Please generate a spoken response to my recent message using a blend of 60% af_sarah and 40% am_adam for a unique sound."

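The prompts above map onto a small set of engine options. As a rough sketch, an agent might translate them into a kokoro-tts command line as shown below. The helper names blend_spec and build_command and the file names article.wav and message.txt are hypothetical; the --voice, --speed, and --stream options and the "voice:weight,voice:weight" blend syntax are assumed to match your installed kokoro-tts version, so confirm them with kokoro-tts --help first. The sketch only assembles the command and does not run it.

```python
import shlex


def blend_spec(weights):
    """Format {"af_sarah": 60, "am_adam": 40} as "af_sarah:60,am_adam:40"."""
    return ",".join(f"{voice}:{weight}" for voice, weight in weights.items())


def build_command(input_path, output_path=None, voice=None, speed=None, stream=False):
    """Assemble a kokoro-tts argument list without executing it.

    The option names are assumptions about the kokoro-tts CLI.
    """
    cmd = ["kokoro-tts", input_path]
    if output_path and not stream:
        # When streaming, no output file is written, so the path is skipped.
        cmd.append(output_path)
    if voice:
        cmd += ["--voice", voice]
    if speed is not None:
        cmd += ["--speed", str(speed)]
    if stream:
        cmd.append("--stream")
    return cmd


if __name__ == "__main__":
    # "Convert my article.txt file ... 1.2x faster."
    faster = build_command("article.txt", "article.wav", speed=1.2)
    # "... a blend of 60% af_sarah and 40% am_adam ..."
    blended = build_command(
        "message.txt",
        voice=blend_spec({"af_sarah": 60, "am_adam": 40}),
        stream=True,
    )
    for cmd in (faster, blended):
        print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps file names with spaces intact if the list is later passed to subprocess.run.
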
Tips & Limitations

Resource Usage: Because this performs local inference, ensure your machine has adequate CPU/GPU headroom for smooth, real-time playback.
Voice Selection: Always verify that the voice identifier used is compatible with the target language. For instance, using a Mandarin voice for English text may lead to unpredictable pronunciation.
Environment: The model files must be in the current working directory. If your agent runs from a different directory, provide absolute paths to the weights or make their location available through your environment configuration.
Streaming: Use the --stream flag if you only need temporary audio feedback, as it reduces storage clutter and skips unnecessary disk write operations.
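
Two of the tips above lend themselves to a quick pre-flight check. The sketch below is illustrative only. It assumes that the first letter of a Kokoro voice ID marks its language (for example, af_sarah is a US English voice), and that the installed kokoro-tts accepts --model and --voices options for explicit file locations. Both are assumptions to verify against kokoro-tts --help and its voice listing. The helper names voice_languages and with_model_paths are hypothetical.

```python
from pathlib import Path

# Assumed naming convention: the first letter of a voice ID marks its language.
VOICE_LANGUAGE = {
    "a": "English (US)",
    "b": "English (GB)",
    "z": "Chinese",
    "j": "Japanese",
    "f": "French",
    "i": "Italian",
}


def voice_languages(voice_spec):
    """Return the languages implied by a voice ID or a blend.

    Accepts "af_sarah" as well as "af_sarah:60,am_adam:40".
    """
    ids = [part.split(":")[0].strip() for part in voice_spec.split(",")]
    return {VOICE_LANGUAGE.get(voice_id[:1], "unknown") for voice_id in ids}


def with_model_paths(cmd, model_dir):
    """Append absolute model paths so the command works from any directory.

    The --model and --voices option names are assumptions about the CLI.
    """
    base = Path(model_dir).resolve()
    return cmd + [
        "--model", str(base / "kokoro-v1.0.onnx"),
        "--voices", str(base / "voices-v1.0.bin"),
    ]


if __name__ == "__main__":
    print(voice_languages("af_sarah:60,am_adam:40"))
    print(with_model_paths(["kokoro-tts", "article.txt", "--stream"], "."))
```

A blend whose voices resolve to more than one language is a signal to double-check the voice choice before synthesis.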
