What This Skill Does

The vibevoice skill provides high-fidelity, local Text-to-Speech (TTS) for the OpenClaw AI agent and is optimized for generating Spanish audio. Powered by Microsoft's VibeVoice model, it converts text strings into natural-sounding voice files. It is designed for WhatsApp-native voice messaging and produces Opus-encoded .ogg files. Unlike cloud-based APIs, vibevoice runs entirely offline, which keeps your text private, avoids network latency and per-request costs, and keeps the skill available without an internet connection. Adjustable speed settings and multiple voice profiles let you match the output to a specific persona or conversational context.
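
As a rough illustration of the Opus-encoded .ogg output described above, the sketch below builds an FFmpeg command that converts a WAV file to mono Opus in an Ogg container. The function name `build_opus_command`, the file name `reply.wav`, and the 32k bitrate are hypothetical choices made for this example; the skill's actual encoding pipeline is not documented here and may differ.

```python
from pathlib import Path


def build_opus_command(wav_path: str, bitrate: str = "32k") -> list[str]:
    """Build an FFmpeg command that encodes a WAV file as mono Opus in Ogg.

    The default bitrate is an illustrative assumption, not a documented
    vibevoice setting.
    """
    src = Path(wav_path)
    dst = src.with_suffix(".ogg")  # same name, .ogg extension
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", str(src),    # input WAV file
        "-ac", "1",        # downmix to mono, typical for voice notes
        "-c:a", "libopus", # encode audio with the Opus codec
        "-b:a", bitrate,   # target audio bitrate
        str(dst),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so FFmpeg is not required here.
    print(" ".join(build_opus_command("reply.wav")))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed with libopus support.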

Installation

To integrate this skill into your local environment, you need an NVIDIA GPU with approximately 2 GB of dedicated VRAM, Python 3.10 or higher, and FFmpeg. Installation goes through the ClawHub platform. Run the following command in your terminal:

clawhub install openclaw/skills/skills/javier887/vibevoice

Make sure the VibeVoice repository is cloned into your home directory at ~/VibeVoice. The installer automatically configures the dependencies, including the PyTorch and Torchaudio libraries required for real-time model inference.
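
Before installing, it can help to confirm the prerequisites listed above. The following is a minimal sketch, not part of the skill itself: `check_prerequisites` is a hypothetical helper, and looking for `nvidia-smi` on the PATH is only a rough proxy for an NVIDIA GPU being present (it does not verify the amount of VRAM).

```python
import shutil
import sys
from pathlib import Path


def check_prerequisites(repo_dir: Path = Path.home() / "VibeVoice") -> dict[str, bool]:
    """Return a pass/fail map for the requirements named in this section."""
    return {
        # Python 3.10 or higher
        "python>=3.10": sys.version_info >= (3, 10),
        # FFmpeg must be callable from the terminal
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        # Assumption: nvidia-smi being available suggests an NVIDIA driver is installed
        "nvidia-smi on PATH": shutil.which("nvidia-smi") is not None,
        # The VibeVoice repository is expected at ~/VibeVoice
        "VibeVoice repo present": repo_dir.is_dir(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'} {name}")
```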

Use Cases

This skill is ideal for personal assistants that need to maintain a human-like presence on messaging platforms. Key use cases include:

Automating responses to WhatsApp voice notes by replying in the user's preferred language and tone.
Creating accessibility features for visually impaired users by converting documentation or text summaries into audio.
Generating localized Spanish-language notifications or alerts that sound natural rather than robotic.
Enhancing productivity by converting long-form text documents into audio for on-the-go listening.

Example Prompts

"Translate this text to Spanish and send it as a voice note to my friend Juan via WhatsApp: 'I will be there in ten minutes.'"
"Read the summary of this report and generate a voice file with a 1.2x speed setting."
"Send an audio response to the last message from Maria using the default Spanish male voice."

Tips & Limitations

Performance: The model achieves a real-time factor (RTF) of 0.24, meaning a 60-second message generates in about 15 seconds (60 × 0.24 ≈ 14.4 s). Expect an initialization delay of about 10 seconds on the first launch.
Content Length: For optimal quality, limit individual text inputs to 1500 characters. Longer texts should be broken into chunks to avoid audio artifacts.
Audio Rules: Adhere to social etiquette by only sending voice messages when requested or when responding to existing audio threads.
Storage: Files are saved to temporary directories by default; ensure your system permissions allow file-write access for the script directory.
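
The length and performance tips above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the skill's own implementation: `chunk_text` and `estimate_seconds` are hypothetical helpers, the sentence-boundary splitting is one possible chunking strategy, and the constants simply restate the figures quoted in this section.

```python
import re

MAX_CHARS = 1500        # recommended input limit from the tips above
RTF = 0.24              # real-time factor quoted above
WARMUP_SECONDS = 10.0   # approximate first-launch initialization delay


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that exceeds the limit on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_seconds(audio_seconds: float, first_run: bool = False) -> float:
    """Estimate generation time as audio length times RTF, plus warm-up if any."""
    return audio_seconds * RTF + (WARMUP_SECONDS if first_run else 0.0)


if __name__ == "__main__":
    parts = chunk_text("Hola. " * 600)
    print(len(parts), [len(p) for p in parts])
    print(round(estimate_seconds(60), 1), round(estimate_seconds(60, first_run=True), 1))
```

Splitting on sentence boundaries keeps each chunk ending at a natural pause, which is a reasonable default when the chunks are synthesized separately and sent as consecutive voice notes.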
