cosyvoice3
Local text-to-speech using Alibaba's CosyVoice3 on macOS Apple Silicon. Supports Chinese, English, Japanese, Korean, and 18+ Chinese dialects. Provides zero-shot voice cloning, cross-lingual synthesis, and fine-grained control over prosody, speed, and emotion. Use when: (1) the user requests local TTS with high-quality Chinese or English voices; (2) voice cloning from reference audio is needed; (3) offline, on-device inference is required; (4) the user wants natural-sounding speech with emotion or dialect control.
Why use this skill?
High-quality local TTS for Apple Silicon. Supports 9 languages, 18+ Chinese dialects, and zero-shot voice cloning with fine-grained emotional control.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/lhuaizhong/cosyvoice3-macos
What This Skill Does
CosyVoice3 is a state-of-the-art text-to-speech (TTS) engine by Alibaba, optimized specifically for Apple Silicon (M1/M2/M3) hardware via OpenClaw. This skill provides high-fidelity, natural-sounding speech synthesis supporting 9 major languages (including English, Chinese, Japanese, and Korean) and over 18 distinct Chinese dialects. Unlike cloud-based TTS solutions, this runs entirely locally on your machine, ensuring data privacy and offline accessibility. It features powerful zero-shot voice cloning capabilities, allowing you to synthesize speech that mimics a specific person's timbre using only 3-10 seconds of reference audio. Additionally, it supports cross-lingual synthesis, meaning you can generate English speech using a Chinese voice profile, and provides fine-grained control over prosody, speed, and emotional inflection via text-based tags.
Installation
To install this skill, execute the following command in your OpenClaw environment:
clawhub install openclaw/skills/skills/lhuaizhong/cosyvoice3-macos
After installation, navigate to /Users/lhz/.openclaw/workspace/skills/cosyvoice3/scripts and run bash install.sh. The path shown appears to come from the author's machine, so replace /Users/lhz with your own home directory. The script sets up a dedicated Conda environment, configures the PyTorch dependencies for Apple Silicon, and downloads the Fun-CosyVoice3-0.5B model weights.
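Before running install.sh, it can help to confirm that the machine meets the requirements this listing states: macOS on Apple Silicon and at least 5GB of free disk space. The following is a minimal sketch of such a check, not part of the skill itself; the function names are hypothetical and the 5GB threshold is taken from the storage tip further down.

```python
import platform
import shutil

REQUIRED_FREE_GB = 5  # threshold taken from the storage tip in this listing


def is_apple_silicon(system: str, machine: str) -> bool:
    """True for macOS running natively on an arm64 (M-series) CPU.

    Note: a Python interpreter running under Rosetta reports x86_64,
    so this check can fail on Apple Silicon in that case.
    """
    return system == "Darwin" and machine == "arm64"


def has_free_space(free_bytes: int, required_gb: float = REQUIRED_FREE_GB) -> bool:
    """True if at least `required_gb` gigabytes (10**9 bytes each) are free."""
    return free_bytes >= required_gb * 10**9


def preflight(path: str = ".") -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not is_apple_silicon(platform.system(), platform.machine()):
        problems.append("not running natively on macOS Apple Silicon")
    if not has_free_space(shutil.disk_usage(path).free):
        problems.append(f"less than {REQUIRED_FREE_GB} GB free at {path}")
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("OK to install" if not issues else "; ".join(issues))
```

Pass the directory where the model weights will be stored as `path` so the disk check measures the right volume.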
Use Cases
- Professional Voice Overs: Generate high-quality narration for videos or presentations without expensive studio equipment.
- Content Localization: Easily translate and synthesize scripts into multiple languages while maintaining a consistent voice identity.
- Accessibility & Assistive Tech: Create natural, human-like voice feedback for applications or reading assistants.
- Creative AI Projects: Clone voices for character narration in games, animations, or personalized audiobooks.
Example Prompts
- "Use CosyVoice3 to narrate this article in a calm, professional tone using the default female voice."
- "Clone my voice from 'reference.wav' and read the following text: 'Hello, this is a test of my synthetic twin.'"
- "Synthesize this Chinese script into English using the voice from my saved assets, and set the speed to 1.2x."
Tips & Limitations
- Reference Audio: When performing zero-shot cloning, ensure the audio is clear, free of background noise, and 3-10 seconds long for best results.
- Tagging: Always include the <|endofprompt|> token in your reference text segments to help the model distinguish between prompt content and generated output.
- Performance: While Apple Silicon is efficient, generating very long audio clips may take time. Break large blocks of text into smaller paragraphs for faster synthesis.
- Storage: Ensure you have at least 5GB of free space before beginning the installation.
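The tips above can be checked before any synthesis call. The sketch below uses only the Python standard library and does not call CosyVoice3 itself; all helper names are hypothetical. It assumes the reference audio is an uncompressed WAV file, and it only tests whether the <|endofprompt|> token is present, since this listing does not say where in the reference text the token belongs.

```python
import re
import wave

END_OF_PROMPT = "<|endofprompt|>"


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def reference_length_ok(seconds: float, low: float = 3.0, high: float = 10.0) -> bool:
    """True if the clip falls in the 3 to 10 second range the tips recommend."""
    return low <= seconds <= high


def has_end_of_prompt(reference_text: str) -> bool:
    """True if the reference text contains the <|endofprompt|> token."""
    return END_OF_PROMPT in reference_text


def split_into_paragraphs(text: str) -> list[str]:
    """Split long text on blank lines so each piece can be synthesized separately."""
    return [part.strip() for part in re.split(r"\n\s*\n", text) if part.strip()]
```

A caller might reject a reference clip when `reference_length_ok(wav_duration_seconds(path))` is false, then loop over `split_into_paragraphs(script)` and synthesize one paragraph at a time.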
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-lhuaizhong-cosyvoice3-macos": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: file-read, file-write, code-execution
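If clawhub.json already contains other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the plugin entry into an existing configuration instead. It assumes only the JSON shape shown in this listing; the function names are hypothetical, and the location of clawhub.json is not stated here, so the path is left as a parameter.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-lhuaizhong-cosyvoice3-macos"  # key from the snippet in this listing


def enable_plugin(config: dict, plugin_id: str = PLUGIN_ID) -> dict:
    """Return a copy of `config` with the plugin entry added or overwritten."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


def update_config_file(path: str) -> None:
    """Read the JSON file at `path` (or start empty), merge, and write it back."""
    file = Path(path)
    config = json.loads(file.read_text()) if file.exists() else {}
    file.write_text(json.dumps(enable_plugin(config), indent=2) + "\n")
```

Existing entries under "plugins" are preserved; only the CosyVoice3 entry is added or replaced.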