qwen-tts
Local text-to-speech using Qwen3-TTS-12Hz-1.7B-CustomVoice. Use when generating audio from text, creating voice messages, or when TTS is requested. Supports 10 languages including Italian, 9 premium speaker voices, and instruction-based voice control (emotion, tone, style). Alternative to cloud-based TTS services like ElevenLabs. Runs entirely offline after initial model download.
Why use this skill?
Generate high-quality, offline text-to-speech with OpenClaw. Features 9 premium voices, 10 languages, and emotional voice control using the powerful Qwen3-TTS-12Hz-1.7B model.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/paki81/qwen-tts
What This Skill Does
The qwen-tts skill is a high-performance, local text-to-speech engine powered by the Qwen3-TTS-12Hz-1.7B-CustomVoice model. It lets OpenClaw users generate high-quality, expressive synthetic audio directly on their local hardware, bypassing expensive, privacy-invasive cloud APIs such as ElevenLabs. With support for 10 languages (including Italian, English, and Japanese) and a set of 9 distinct premium speaker voices, the skill offers fine-grained control over vocal output, including emotional nuance and stylistic delivery via instruction-based prompts.
Installation
To integrate this skill into your OpenClaw environment, run clawhub install openclaw/skills/skills/paki81/qwen-tts. Once installed, navigate to the skill directory at skills/public/qwen-tts and run bash scripts/setup.sh. This initializes a dedicated virtual environment and downloads the necessary dependencies. Note that the first time you run a speech synthesis task, the system automatically downloads roughly 1.7 GB of model weights from Hugging Face, so make sure you have sufficient disk space and a stable internet connection for this one-time setup.
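For reference, the whole setup boils down to the short shell sequence below; the skill directory path follows the layout described above, so adjust it if your OpenClaw installation differs.

# Install the skill from ClawHub
clawhub install openclaw/skills/skills/paki81/qwen-tts

# Enter the skill directory and run the setup script
cd skills/public/qwen-tts
bash scripts/setup.sh
# The first synthesis run will additionally fetch the model weights from Hugging Face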
Use Cases
This skill is perfect for creators needing local, offline voiceovers for multimedia projects. It is ideal for developers building voice-enabled applications, creating dynamic accessibility features for desktop tools, or generating interactive narrations within OpenClaw workflows. Because the model runs locally, it is suitable for sensitive data where privacy is paramount, as no audio data is transmitted to external servers.
Example Prompts
- "OpenClaw, use qwen-tts to generate an Italian audio file saying 'Benvenuto nel futuro del text-to-speech' using the Vivian voice and save it as welcome.wav."
- "Create a voice message using the Ryan speaker in English that says 'Hello, nice to meet you' with an enthusiastic and energetic tone."
- "Please list all available speakers for the qwen-tts module so I can choose the best voice for my narrations."
Tips & Limitations
For optimal results, prefer a speaker's native language, although the model is cross-lingually capable. Use the -i flag to control output style, for example 'Parla con entusiasmo' for Italian or 'Read like a narrator' for English. Since the 1.7B-parameter model runs entirely on local hardware, make sure your system has enough RAM to process requests smoothly. As an offline model, it is limited to the predefined voice library and does not perform voice cloning of arbitrary audio samples.
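The -i instruction flag is the one switch this document names explicitly; a style-controlled run, reusing the hypothetical script path from the earlier sketch, could look like this.

# Steer tone and delivery with an -i instruction prompt
# (-i is documented above; the script path and other flag names remain assumptions)
.venv/bin/python scripts/tts.py \
  --text "Benvenuto nel futuro del text-to-speech" \
  --speaker Vivian \
  -i "Parla con entusiasmo" \
  --output welcome_energico.wav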
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-paki81-qwen-tts": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags: AI
Flags: file-write, file-read, code-execution