qwen-tts
Local text-to-speech using Qwen3-TTS-12Hz-1.7B-CustomVoice. Use when generating audio from text, creating voice messages, or when TTS is requested. Supports 10 languages including Italian, 9 premium speaker voices, and instruction-based voice control (emotion, tone, style). Alternative to cloud-based TTS services like ElevenLabs. Runs entirely offline after initial model download.
Why use this skill?
Generate high-quality, offline text-to-speech with OpenClaw. Features 9 premium voices, 10 languages, and emotional voice control using the powerful Qwen3-TTS-12Hz-1.7B model.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/paki81/qwen-tts
What This Skill Does
The qwen-tts skill is a high-performance, local text-to-speech engine powered by the Qwen3-TTS-12Hz-1.7B-CustomVoice model. It lets OpenClaw users generate high-quality, expressive synthetic audio directly on their local hardware, bypassing expensive, privacy-invasive cloud APIs such as ElevenLabs. With support for 10 languages (including Italian, English, and Japanese) and a set of 9 distinct premium speaker voices, the skill offers fine-grained control over vocal output, including emotional nuance and stylistic delivery via instruction-based prompts.
Installation
To integrate this skill into your OpenClaw environment, run clawhub install openclaw/skills/skills/paki81/qwen-tts. Once installed, navigate to the skill directory at skills/public/qwen-tts and run bash scripts/setup.sh. This initializes a dedicated virtual environment and downloads the necessary dependencies. Note that the first time you run a speech synthesis task, the system automatically downloads roughly 1.7 GB of model weights from Hugging Face, so make sure you have sufficient disk space and a stable internet connection for this one-time setup.
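For reference, the whole setup boils down to the short shell sequence below; the skill directory path follows the layout described above, so adjust it if your OpenClaw installation differs.

# Install the skill from ClawHub
clawhub install openclaw/skills/skills/paki81/qwen-tts

# Enter the skill directory and run the setup script
cd skills/public/qwen-tts
bash scripts/setup.sh
# The first synthesis run will additionally fetch the model weights from Hugging Face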
Use Cases
This skill is perfect for creators needing local, offline voiceovers for multimedia projects. It is ideal for developers building voice-enabled applications, creating dynamic accessibility features for desktop tools, or generating interactive narrations within OpenClaw workflows. Because the model runs locally, it is suitable for sensitive data where privacy is paramount, as no audio data is transmitted to external servers.
Example Prompts
- "OpenClaw, use qwen-tts to generate an Italian audio file saying 'Benvenuto nel futuro del text-to-speech' using the Vivian voice and save it as welcome.wav."
- "Create a voice message using the Ryan speaker in English that says 'Hello, nice to meet you' with an enthusiastic and energetic tone."
- "Please list all available speakers for the qwen-tts module so I can choose the best voice for my narrations."
Tips & Limitations
For optimal results, prefer a speaker's native language, although the model is cross-lingually capable. Use the -i flag to control output style, for example 'Parla con entusiasmo' for Italian or 'Read like a narrator' for English. Since the 1.7B-parameter model runs entirely on local hardware, make sure your system has enough RAM to process requests smoothly. As an offline model, it is limited to the predefined voice library and does not perform voice cloning of arbitrary audio samples.
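The -i instruction flag is the one switch this document names explicitly; a style-controlled run, reusing the hypothetical script path from the earlier sketch, could look like this.

# Steer tone and delivery with an -i instruction prompt
# (-i is documented above; the script path and other flag names remain assumptions)
.venv/bin/python scripts/tts.py \
  --text "Benvenuto nel futuro del text-to-speech" \
  --speaker Vivian \
  -i "Parla con entusiasmo" \
  --output welcome_energico.wav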
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-paki81-qwen-tts": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags: AI
Flags: file-write, file-read, code-execution