qwen3-tts-local-inference

Generate speech from text using Qwen3-TTS via direct Python inference, with no server required. Use when: (1) converting text to speech / synthesising audio, (2) creating voiceovers or spoken content, (3) cloning a voice from reference audio, (4) generating TTS with built-in speakers or custom voice descriptions. Supports custom-voice (9 speakers), voice-design (natural language), and voice-clone (~3 s reference). Outputs .wav files. Two models are available: 0.6B (small, default) and 1.7B (large). Runs entirely offline after model download.

Why use this skill?

Generate speech from text locally with Qwen3-TTS. Supports voice cloning, custom voices & voice design offline. No server needed.

What This Skill Does

The qwen3-tts-local-inference skill generates speech from text by running the Qwen3 Text-to-Speech (TTS) model directly on your local machine. Unlike skills that rely on external servers or APIs, it performs all operations within your Python environment and, after the initial model download, works fully offline. It supports three generation modes: 'custom-voice', which offers 9 built-in speakers with optional emotion and style adjustments; 'voice-design', where you describe the desired voice characteristics in natural language; and 'voice-clone', which replicates a voice from a short reference audio clip (approximately 3 seconds).

The skill writes audio as standard .wav files. Two model sizes are available: the default 0.6B (small) model, which is faster and uses less memory, and the 1.7B (large) model, which may produce higher quality at the cost of more resources. The skill suits scenarios that need text-to-speech without internet connectivity or the overhead of managing a separate server.
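The three modes take different inputs, which can be easier to see as a small validation sketch. Everything below is illustrative: `TTSRequest` and its field names are hypothetical and are not the skill's actual API, and the assumption that voice-clone needs a transcript alongside the reference audio is inferred from the example prompts rather than confirmed.

```python
from dataclasses import dataclass
from typing import Optional

# Mode and model-size names come from the skill description.
MODES = ("custom-voice", "voice-design", "voice-clone")
MODEL_SIZES = ("0.6B", "1.7B")

# Hypothetical mapping of each mode to the fields it needs (an assumption).
REQUIRED_FIELDS = {
    "custom-voice": ["speaker"],
    "voice-design": ["voice_description"],
    "voice-clone": ["reference_audio", "reference_text"],
}


@dataclass
class TTSRequest:
    """Hypothetical container for one synthesis request."""
    text: str
    mode: str = "custom-voice"
    model_size: str = "0.6B"                 # the small model is the default
    speaker: Optional[str] = None            # custom-voice: built-in speaker name
    instruct: Optional[str] = None           # custom-voice: optional style hint
    voice_description: Optional[str] = None  # voice-design: natural-language description
    reference_audio: Optional[str] = None    # voice-clone: path to the reference clip
    reference_text: Optional[str] = None     # voice-clone: transcript of the clip

    def validate(self) -> None:
        """Raise ValueError if the request is inconsistent with its mode."""
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")
        if self.model_size not in MODEL_SIZES:
            raise ValueError(f"unknown model size: {self.model_size!r}")
        missing = [f for f in REQUIRED_FIELDS[self.mode] if not getattr(self, f)]
        if missing:
            raise ValueError(f"{self.mode} requires: {', '.join(missing)}")
```

For example, `TTSRequest(text="Hello", speaker="Ryan").validate()` passes, while a voice-clone request without a transcript raises `ValueError`.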

Installation

To install the qwen3-tts-local-inference skill, first set up the OpenClaw environment. Then run the following command in your terminal:

clawhub install openclaw/skills/skills/jithinm/qwen3-tts-local-inference

After installation, set up the skill's dependencies. Navigate to the skill's directory (typically within your OpenClaw installation) and run:

bash scripts/setup.sh

By default, the Qwen3-TTS models will be downloaded to a models/ directory within the skill. You can customize this download location using the QWEN_TTS_MODEL_DIR environment variable or by specifying the --model-dir flag during model download or execution. To download models to a specific path, use:

python scripts/download_models.py --model-dir /path/to/your/custom/model/directory
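The lookup for the model directory can be sketched as follows. This is a minimal illustration, not the skill's own code: the function name `resolve_model_dir` is hypothetical, and the precedence shown (flag first, then `QWEN_TTS_MODEL_DIR`, then the default `models/` directory) is an assumption about how the two options interact.

```python
import argparse
import os
from pathlib import Path

ENV_VAR = "QWEN_TTS_MODEL_DIR"   # environment variable named in the skill docs
DEFAULT_DIR = Path("models")     # default location, relative to the skill directory


def resolve_model_dir(argv=None, environ=None):
    """Return the model directory: --model-dir flag, then env var, then default.

    The precedence order is an assumption, not documented behaviour.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--model-dir", default=None)
    # parse_known_args ignores any other flags the real script may accept.
    args, _unknown = parser.parse_known_args(argv)
    if args.model_dir:
        return Path(args.model_dir).expanduser()
    if environ.get(ENV_VAR):
        return Path(environ[ENV_VAR]).expanduser()
    return DEFAULT_DIR
```

Passing `argv` and `environ` explicitly keeps the function easy to test; called with no arguments it reads the process command line and environment.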

Use Cases

Typical applications include:

Content Creation: Generate voiceovers for videos, podcasts, audiobooks, or presentations directly on your local machine.
Accessibility: Convert written content into spoken audio for users who prefer auditory information.
Prototyping & Development: Quickly integrate speech synthesis into applications without relying on external services, useful during development or for offline-first applications.
Personalized Audio: Create custom audio messages or notifications with specific voice characteristics or cloned voices.
Language Learning Tools: Generate pronunciation examples or spoken dialogues in various languages.
Game Development: Create in-game character voices or narrative audio locally.

Example Prompts

Here are three examples of prompts you might use with the qwen3-tts-local-inference skill:

"Convert the following text to speech using the 'Ryan' voice, in English: 'Welcome to our demonstration of the Qwen3 Text-to-Speech local inference skill.'"
"Create an audio file with a cheerful and energetic female voice describing a new product launch."
"Clone the voice from this audio clip and read the text 'This is a test of the voice cloning feature.' Make sure to use the provided reference audio and its transcript."

Tips & Limitations

Model Size: The 0.6B model is faster and uses less memory, while the 1.7B model may offer higher quality but requires more resources.
Offline Use: Once models are downloaded, the skill operates entirely offline, which is a significant advantage for privacy and accessibility.
Voice Cloning Quality: The quality of voice cloning depends heavily on the clarity and duration (around 3 seconds) of the reference audio. Background noise can degrade the clone's quality.
Customization: Experiment with the --instruct parameter in 'custom-voice' mode to fine-tune the emotion and style of the built-in speakers. For 'voice-design', be descriptive but concise when defining the target voice.
Output Format: All audio is generated as .wav files.
Resource Intensive: Although inference runs offline, TTS models, especially the larger 1.7B model, can be CPU and memory intensive. Ensure your system has adequate resources for smooth operation.
Language Support: While the skill supports many languages, the quality and availability of specific voices or stylistic nuances may vary.
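Because cloning quality depends on the reference clip's duration, it can help to check a clip before using it. The sketch below uses only Python's standard `wave` module and works for uncompressed PCM .wav files; the helper names are hypothetical, and the 3.0-second cutoff simply mirrors the "around 3 seconds" guidance rather than a documented requirement.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_reference_clip(path, min_seconds=3.0):
    """Return (duration, ok); ok is False when the clip is shorter than min_seconds.

    The default cutoff is an assumption based on the "around 3 seconds" tip.
    """
    duration = wav_duration_seconds(path)
    return duration, duration >= min_seconds
```

This only measures length; it does not detect background noise, which still needs to be judged by listening to the clip.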

Metadata

Author: @jithinm

Stars: 1947

Updated: 2026-03-04

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-jithinm-qwen3-tts-local-inference": {
      "enabled": true,
      "auto_update": true
    }
  }
}
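If you would rather not edit the file by hand, the same entry can be merged in programmatically. This is a sketch under assumptions: the helper `enable_plugin` is hypothetical, and it assumes clawhub.json is plain JSON with a top-level "plugins" object, as in the snippet above; existing entries are left untouched.

```python
import json
from pathlib import Path

# Plugin key copied from the configuration snippet above.
PLUGIN_KEY = "official-jithinm-qwen3-tts-local-inference"


def enable_plugin(config_path, auto_update=True):
    """Add or update the plugin entry in a clawhub.json-style file."""
    path = Path(config_path)
    # Start from the existing config if present, otherwise from an empty one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_KEY] = {"enabled": True, "auto_update": auto_update}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `enable_plugin("clawhub.json")` from the directory that holds the file would write the entry shown above; the file's actual location depends on your installation.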

Tags (AI)

#tts #speech synthesis #offline #qwen3

Safety Score: 4/5

Flags: file-write, file-read, code-execution
