Official Verified media Safety 4/5

qwen3-audio

High-performance audio library for Apple Silicon with text-to-speech (TTS) and speech-to-text (STT).

Why use this skill?

Run text-to-speech, speech-to-text, and voice cloning locally on your Apple Silicon Mac with Qwen3-Audio. Design custom voices and transcribe audio without leaving your machine.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/darknoah/qwen3-audio

What This Skill Does

Qwen3-Audio is a high-performance audio processing suite built for Apple Silicon hardware (M1-M4). It provides native Text-to-Speech (TTS) and Speech-to-Text (STT), along with voice cloning, emotion-aware synthesis, and generative voice design, so you can synthesize speech that matches specific linguistic and stylistic requirements. It acts as an all-in-one local audio processing engine for OpenClaw.

Installation

To install this skill, use the ClawHub CLI: clawhub install openclaw/skills/skills/darknoah/qwen3-audio. Before running it, confirm that your system is an Apple Silicon Mac with Python 3.10+ installed, then work through the checklist at ./references/env-check-list.md.
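The two requirements named above (Apple Silicon and Python 3.10+) can be checked with a few lines of standard-library Python. This is a minimal sketch, not part of the skill: the function name check_environment is hypothetical, and it does not replace the checklist in ./references/env-check-list.md, whose contents are not shown here.

```python
import platform
import sys


def check_environment(system=None, machine=None, version=None):
    """Return a list of problems; an empty list means both requirements are met.

    Arguments default to the values of the running interpreter. They are
    parameters only so the check can be exercised on any machine.
    """
    system = system if system is not None else platform.system()
    machine = machine if machine is not None else platform.machine()
    version = tuple(version) if version is not None else sys.version_info[:2]

    problems = []
    # macOS reports "Darwin"; Apple Silicon reports the machine type "arm64".
    if system != "Darwin" or machine != "arm64":
        problems.append(f"Apple Silicon Mac required, found {system}/{machine}")
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    return problems
```

Calling check_environment() with no arguments inspects the current machine; print each returned string to see what is missing.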

Use Cases

Automated Transcription: Efficiently process long-form audio files or meeting recordings into text formats like SRT or TXT for accessibility or documentation.
Voice Branding: Clone a specific brand voice using reference samples to ensure consistent tone across all automated customer-facing audio responses.
Content Creation: Generate natural-sounding audio content for video projects, podcasts, or accessibility features by providing simple text scripts and stylistic prompts.
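For the transcription use case, it may help to see what an SRT file contains: numbered cues, each with a start and end timestamp in HH:MM:SS,mmm form, followed by the cue text. The sketch below renders a list of timed segments in that layout. The (start, end, text) tuple shape and both function names are assumptions made for illustration; the format the skill itself emits is not documented here.

```python
def format_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT text."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

For example, segments_to_srt([(0.0, 2.5, "Hello"), (2.5, 4.0, "World")]) yields two cues, the first spanning 00:00:00,000 --> 00:00:02,500.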

Example Prompts

"Convert this recording of our team meeting at ./recordings/meeting_01.wav into a synchronized SRT file to help me create subtitles for the video recap."
"Create a new synthetic voice for my virtual assistant that sounds like a professional, calm, and friendly customer support representative using the description: 'A soft-spoken, empathetic middle-aged professional voice.'"
"Synthesize the following text into an audio file: 'Welcome to our platform, please select an option from the menu.' Use the 'Ryan' speaker preset and make it sound energetic and welcoming."

Tips & Limitations

Optimization: The skill runs on the MLX backend, so on Apple Silicon it should be significantly faster than standard CPU-based alternatives. Keep your Python environment clean so that nothing interferes with the MLX backend.
Voice Storage: Always organize your voices in the voices/ folder. For best cloning results, make sure ref_text.txt matches what is spoken in ref_audio.wav.
Limitations: Currently, this tool runs only on Apple Silicon hardware. It does not support cloud-based synthesis; all processing is local, so your audio data stays private.
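The voice storage tip can be partly automated. The sketch below checks that a voice folder contains both reference files and that the reference text is not empty. It assumes one subfolder per voice (for example voices/my_voice/), which the listing does not confirm, and the function name check_voice_dir is hypothetical. It cannot tell whether the text actually matches the audio; that still needs a manual check.

```python
from pathlib import Path


def check_voice_dir(voice_dir):
    """Return a list of problems found in one voice folder (empty if none)."""
    voice_dir = Path(voice_dir)
    audio = voice_dir / "ref_audio.wav"
    text = voice_dir / "ref_text.txt"

    problems = []
    if not audio.is_file():
        problems.append(f"missing {audio.name}")
    if not text.is_file():
        problems.append(f"missing {text.name}")
    elif not text.read_text(encoding="utf-8").strip():
        # A blank reference text gives the cloning step nothing to work with.
        problems.append(f"{text.name} is empty")
    return problems
```

Running it over every subfolder of voices/ before a cloning session gives a quick list of incomplete voices.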

Metadata

Author: @darknoah

Stars: 3376

Updated: 2026-03-24

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-darknoah-qwen3-audio": {
      "enabled": true,
      "auto_update": true
    }
  }
}
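If clawhub.json already holds other settings, pasting by hand risks overwriting them. The following sketch merges the plugin entry shown above into an existing file, or creates the file if it is absent. It uses only Python's json module; the function name enable_plugin is hypothetical, and the location of clawhub.json on your system is something you need to supply.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-darknoah-qwen3-audio"


def enable_plugin(config_path):
    """Add the plugin entry to a clawhub.json file, keeping other settings."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Create the "plugins" section only if it does not exist yet.
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Back up the file first if it contains settings you cannot easily recreate.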

Tags (AI)

#audio #tts #stt #apple-silicon #voice-cloning

Safety Score: 4/5

Flags: file-write, file-read, code-execution
