What This Skill Does

The voice-stt-tts skill adds spoken input and output to OpenClaw. It pairs the faster-whisper transcription engine, which runs locally, with the Edge TTS synthesizer to support hands-free use. When triggered, the skill captures audio input, transcribes it to text with a configurable model (default: 'small'), and lets the agent reply in a synthesized voice. The result is a two-way conversational loop in which the agent listens, processes the request, and speaks its response.
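The speaking half of that loop can be sketched in Python with the edge-tts package. This is an illustration, not the skill's own code: the function name `synthesize`, the default voice, and the output path are assumptions. The edge-tts package calls Microsoft's online speech service, so this step needs network access even though transcription runs locally.

```python
import asyncio

# Assumed default voice; any voice name supported by edge-tts can be used.
DEFAULT_VOICE = "en-US-AriaNeural"


async def synthesize(text, out_path, voice=DEFAULT_VOICE):
    """Write `text` as speech to `out_path` and return the path.

    Hypothetical helper for illustration only.
    """
    if not text.strip():
        raise ValueError("nothing to synthesize")
    # Imported lazily so the module loads even when edge-tts is not installed.
    import edge_tts

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path
```

A reply could then be produced with `asyncio.run(synthesize("Your standup note was added.", "reply.mp3"))`.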

Installation

Installation follows the usual modular pattern in the OpenClaw ecosystem. First, create a dedicated Python virtual environment at ~/.openclaw/workspace/voice-messages to keep dependencies isolated. Then install faster-whisper and its core dependencies with the provided pip command. Next, save the transcribe.py script provided in the documentation to the workspace directory and make it executable with chmod +x. Finally, update your ~/.openclaw/openclaw.json configuration file so that the CLI tool points to your local Python interpreter and the script path. With this in place, the agent can process media files received during a session.
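For orientation, a script in the role of transcribe.py might look roughly like the sketch below. It is not the script shipped with the skill documentation: the helper names and the 'int8' compute type are assumptions, and only the faster-whisper calls (`WhisperModel` and `transcribe` with `vad_filter`) come from that library's public API.

```python
#!/usr/bin/env python3
"""Illustrative stand-in for transcribe.py (not the skill's actual file)."""
import sys


def join_segments(segments):
    """Join the text of each segment into one string, skipping empty ones."""
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())


def transcribe(path, model_size="small", device="cpu", compute_type="int8"):
    """Transcribe the audio file at `path` and return plain text."""
    # Imported lazily so join_segments works without faster-whisper installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device, compute_type=compute_type)
    # vad_filter=True asks faster-whisper to skip stretches without speech.
    segments, _info = model.transcribe(path, vad_filter=True)
    return join_segments(segments)


# Only run as a CLI when an audio path is supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    print(transcribe(sys.argv[1]))
```

Printing the transcript to standard output keeps the script easy to call from a CLI tool entry, which is assumed here to read the text from stdout.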

Use Cases

This skill is perfect for users who prefer voice-based interactions over traditional typing. It is ideal for hands-free operations, such as dictating tasks while busy, recording voice memos for later summarization, or creating a voice-interactive interface for home automation projects. It is also an excellent tool for accessibility, allowing users with limited mobility to interact with complex agentic workflows using only voice commands.

Example Prompts

"OpenClaw, listen to my instructions for the daily standup report: add a note about the server migration delay."
"Hey, what is on my calendar for tomorrow? Please read it back to me."
"Transcribe this meeting audio file and give me a bulleted summary of the main decisions."

Tips & Limitations

Match the model size to your hardware: 'small' runs efficiently on most CPUs, while 'large-v3' gives better accuracy on machines with a CUDA-capable GPU. VAD (Voice Activity Detection) is enabled by default and skips silent stretches, but it does not compensate for a poor recording, so keep your microphone input clear. The skill needs approximately 250MB of disk space for the Whisper model and dependencies; larger models such as 'large-v3' need considerably more. If you plan to process long audio files, check the timeoutSeconds setting in the configuration and raise it as needed, because transcription time grows roughly linearly with audio duration.
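The two tuning decisions above, model choice and timeout, can be sketched as small Python helpers. The function names, the real-time factor of 0.5, and the 15-second overhead are illustrative assumptions, not measured values; time a sample file on your own hardware before relying on them.

```python
import math


def pick_model(has_cuda):
    """Return faster-whisper settings suited to the available hardware."""
    if has_cuda:
        return {"model_size": "large-v3", "device": "cuda", "compute_type": "float16"}
    return {"model_size": "small", "device": "cpu", "compute_type": "int8"}


def estimate_timeout(audio_seconds, realtime_factor=0.5, overhead_seconds=15.0):
    """Estimate a timeoutSeconds value under a linear-scaling assumption.

    realtime_factor is processing time per second of audio (assumed);
    overhead_seconds covers model loading and startup (assumed).
    """
    return math.ceil(audio_seconds * realtime_factor + overhead_seconds)
```

Under these assumed numbers, a 10-minute recording (600 seconds) gives `estimate_timeout(600)` equal to 315, so a timeoutSeconds value well above that leaves headroom.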
