Official Verified media Safety 4/5

voice-tts

Voice input (Whisper ASR) plus voice output (Edge TTS) skill. Supports agent-specific voices and can call send_voice_reply.mjs to send Telegram voice messages.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/believe3344/voice-tts

What This Skill Does

The voice-tts skill adds speech recognition (ASR) and speech synthesis (TTS) to OpenClaw agents. It lets an agent understand voice messages sent via platforms such as Telegram, Discord, or Lark, and respond with natural-sounding voice replies. Whisper handles transcription and Edge TTS handles speech synthesis, so the agent can return both the transcribed text and an audio response.
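The transcribe-then-reply flow described above can be sketched roughly as follows. This is a minimal illustration, not the skill's actual code: the function names (transcribe, synthesize, voice_round_trip) and the make_reply callback are hypothetical, and it assumes the openai-whisper and edge-tts Python packages are installed. Both libraries are imported lazily so the orchestration logic can be read and exercised on its own.

```python
import asyncio


def transcribe(audio_path, model_name="base"):
    """Transcribe an audio file with OpenAI Whisper and return the text."""
    import whisper  # lazy import; provided by the openai-whisper package

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"].strip()


async def synthesize(text, out_path, voice="en-US-JennyNeural"):
    """Render text to an audio file with Edge TTS and return the file path."""
    import edge_tts  # lazy import; provided by the edge-tts package

    await edge_tts.Communicate(text, voice).save(out_path)
    return out_path


def voice_round_trip(audio_in, audio_out, make_reply,
                     transcribe_fn=transcribe, synthesize_fn=synthesize):
    """Hypothetical glue: voice in -> text -> reply text -> voice out."""
    heard = transcribe_fn(audio_in)          # ASR step
    reply = make_reply(heard)                # the agent decides what to say
    asyncio.run(synthesize_fn(reply, audio_out))  # TTS step
    return {"transcript": heard, "reply_text": reply, "reply_audio": audio_out}
```

Passing the ASR and TTS steps in as parameters is only a convenience for this sketch; it makes it easy to swap in a different model or to stub out the audio work when testing the surrounding logic.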

Installation

To install this skill, use the clawhub command: clawhub install openclaw/skills/skills/believe3344/voice-tts.

After installation, follow these mandatory steps:

  1. Navigate to the scripts/ directory within the skill path.
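  2. Rename the script files by removing the .txt extension so they can be run.
  3. Install the Python dependencies via pip: pip install edge-tts openai-whisper torch click. (On PyPI, OpenAI's Whisper is published as openai-whisper; the package named whisper is an unrelated project.)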
  4. Ensure ffmpeg is installed on your system (via brew install ffmpeg on macOS or sudo apt install ffmpeg on Ubuntu).
  5. Configure openclaw.json as detailed in the documentation to enable audio tool integration.
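Steps 2 and 4 above can be automated with a short helper. The following is a sketch under stated assumptions: the helper names (strip_txt_extensions, missing_tools) are hypothetical, and it assumes every file ending in .txt inside scripts/ is a script that should be renamed, which may not hold for your copy of the skill.

```python
import shutil
from pathlib import Path


def strip_txt_extensions(scripts_dir):
    """Rename every '*.txt' file in scripts_dir by dropping the '.txt' suffix."""
    renamed = []
    for path in sorted(Path(scripts_dir).glob("*.txt")):
        target = path.with_suffix("")  # e.g. 'send_voice_reply.mjs.txt' -> 'send_voice_reply.mjs'
        if target.exists():
            continue  # never overwrite a file that is already in place
        path.rename(target)
        renamed.append(target)
    return renamed


def missing_tools(tools=("ffmpeg",)):
    """Return the names of required command-line tools not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling strip_txt_extensions("scripts") from the skill directory and then checking that missing_tools() returns an empty list covers both steps; a non-empty list means ffmpeg still needs to be installed.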

Use Cases

This skill is perfect for agents operating in messaging environments where users are on the move or prefer hands-free interaction. It is ideal for:

  • Virtual assistants handling customer inquiries received via voice notes.
  • Personal productivity agents that provide vocal summaries and reminders.
  • Multimodal bots in social communities that want to engage users through natural language interactions.
  • Automated transcription services that archive voice messages for later text review.

Example Prompts

  1. "(User sends a 30-second voice message)" -> The agent transcribes the message using Whisper and replies with a conversational text response plus an audio file generated by Edge TTS.
  2. "用语音读出最新的周报摘要" (Read the latest weekly report summary using voice)
  3. "请把这个消息用语音回复我" (Please reply to this message using voice)

Tips & Limitations

  • Model Selection: For faster performance on limited hardware, use the tiny or base Whisper models. For higher accuracy, large-v3 is recommended, though it requires significantly more RAM.
  • Voice Customization: You can customize the agent's tone by switching between available Edge TTS neural voices like zh-CN-XiaoxiaoNeural or en-US-JennyNeural.
  • FFmpeg Requirement: Do not skip the ffmpeg installation. Whisper relies on it to decode audio files, and the pipeline depends on it to convert between the audio formats used for inbound and outbound voice messages.
  • Automatic Hooks: Take advantage of the auto_voice_check script for batch processing inbound voice messages to keep your agent's task queue efficient.
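The model and voice tips above can be expressed as two small lookup helpers. This is an illustrative sketch only: the memory figures are ballpark values in the spirit of the approximate requirements listed in the openai-whisper README, the 1 GB headroom is an arbitrary assumption, and the agent IDs in AGENT_VOICES are hypothetical placeholders rather than names the skill defines.

```python
# Approximate memory needs in GB per Whisper model, smallest to largest.
# Ballpark figures only; measure on your own hardware before relying on them.
APPROX_MEMORY_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large-v3": 10}

# Hypothetical agent IDs mapped to Edge TTS neural voices.
AGENT_VOICES = {
    "assistant-zh": "zh-CN-XiaoxiaoNeural",
    "assistant-en": "en-US-JennyNeural",
}
DEFAULT_VOICE = "en-US-JennyNeural"


def pick_whisper_model(available_gb, headroom_gb=1.0):
    """Return the largest model whose approximate footprint fits the budget."""
    budget = available_gb - headroom_gb
    best = "tiny"  # fall back to the smallest model if nothing fits
    for name, needed_gb in APPROX_MEMORY_GB.items():  # dict keeps ascending order
        if needed_gb <= budget:
            best = name
    return best


def voice_for_agent(agent_id):
    """Look up an agent-specific voice, falling back to the default voice."""
    return AGENT_VOICES.get(agent_id, DEFAULT_VOICE)
```

For example, pick_whisper_model(6) selects medium under these assumed figures, while a machine with 16 GB free would get large-v3.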

Metadata

Stars: 4473
Views: 1
Updated: 2026-05-01
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-believe3344-voice-tts": {
      "enabled": true,
      "auto_update": true
    }
  }
}
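If clawhub.json already lists other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the entry instead. It is an assumption-laden illustration: the enable_plugin helper is hypothetical, and it assumes clawhub.json is plain JSON with a top-level "plugins" object as shown above.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-believe3344-voice-tts"


def enable_plugin(config_path, plugin_id=PLUGIN_ID):
    """Add or update one plugin entry while leaving other settings untouched."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty one.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    entry = plugins.setdefault(plugin_id, {})
    entry.update({"enabled": True, "auto_update": True})
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Running enable_plugin("clawhub.json") creates the file if it is missing and otherwise preserves any plugin entries that are already present.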

Tags (AI)

#voice #tts #asr #audio #multimodal

Safety Score: 4/5

Flags: file-read, file-write, external-api