Official Verified media Safety 5/5

wavespeed-minimax-speech-26

Convert text to speech using MiniMax Speech 2.6 Turbo via WaveSpeed AI. Features ultra-human voice cloning, sub-250ms latency, 40+ languages, emotion control, and 200+ voice presets. Use when the user wants to generate speech audio from text.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/chengzeyi/wavespeed-minimax-speech-26

What This Skill Does

The wavespeed-minimax-speech-26 skill lets OpenClaw AI perform high-fidelity text-to-speech (TTS) synthesis. It uses the MiniMax Speech 2.6 Turbo engine through WaveSpeed AI to convert text into human-like audio with sub-250ms latency. It supports 40+ languages and offers emotion control (e.g., happy, sad, angry), pause insertion via the <#x#> tag syntax, and control over audio output settings, including sample rate, bitrate, and format.
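As a rough illustration of the pause syntax, the sketch below builds <#x#> markers and joins text segments with them. It assumes that x is a pause length in seconds (consistent with the "2-second pause" example prompt, but not confirmed here). The helper names pause_tag and join_with_pauses are hypothetical and are not part of the skill.

```python
def pause_tag(seconds: float) -> str:
    """Build a pause marker in the <#x#> form.

    Assumption: x is a duration in seconds; check the skill's own
    documentation for the accepted range and precision.
    """
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    # ":g" drops a trailing ".0", so 2 becomes "2" and 0.5 stays "0.5".
    return f"<#{seconds:g}#>"


def join_with_pauses(segments, seconds):
    """Join non-empty text segments with the same pause marker between them."""
    cleaned = [s.strip() for s in segments if s.strip()]
    return pause_tag(seconds).join(cleaned)


if __name__ == "__main__":
    print(join_with_pauses(["Once upon a time.", "A dragon woke up."], 2))
```

Keeping the marker construction in one helper makes it easy to adjust if the accepted format differs from this assumption.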

Installation

To integrate this skill into your OpenClaw environment, execute the following command in your terminal:

clawhub install openclaw/skills/skills/chengzeyi/wavespeed-minimax-speech-26

After installation, ensure your environment variable WAVESPEED_API_KEY is configured with a valid key obtained from the WaveSpeed AI developer portal to enable authenticated communication with the API.
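One way to confirm that the key is visible to your process before invoking the skill is a small pre-flight check such as the sketch below. Only the variable name WAVESPEED_API_KEY comes from this page; the load_api_key helper is hypothetical and does not verify the key against the API.

```python
import os


def load_api_key(env=None):
    """Return the WaveSpeed API key from the environment, or raise if it is missing.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("WAVESPEED_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "WAVESPEED_API_KEY is not set; export it before using the skill"
        )
    return key


if __name__ == "__main__":
    try:
        load_api_key()
        print("WAVESPEED_API_KEY is set")
    except RuntimeError as err:
        print(err)
```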

Use Cases

This skill is ideal for developers and content creators looking to build:

1. Interactive voice assistants or conversational agents requiring natural, human-like responses.
2. Automated audiobook or podcast narration systems where emotion and pacing are critical for engagement.
3. Accessibility tools that require real-time conversion of web content or documents into high-quality speech.
4. Media production pipelines that need rapid prototyping for voice-overs.

Example Prompts

"Generate a spoken audio file of a story about a dragon, set to a calm, expressive narrator voice, with a 2-second pause after the introduction."
"Convert this article into an MP3 file using a neutral professional voice at 1.1x speed, with a 24kHz sample rate for standard web playback."
"Say 'Welcome to our platform' in an enthusiastic tone using the English_Cheerful_Speaker voice profile."

Tips & Limitations

For the most natural-sounding output, use the pause control syntax <#x#> sparingly. The model supports up to 10,000 characters per request; for longer documents, split the text into segments and process each segment separately. Emotion controls work with any voice but are most effective with expressive voice IDs. Choose the sample rate, bitrate, and format to suit your delivery channel: lower settings save bandwidth, while higher settings preserve fidelity.
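A chunking strategy for the 10,000-character limit could look like the sketch below. It splits on sentence boundaries and packs sentences greedily into segments, hard-splitting any single sentence that exceeds the limit. The chunk_text function is a hypothetical helper, and it assumes the limit counts plain characters, including any pause tags in the text.

```python
import re

MAX_CHARS = 10_000  # per-request character limit stated on this page


def chunk_text(text, limit=MAX_CHARS):
    """Split text into segments of at most `limit` characters.

    Sentences (ending in ., ! or ?) are kept whole where possible; a single
    sentence longer than `limit` is cut at the limit as a last resort.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Oversized sentence: flush what we have, then cut it into pieces.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    article = "First sentence. Second sentence! Third one?"
    for i, segment in enumerate(chunk_text(article, limit=20), start=1):
        print(i, segment)
```

Splitting at sentence boundaries keeps each request's audio from starting or ending mid-sentence, which matters when the resulting clips are concatenated.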

Metadata

Author: @chengzeyi

Stars: 3840

Updated: 2026-04-06

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-chengzeyi-wavespeed-minimax-speech-26": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#tts #audio #speech-synthesis #wavespeed #minimax

Safety Score: 5/5

Flags: external-api

Related Skills

wavespeed-watermark-remover

Remove watermarks, logos, captions, and text overlays from images and videos using WaveSpeed AI. Intelligently detects and removes watermarks while preserving texture and background. Supports images and videos up to 10 minutes. Use when the user wants to remove watermarks or text overlays from media.

chengzeyi 3840

wavespeed-face-swapper

Swap faces in images and videos using WaveSpeed AI. Supports image face swap and video face swap with multi-face targeting. Produces watermark-free results with automatic lighting and skin tone adaptation. Use when the user wants to replace a face in an image or video with another face.

chengzeyi 3840

wavespeed-infinitetalk

Generate talking head videos from a portrait image and audio using WaveSpeed AI's InfiniteTalk model. Produces lip-synced video up to 10 minutes long at 480p or 720p. Supports optional mask images to target specific faces and text prompts for additional guidance. Use when the user wants to animate a face with audio or create talking avatar videos.

chengzeyi 3840

wavespeed-seedream-45

Generate and edit images using ByteDance's Seedream V4.5 model via WaveSpeed AI. Supports text-to-image generation and multi-image editing with custom resolutions up to 4096x4096. Features enhanced typography for posters and logos. Use when the user wants to create or edit images with high-quality text rendering.

chengzeyi 3840

wavespeed-nano-banana-2

Generate and edit images using Google's Nano Banana 2 model via WaveSpeed AI. Supports text-to-image generation and image editing with natural language prompts. Features native 4K resolution, flexible aspect ratios including ultra-narrow (1:8, 8:1), multilingual text rendering, and camera-style controls. Use when the user wants to create images from text or edit existing images.

chengzeyi 3840