minimax-tools

Direct MiniMax API integration for speech synthesis (TTS), voice cloning, image generation, video generation, and music generation using local Python scripts instead of MCP. Use when you want reliable script-based MiniMax workflows inside OpenClaw for: (1) text-to-speech with built-in Chinese/English defaults or explicit voice IDs, (2) voice cloning with upload + preview flows, (3) text-to-image or reference-image generation, (4) text-to-video, image-to-video, or first/last-frame video generation with async polling/download, and (5) music generation from prompts and lyrics.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/cytwyatt/minimax-tools-skill

MiniMax Tools

Use this skill to call MiniMax multimodal APIs directly through local Python wrappers instead of relying on an external MCP server.

Overview

This skill currently supports:

Speech synthesis (TTS)
Voice cloning
Image generation
Video generation
Music generation

All wrappers are exposed through a single entrypoint script:

python3 scripts/minimax.py <subcommand> ...

Read references/api-notes.md only when you need endpoint details or parameter reminders.

Prerequisites

Expect these environment variables to be available before running the scripts:

MINIMAX_API_KEY

Optional:

MINIMAX_BASE_URL if you need to override the default API host

Python dependency:

requests
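
As a minimal sketch of a pre-flight check for the variables above (the helper name load_minimax_env is hypothetical and not part of the skill), the following reads MINIMAX_API_KEY and the optional MINIMAX_BASE_URL override. It returns None for the override when it is unset, so the scripts' own default host would still apply.

```python
import os


def load_minimax_env(environ=None):
    """Return (api_key, base_url_override); raise if the key is missing.

    Hypothetical helper for illustration; the skill's scripts may read
    these variables differently.
    """
    env = os.environ if environ is None else environ
    api_key = env.get("MINIMAX_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("MINIMAX_API_KEY is not set")
    # None means "no override": the scripts' built-in default host applies.
    base_url = env.get("MINIMAX_BASE_URL", "").strip() or None
    return api_key, base_url
```

Calling load_minimax_env() before invoking any subcommand surfaces a missing key early instead of as an API error.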

Routing guide

Use tts for speech synthesis
Use voice for uploading clone inputs, creating cloned voices, and optionally downloading preview audio
Use image for text-to-image or reference-image generation
Use video for text-to-video, image-to-video, or first/last-frame video workflows
Use music for song or instrumental generation
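
The routing above can be sketched as a small argv builder. This is an illustrative helper (build_argv is a hypothetical name), and it only assembles the entrypoint and subcommand; any further arguments are passed through untouched because their exact flags are defined by the scripts themselves.

```python
KNOWN_SUBCOMMANDS = ("tts", "voice", "image", "video", "music")


def build_argv(subcommand, *extra, python="python3"):
    """Build an argument list for the unified entrypoint.

    Hypothetical helper: it validates only the top-level subcommand and
    forwards everything else unchanged.
    """
    if subcommand not in KNOWN_SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    return [python, "scripts/minimax.py", subcommand, *extra]
```

The resulting list can be handed to subprocess.run(argv, check=True) from the skill's root directory.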

TTS defaults

Default model: speech-2.8-turbo
Default format: mp3
Default sample rate: 32000 Hz
Default bitrate: 128000 bps
Default Chinese voice: Chinese (Mandarin)_Lyrical_Voice
Default English voice: English_Graceful_Lady
If --voice is omitted, the script picks the default voice for --voice-lang (zh or en); --voice-lang itself defaults to zh
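
The voice fallback described above can be sketched as follows. This is an assumption about the logic, not the code in scripts/minimax_tts.py, and resolve_voice is a hypothetical name; the voice IDs are the defaults listed in this section.

```python
DEFAULT_VOICES = {
    "zh": "Chinese (Mandarin)_Lyrical_Voice",
    "en": "English_Graceful_Lady",
}


def resolve_voice(voice=None, voice_lang="zh"):
    """Pick an explicit voice ID if given, else the per-language default."""
    if voice:
        return voice
    if voice_lang not in DEFAULT_VOICES:
        raise ValueError(f"voice_lang must be 'zh' or 'en', got {voice_lang!r}")
    return DEFAULT_VOICES[voice_lang]
```

An explicit voice always wins, so a cloned voice_id can be passed straight through without touching the language setting.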

Voice cloning notes

Clone source audio constraints:
- mp3, m4a, or wav
- 10 seconds to 5 minutes
- <= 20 MB
Optional prompt audio constraints:
- mp3, m4a, or wav
- under 8 seconds
- <= 20 MB
If cloning succeeds, the returned voice_id can be used immediately in TTS
According to the MiniMax documentation, a cloned voice is temporary unless it is used in a real TTS request within 7 days
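
The constraints above can be checked locally before uploading. The sketch below is a hypothetical helper (check_clone_audio is not part of the skill) that takes the file name, size, and duration as plain values, so measuring the duration is left to the caller. It assumes "20 MB" means 20 MiB and treats the 10 second and 5 minute limits as inclusive; both are assumptions.

```python
from pathlib import PurePath

ALLOWED_SUFFIXES = {".mp3", ".m4a", ".wav"}
MAX_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB


def check_clone_audio(filename, size_bytes, duration_s, *, is_prompt=False):
    """Return a list of constraint violations (an empty list means OK)."""
    problems = []
    if PurePath(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("format must be mp3, m4a, or wav")
    if size_bytes > MAX_BYTES:
        problems.append("file must be <= 20 MB")
    if is_prompt:
        # Prompt audio: strictly under 8 seconds.
        if not duration_s < 8:
            problems.append("prompt audio must be under 8 seconds")
    elif not 10 <= duration_s <= 300:
        # Clone source audio: 10 seconds to 5 minutes (bounds assumed inclusive).
        problems.append("source audio must be 10 seconds to 5 minutes")
    return problems
```

Returning a list rather than raising on the first failure lets a caller report every problem with a file in one pass.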

Video support

Supported modes:

text-to-video: video create
image-to-video: video i2v
first/last-frame video: video fl2v

Video creation is asynchronous. Use video query, video wait, and video download for task follow-up.
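
For callers that drive the follow-up themselves rather than using video wait, a polling loop with exponential backoff might look like the sketch below. The function and its parameters are hypothetical: query stands for whatever performs one status check, and is_finished decides whether the returned value is terminal, since the actual status values are defined by the MiniMax API and are not assumed here.

```python
import time


def wait_for_task(query, is_finished, *, initial_delay=5.0, max_delay=60.0,
                  timeout=900.0, sleep=time.sleep, clock=time.monotonic):
    """Poll query() until is_finished(result) is true, backing off each round.

    Raises TimeoutError if the next sleep would pass the deadline.
    """
    delay = initial_delay
    deadline = clock() + timeout
    while True:
        result = query()
        if is_finished(result):
            return result
        if clock() + delay > deadline:
            raise TimeoutError("task did not finish before the timeout")
        sleep(delay)
        # Double the wait each round, capped at max_delay.
        delay = min(delay * 2, max_delay)
```

The sleep and clock parameters exist only so the loop can be exercised in tests without real waiting; the default delays are illustrative, not values recommended by MiniMax.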

File handling rules

Prefer saving outputs locally and returning file paths
Local image inputs for image/video wrappers can be converted to Data URLs automatically
When MiniMax returns temporary files, prefer URL-based output and download the file immediately
Avoid tight polling loops for async video jobs
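
The Data URL conversion mentioned above can be sketched with the standard library alone. This illustrates the general technique and is not the wrappers' actual code; to_data_url is a hypothetical name, and the MIME type is guessed from the file name.

```python
import base64
import mimetypes


def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 Data URL."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image file name: {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

For a file on disk, pass pathlib.Path(path).read_bytes() as data. Base64 inflates the payload by roughly a third, which matters for large reference images.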

Resources

scripts/minimax.py - unified CLI entrypoint
scripts/minimax_tts.py - TTS wrapper
scripts/minimax_voice.py - voice cloning wrapper
scripts/minimax_image.py - image generation wrapper
scripts/minimax_video.py - video generation wrapper
scripts/minimax_music.py - music generation wrapper
references/api-notes.md - focused API notes and constraints

Metadata

Author: @cytwyatt

Stars: 3409

Updated: 2026-03-25

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-cytwyatt-minimax-tools-skill": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.