mlx-audio-server
Local 24x7 OpenAI-compatible API server for STT/TTS, powered by MLX on your Mac.
Why use this skill?
Power your OpenClaw agents with local, high-speed Speech-To-Text and Text-To-Speech using MLX on Apple Silicon. Fast, private, and 24/7.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/guoqiao/mlx-audio-server
What This Skill Does
The mlx-audio-server skill provides a local-first bridge between OpenClaw and the hardware in your Apple Silicon Mac. It deploys an OpenAI-compatible API server built on the MLX framework, enabling low-latency Speech-To-Text (STT) and Text-To-Speech (TTS). Because MLX runs the models on-device on Apple Silicon, audio processing is fast and private, and all data stays on your local machine. This lets voice-enabled agents "hear" audio files and "speak" responses without relying on paid cloud APIs that require uploading your audio.
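As a rough illustration of what "OpenAI-compatible" could mean for TTS, the sketch below builds a request in the shape of the OpenAI speech endpoint (`POST /v1/audio/speech` with a JSON body). The base URL and port, the exact model identifier, and whether this server accepts a `voice` field are all assumptions, not details confirmed by this page; check your own installation before relying on them.

```python
import json
import urllib.request

# Assumption: host and port are placeholders; use whatever your install reports.
BASE_URL = "http://localhost:8000"


def build_speech_request(text, model="Qwen3-TTS", voice=None, base_url=BASE_URL):
    """Build (but do not send) a POST request shaped like OpenAI's speech API."""
    payload = {"model": model, "input": text}
    if voice is not None:
        # Assumption: the server may or may not honor a voice name.
        payload["voice"] = voice
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path, **kwargs):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(build_speech_request(text, **kwargs)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as fh:
        fh.write(audio)
    return out_path
```

With the server running, `synthesize("The automation workflow completed successfully", "out.wav")` would write whatever audio the server returns; the file extension here is only a guess at the output format.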
Installation
To integrate this capability, run the command clawhub install openclaw/skills/skills/guoqiao/mlx-audio-server in your terminal. This command pulls the necessary scripts and dependencies, including the mlx-audio-server Homebrew formula from guoqiao/tap. The installation process automatically verifies that critical utilities like ffmpeg and jq are present on your system. It also registers the server as a macOS LaunchAgent, ensuring the audio server remains active in the background, ready to process requests 24/7 without manual intervention.
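If you want to confirm those prerequisites yourself before installing, a minimal sketch of such a check follows. It is not the installer's own logic, only an illustration of looking up `ffmpeg` and `jq` on your PATH; the function name and its `which` parameter are introduced here for illustration.

```python
import shutil

# The two utilities the installer is described as verifying.
REQUIRED_TOOLS = ("ffmpeg", "jq")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

Calling `missing_tools()` returns an empty list when both utilities are installed, and otherwise the names you still need to install.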
Use Cases
- Voice-to-Text Transcription: Automatically transcribe meetings, interviews, or voice memos directly on your Mac.
- Voice-Enabled Agent Interaction: Enable OpenClaw to speak its responses aloud, creating a more human-like interface for your automation workflows.
- Offline Media Processing: Analyze video or audio assets locally for content extraction or indexing without uploading files to third-party services.
- Privacy-First Dictation: Use the system as a local backend for building dictation tools that never transmit sensitive audio data over the internet.
Example Prompts
- "Transcribe this audio file located at /Users/me/downloads/meeting.mp3 and summarize the key action items."
- "Convert this text: 'The automation workflow completed successfully' into an audio file and save it to my current directory."
- "Convert the recording at ./input_voice.wav into text so I can search for the specific mention of the budget update."
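Behind a transcription prompt like those above, an agent would presumably upload the audio file to the server. The sketch below builds a multipart request in the shape of the OpenAI transcription endpoint (`POST /v1/audio/transcriptions` with `file` and `model` fields). The base URL, the model identifier, and the `text` key in the JSON response are assumptions carried over from the OpenAI API, not confirmed details of this server.

```python
import json
import os
import urllib.request
import uuid

# Assumption: host and port are placeholders; use whatever your install reports.
BASE_URL = "http://localhost:8000"


def build_transcription_request(filename, audio_bytes, model="glm-asr-nano",
                                base_url=BASE_URL, boundary=None):
    """Build (but do not send) a multipart/form-data upload request."""
    boundary = boundary or uuid.uuid4().hex
    # Text part for the model name, then the header of the file part.
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    # Closing boundary that ends the multipart body.
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=head + audio_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path, **kwargs):
    """Upload the file at `path` and return the transcript text."""
    with open(path, "rb") as fh:
        req = build_transcription_request(os.path.basename(path), fh.read(), **kwargs)
    with urllib.request.urlopen(req) as resp:
        # Assumption: the response is JSON with a "text" field, as in the OpenAI API.
        return json.loads(resp.read())["text"]
```

Splitting request construction from sending keeps the upload format easy to inspect without a running server.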
Tips & Limitations
- Hardware Constraint: This skill requires Apple Silicon (M1, M2, M3, or M4 chips). It will not function on Intel-based Macs.
- Resource Usage: MLX models can be memory-intensive. While they are highly optimized for macOS, ensure you have sufficient RAM available when processing long audio files.
- Default Models: The server comes pre-configured with glm-asr-nano for transcription and Qwen3-TTS for speech synthesis. While these are highly efficient, you can modify the underlying scripts to swap models if your workflow requires higher accuracy or different voice characteristics.
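The hardware constraint can be checked from a script before attempting to use the skill. This is a minimal sketch using Python's standard `platform` module; the function name and its optional parameters are introduced here for illustration.

```python
import platform


def is_apple_silicon(system=None, machine=None):
    """True when running natively on macOS with an arm64 CPU."""
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"
```

Note that a Python interpreter running under Rosetta reports `x86_64`, so a `False` result on an Apple Silicon Mac may simply mean the interpreter itself is an Intel build.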
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-guoqiao-mlx-audio-server": {
      "enabled": true,
      "auto_update": true
    }
  }
}
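If your clawhub.json already lists other plugins, pasting the snippet over the file would discard them. The sketch below merges the entry into an existing configuration instead. The helper names are hypothetical, and the location of clawhub.json is not stated on this page, so the path you pass in is up to you.

```python
import json

PLUGIN_NAME = "official-guoqiao-mlx-audio-server"


def enable_plugin(config, name=PLUGIN_NAME):
    """Return a copy of `config` with the plugin enabled; other entries are kept."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[name] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


def update_file(path):
    """Read the JSON config at `path`, enable the plugin, and write it back."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(enable_plugin(config), fh, indent=2)
        fh.write("\n")
```

Keeping `enable_plugin` free of file access makes it easy to preview the merged result before overwriting anything.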
Flags: file-write, file-read, code-execution
Related Skills
mlx-stt
Speech-To-Text with MLX (Apple Silicon) and opensource models (default GLM-ASR-Nano-2512) locally.
dl
Download Video/Music from YouTube/Bilibili/X/etc.
url2pdf
Convert URL to PDF suitable for mobile reading.
uv-global
Provision and reuse a global uv environment for ad hoc Python scripts.
url2png
Convert URL to PNG suitable for mobile reading.