What This Skill Does

The mlx-local-inference skill empowers your OpenClaw agent to leverage Apple Silicon's hardware acceleration for private, high-performance AI inference directly on your machine. By bypassing cloud APIs, this skill ensures your sensitive data stays local while providing low-latency execution for tasks including text generation, vision-language analysis, speech transcription, and OCR. It integrates the oMLX gateway for continuous LLM batching and utilizes 'uv' for transient Python execution of specialized libraries like mlx-lm, mlx-vlm, and mlx-audio.

Installation

To integrate this capability into your agent, use the ClawKit CLI to install the dependency from the centralized repository. Run the following command in your terminal:

clawhub install openclaw/skills/skills/bendusy/mlx-local-inference

Once installed, ensure your local models are correctly placed in your ~/models directory, as the skill expects specific weight files for Qwen and PaddleOCR configurations. Verify your oMLX service status via curl http://localhost:8000/v1/models.

Use Cases

This skill is ideal for workflows requiring data sovereignty and offline availability. Use it for:

Private Document Analysis: Extract text from scanned PDFs or images using local OCR without uploading to a third-party server.
Real-time Audio Transcription: Convert local meeting recordings or voice memos to text using quantized ASR models.
Latency-Critical Agent Flows: Execute complex LLM reasoning chains locally to avoid network bottlenecks.
High-Volume Embedding Tasks: Generate vector representations for local search and retrieval augmented generation (RAG) tasks.

Example Prompts

"Analyze the attached invoice image using local OCR and extract the total amount and merchant name."
"Transcribe the file 'meeting_notes.wav' located in my downloads folder using the local ASR engine."
"Summarize this private legal document using the Qwen3.5-35B local model, ensuring the data never leaves my Mac."

Tips & Limitations

Performance: Always ensure your Mac is plugged into power during heavy inference, as Apple Silicon may throttle performance to preserve battery life.
Resource Management: The oMLX stack is optimized for continuous batching, but loading large models like Qwen3.5-35B consumes significant unified memory. Close memory-intensive apps like browsers or creative suites when running large models.
Versioning: For ASR and OCR tasks requiring uv, always use --python 3.11 to prevent potential SIGSEGV errors associated with OpenMP and Python's threading model on macOS.
Privacy: This is a purely offline-first tool. If you require zero data egress, ensure your firewall is configured to block unexpected outbound requests from the agent process.

mlx-local-inference

Install via CLI (Recommended)

What This Skill Does

Installation

Use Cases

Example Prompts

Tips & Limitations

Metadata

Tags(AI)