flash-moe-inference
Run 397B-parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/adisinghstudent/flash-moe-inference
Flash-MoE Inference Engine
Skill by ara.so — Daily 2026 Skills collection.
Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397B-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams 209GB of expert weights from NVMe SSD on demand. Inference uses no Python and no ML frameworks, only C, Objective-C, and hand-tuned Metal shaders; Python is needed solely for the one-time weight extraction.
Requirements
- Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
- OS: macOS 26+ (Darwin 25+)
- Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
- Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
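The ~210GB free-space requirement can be checked before starting the repack. The following is a minimal sketch in C, assuming a POSIX system with statvfs; the helper name has_free_space is hypothetical and is not part of the Flash-MoE repo.

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical helper (not from the repo).
 * Returns 1 if the filesystem holding `path` has at least `needed_bytes`
 * available to unprivileged users, 0 if it does not, -1 if statvfs fails. */
static int has_free_space(const char *path, uint64_t needed_bytes) {
    struct statvfs vfs;
    if (statvfs(path, &vfs) != 0) {
        return -1;
    }
    /* f_bavail counts free blocks in units of f_frsize bytes. */
    uint64_t available = (uint64_t)vfs.f_bavail * (uint64_t)vfs.f_frsize;
    return available >= needed_bytes ? 1 : 0;
}
```

Calling has_free_space(".", 210ULL * 1000 * 1000 * 1000) from the directory that will hold packed_experts/ would report whether roughly 210GB is available there.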
Installation & Build
# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer
# Build everything
make
# Verify build artifacts
ls infer chat main
The Makefile builds the infer, chat, and main binaries from infer.m, chat.m, and main.m, and compiles the Metal shaders in shaders.metal.
Weight Preparation
Step 1: Extract non-expert weights
# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/
# Produces:
# model_weights.bin (~5.5GB, mmap'd at runtime)
# model_weights.json (tensor manifest)
# vocab.bin (vocabulary)
# tokenizer.bin (BPE tokenizer data)
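Since model_weights.bin is mmap'd at runtime, the non-expert weights are paged in by the OS rather than read up front. A minimal sketch of a read-only mapping in C follows; map_file_readonly is a hypothetical name, and the repo's actual loader may differ in flags and error handling.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not from the repo).
 * Maps a whole file read-only. On success returns the base address and
 * stores the file size in *size_out; on failure returns NULL. */
static const void *map_file_readonly(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return NULL;
    }

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (base == MAP_FAILED) {
        return NULL;
    }

    *size_out = (size_t)st.st_size;
    return base;
}
```

The caller releases the mapping with munmap(base, size) when the weights are no longer needed.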
Step 2: Pack expert weights (4-bit, production)
# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/
# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
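Because each expert lives in its own file, an engine can fetch only the experts a token is routed to. The sketch below shows one way to build the file name and read an expert with pread. It assumes zero-padded decimal indices (inferred from the XX and YYYY placeholders) and treats the file as an opaque blob; expert_path and load_expert are hypothetical names, not the repo's API.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: builds "<dir>/layer_XX_expert_YYYY.bin".
 * Zero padding to 2 and 4 digits is an assumption. Returns 0 on success,
 * -1 if the path does not fit in `out`. */
static int expert_path(char *out, size_t cap, const char *dir,
                       int layer, int expert) {
    int n = snprintf(out, cap, "%s/layer_%02d_expert_%04d.bin",
                     dir, layer, expert);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}

/* Hypothetical helper: reads up to `cap` bytes of one expert file into
 * `buf`. Returns the number of bytes read, or -1 on error. */
static ssize_t load_expert(const char *dir, int layer, int expert,
                           void *buf, size_t cap) {
    char path[1024];
    if (expert_path(path, sizeof path, dir, layer, expert) != 0) {
        return -1;
    }
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    size_t total = 0;
    while (total < cap) {
        /* pread takes an explicit offset, so no seek state is shared. */
        ssize_t got = pread(fd, (char *)buf + total, cap - total, (off_t)total);
        if (got < 0) {
            close(fd);
            return -1;
        }
        if (got == 0) {
            break; /* end of file */
        }
        total += (size_t)got;
    }
    close(fd);
    return (ssize_t)total;
}
```

Using pread rather than read keeps each call independent of the file position, which makes it straightforward to issue reads for several experts from different threads.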
Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)
# Convert 4-bit experts to 2-bit (saves ~89GB, 120GB total)
python3 metal_infer/repack_experts_2bit.py \
metal_infer/packed_experts/ \
metal_infer/packed_experts_2bit/
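To make the precision trade-off concrete, the sketch below repacks 4-bit codes into 2-bit codes by keeping only the top two bits of each code, which reduces 16 levels to 4. This is a deliberately naive illustration: the bit order, the truncation rule, and the function name are all assumptions, and repack_experts_2bit.py may use a different scheme (for example, re-fitting scales per group).

```c
#include <stddef.h>
#include <stdint.h>

/* Illustration only, not the repo's algorithm.
 * Input:  4-bit codes, two per byte, low nibble first.
 * Output: 2-bit codes, four per byte, lowest bits first.
 * Each 4-bit code is truncated to its top two bits.
 * `n_codes` must be a multiple of 4. */
static void requant_4bit_to_2bit(const uint8_t *in, uint8_t *out,
                                 size_t n_codes) {
    for (size_t i = 0; i < n_codes; i += 4) {
        uint8_t packed = 0;
        for (size_t j = 0; j < 4; j++) {
            size_t k = i + j;
            uint8_t byte = in[k / 2];
            uint8_t code4 = (k % 2 == 0) ? (uint8_t)(byte & 0x0F)
                                         : (uint8_t)(byte >> 4);
            uint8_t code2 = (uint8_t)(code4 >> 2); /* 16 levels -> 4 levels */
            packed |= (uint8_t)(code2 << (2 * j));
        }
        out[i / 4] = packed;
    }
}
```

Four codes that occupied two bytes now occupy one, so the packed code data halves while each weight keeps only a quarter of its original levels.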
Key Commands
Basic inference
cd metal_infer
# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100
# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit
# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
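The tokens/second figure is simply generated tokens divided by wall-clock time. A minimal sketch of that calculation in C follows, assuming timestamps taken with clock_gettime(CLOCK_MONOTONIC, ...); the helper names are hypothetical, and the output format of --timing is not reproduced here.

```c
#include <time.h>

/* Hypothetical helper: seconds elapsed between two monotonic readings. */
static double elapsed_seconds(struct timespec start, struct timespec end) {
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Hypothetical helper: tokens per second, or 0.0 if no time has elapsed. */
static double tokens_per_second(int tokens, double seconds) {
    return seconds > 0.0 ? (double)tokens / seconds : 0.0;
}
```

At the quoted 4.4 tokens/second, a 100-token generation takes roughly 23 seconds.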
Interactive chat with tool calling
./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default
MoE-only benchmark (measures expert throughput)
./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-adisinghstudent-flash-moe-inference": {
      "enabled": true,
      "auto_update": true
    }
  }
}