flash-moe-inference

Run 397B-parameter Mixture-of-Experts LLMs on a MacBook using pure C/Metal with SSD streaming

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/adisinghstudent/flash-moe-inference

Flash-MoE Inference Engine

Skill by ara.so — Daily 2026 Skills collection.

Flash-MoE is a pure C/Objective-C/Metal inference engine that runs Qwen3.5-397B-A17B (a 397B-parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. It streams the 209GB of expert weights from NVMe SSD on demand. Inference uses no Python and no ML frameworks, only C, Objective-C, and hand-tuned Metal shaders; Python is needed only for the one-time weight extraction.

Requirements

Hardware: Apple Silicon Mac (M3 Max or similar), 48GB+ unified memory, 1TB+ SSD with ~210GB free
OS: macOS 26+ (Darwin 25+)
Tools: Xcode Command Line Tools, Python 3.x (for weight extraction only)
Model: Qwen3.5-397B-A17B safetensors weights (download separately from HuggingFace)
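
The hardware requirements above can be checked before starting the long weight-preparation steps. The following is a minimal sketch, not part of Flash-MoE: the helper names `free_gb` and `preflight` are hypothetical, and the 210GB threshold is taken from the requirements list.

```python
import platform
import shutil

REQUIRED_FREE_GB = 210  # "~210GB free" from the requirements list


def free_gb(path="."):
    """Return free space on the volume containing `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path=".", required_gb=REQUIRED_FREE_GB):
    """Collect human-readable problems; an empty list means the checks passed."""
    problems = []
    # A native Python on Apple Silicon reports Darwin/arm64; a Python running
    # under Rosetta reports x86_64, so treat this check as a hint only.
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        problems.append("not an Apple Silicon Mac")
    available = free_gb(path)
    if available < required_gb:
        problems.append(f"only {available:.0f} GB free, need ~{required_gb:.0f} GB")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

Pass the directory where `packed_experts/` will live as `path`, since free space is measured per volume.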

Installation & Build

# Clone the repo
git clone https://github.com/danveloper/flash-moe
cd flash-moe/metal_infer

# Build everything
make

# Verify build artifacts
ls infer chat main

The Makefile compiles infer.m, chat.m, and main.m into the infer, chat, and main binaries, and compiles the Metal shaders in shaders.metal.

Weight Preparation

Step 1: Extract non-expert weights

# From the metal_infer/ directory
# Point to your downloaded Qwen3.5-397B safetensors directory
python3 extract_weights.py /path/to/Qwen3.5-397B-A17B-Instruct/

# Produces:
#   model_weights.bin   (~5.5GB, mmap'd at runtime)
#   model_weights.json  (tensor manifest)
#   vocab.bin           (vocabulary)
#   tokenizer.bin       (BPE tokenizer data)
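
To illustrate what "mmap'd at runtime" means for model_weights.bin, here is a small sketch of reading one tensor's bytes through a read-only memory map. The manifest schema is an assumption: the source does not document the layout of model_weights.json, so the `offset` and `size` keys and the helper names are hypothetical.

```python
import json
import mmap


def load_manifest(path):
    """Load the tensor manifest (schema assumed, see the note above)."""
    with open(path) as f:
        return json.load(f)


def tensor_bytes(weights_path, entry):
    """Return the raw bytes of one tensor via a read-only memory map.

    `entry` is assumed to look like {"offset": <int>, "size": <int>};
    the real manifest may use different keys.
    """
    with open(weights_path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded lazily by the OS,
        # so only the slice that is touched is actually read from disk.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = entry["offset"]
            return mm[start:start + entry["size"]]
```

With a memory map, the ~5.5GB file does not have to be copied into process memory up front; the kernel pages it in as tensors are accessed.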

Step 2: Pack expert weights (4-bit, production)

# From repo root
python3 repack_experts.py /path/to/Qwen3.5-397B-A17B-Instruct/ metal_infer/packed_experts/

# Produces packed_experts/ directory (~209GB)
# Each expert is a separate file: layer_XX_expert_YYYY.bin
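
Because each expert is its own file, streaming an expert on demand comes down to locating the file and reading it. The sketch below shows that idea in Python; it is not the engine's code, which is written in C/Objective-C. The zero-padding widths are guessed from the `XX` and `YYYY` placeholders, and `expert_path` and `read_expert` are hypothetical names.

```python
import os


def expert_path(root, layer, expert):
    """Build the file path for one expert.

    Padding widths (2 and 4 digits) are an assumption based on the
    layer_XX_expert_YYYY.bin pattern.
    """
    return os.path.join(root, f"layer_{layer:02d}_expert_{expert:04d}.bin")


def read_expert(root, layer, expert):
    """Read one expert's packed weights from disk with positional reads."""
    fd = os.open(expert_path(root, layer, expert), os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        chunks = []
        offset = 0
        # pread may return fewer bytes than requested, so loop until done.
        while offset < size:
            chunk = os.pread(fd, size - offset, offset)
            if not chunk:
                break
            chunks.append(chunk)
            offset += len(chunk)
        return b"".join(chunks)
    finally:
        os.close(fd)
```

Only the experts selected by the router for a given token need to be read this way, which is what lets the full ~209GB set stay on SSD.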

Step 3: Optional 2-bit requantization (faster but breaks JSON/tool calling)

# Convert 4-bit experts to 2-bit (saves ~89GB; ~120GB total instead of ~209GB)
python3 metal_infer/repack_experts_2bit.py \
    metal_infer/packed_experts/ \
    metal_infer/packed_experts_2bit/
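
To make the 4-bit to 2-bit trade-off concrete, here is a deliberately naive requantization sketch. It is not the algorithm used by repack_experts_2bit.py, whose format the source does not describe: the nibble order, the four-values-per-byte packing, and the function names are all assumptions, and a real requantizer would likely also recompute per-group scales.

```python
def unpack_4bit(data):
    """Split each byte into two 4-bit values, low nibble first (assumed order)."""
    values = []
    for byte in data:
        values.append(byte & 0x0F)
        values.append(byte >> 4)
    return values


def pack_2bit(values):
    """Pack 2-bit values four per byte, lowest bits first."""
    if len(values) % 4:
        raise ValueError("need a multiple of 4 values")
    out = bytearray()
    for i in range(0, len(values), 4):
        a, b, c, d = values[i:i + 4]
        out.append(a | (b << 2) | (c << 4) | (d << 6))
    return bytes(out)


def requantize_4_to_2(data):
    """Keep only the top two bits of every 4-bit value (16 levels down to 4)."""
    return pack_2bit([v >> 2 for v in unpack_4bit(data)])
```

Each weight drops from 16 representable levels to 4, so the packed payload halves. That loss of precision is a plausible reason structured output such as JSON and tool calls degrades in 2-bit mode.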

Key Commands

Basic inference

cd metal_infer

# 4-bit inference (production quality, tool calling works)
./infer --prompt "Explain quantum computing" --tokens 100

# 2-bit inference (faster, breaks JSON/tool calling)
./infer --prompt "Explain quantum computing" --tokens 100 --2bit

# Per-layer timing breakdown
./infer --prompt "Hello" --tokens 20 --timing
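
When scripting these runs, it can help to build the command line in one place. The sketch below wraps the infer binary using only the flags shown above (`--prompt`, `--tokens`, `--2bit`, `--timing`); the helper names are hypothetical, and combining `--2bit` with `--timing` in a single run is an assumption the source does not demonstrate.

```python
import subprocess


def infer_command(prompt, tokens, two_bit=False, timing=False, binary="./infer"):
    """Build the argument list for one inference run."""
    cmd = [binary, "--prompt", prompt, "--tokens", str(tokens)]
    if two_bit:
        cmd.append("--2bit")
    if timing:
        cmd.append("--timing")
    return cmd


def run_infer(prompt, tokens, **kwargs):
    """Run the binary and return its stdout; raises if it exits non-zero."""
    result = subprocess.run(
        infer_command(prompt, tokens, **kwargs),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with prompts that contain spaces or quotes. Call `run_infer` from the metal_infer directory so that `./infer` resolves.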

Interactive chat with tool calling

./chat
# Opens TUI with full tool calling support
# Uses 4-bit experts by default

MoE-only benchmark (measures expert throughput)

./main
# Runs pure expert forward-pass benchmark
# Reports tokens/sec without attention overhead

Project Structure

Read Full Documentation on GitHub

Metadata

Author: @adisinghstudent

Stars: 3809

Updated: 2026-04-05

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-adisinghstudent-flash-moe-inference": {
      "enabled": true,
      "auto_update": true
    }
  }
}
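
If clawhub.json already lists other plugins, pasting the snippet over the file would drop them. The following sketch merges the entry instead. It assumes only the JSON shape shown above; the `enable_plugin` helper and the file path passed to it are hypothetical.

```python
import json
import os

PLUGIN_ID = "official-adisinghstudent-flash-moe-inference"


def enable_plugin(config_path, plugin_id=PLUGIN_ID):
    """Add or overwrite one plugin entry, keeping any existing plugins."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as f:
            config = json.load(f)
    plugins = config.setdefault("plugins", {})
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
        f.write("\n")
    return config
```

For example, `enable_plugin("clawhub.json")` run from the directory that holds the config file.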

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.
