Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

Overview

This skill provides three main workflows:

Speaker separation - Extract per-speaker audio from multi-speaker recordings
Voice comparison - Measure speaker similarity between two audio files
Audio processing - Segment extraction and voice isolation

Prerequisites

Set Up Virtual Environment

Run once to create the venv and install dependencies:

bash scripts/setup_venv.sh

Default venv location: ./.venv

Requirements:

Python 3.9+
ffmpeg (on macOS: brew install ffmpeg)
Hugging Face token (set as the HF_TOKEN environment variable)

Scripts

1. Speaker Separation: `diarize_and_slice_mps.py`

Separate speakers from multi-speaker audio:

# Basic usage
HF_TOKEN=<your-hf-token> \
  /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir /path/to/output \
  --prefix MyShow

# With speaker constraints
HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
  --input audio.mp3 \
  --outdir ./out \
  --min-speakers 2 \
  --max-speakers 5 \
  --pad-ms 100

Process:

Converts input to 16kHz mono WAV
Runs Demucs vocal/background separation (optional, for cleaner input)
Runs pyannote speaker diarization (MPS-accelerated)
Extracts concatenated per-speaker WAV files
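
The last step above (collecting each speaker's segments before concatenating them) can be sketched in plain Python. This is a minimal illustration, not the script's actual code: the function name `plan_speaker_slices`, the `(start, end, speaker)` tuple layout, and the assumption that `--pad-ms` widens each segment on both sides before overlapping spans are merged are all hypothetical.

```python
from collections import defaultdict


def plan_speaker_slices(segments, pad_ms=0, total_duration=None):
    """Group (start, end, speaker) segments by speaker, pad each span,
    and merge spans that overlap after padding.

    Hypothetical helper: the padding and merging behaviour is an assumption
    about how --pad-ms could work, not taken from the script.
    """
    pad = pad_ms / 1000.0
    by_speaker = defaultdict(list)
    for start, end, speaker in segments:
        lo = max(0.0, start - pad)  # never start before the file begins
        hi = end + pad
        if total_duration is not None:
            hi = min(total_duration, hi)  # never run past the file end
        by_speaker[speaker].append((lo, hi))

    merged = {}
    for speaker, spans in by_speaker.items():
        spans.sort()
        out = [spans[0]]
        for lo, hi in spans[1:]:
            last_lo, last_hi = out[-1]
            if lo <= last_hi:
                # Overlapping or touching spans collapse into one.
                out[-1] = (last_lo, max(last_hi, hi))
            else:
                out.append((lo, hi))
        merged[speaker] = out
    return merged
```

Each speaker's merged spans would then be cut from the 16kHz WAV in order and written back to back into one file.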

Output:

<prefix>_speaker1.wav, <prefix>_speaker2.wav, etc. (one per detected speaker)
diarization.rttm (time-stamped speaker segments)
segments.jsonl (JSON segments metadata)
meta.json (pipeline info and speaker index)
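
To inspect the diarization result programmatically, the RTTM file can be read with a few lines of Python. The sketch below assumes the standard RTTM layout (`SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <speaker> ...`). The helper names `parse_rttm` and `speaking_time` are hypothetical, and the layout of segments.jsonl and meta.json is not covered because it is not documented here.

```python
def parse_rttm(text):
    """Parse RTTM text into a list of {"speaker", "start", "end"} dicts.

    Assumes standard RTTM SPEAKER lines: field 4 is the onset in seconds,
    field 5 the duration, and field 8 the speaker label.
    """
    segments = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue  # skip blank lines and non-speaker records
        onset = float(fields[3])
        duration = float(fields[4])
        segments.append(
            {"speaker": fields[7], "start": onset, "end": onset + duration}
        )
    return segments


def speaking_time(segments):
    """Sum the speaking time in seconds for each speaker label."""
    totals = {}
    for seg in segments:
        length = seg["end"] - seg["start"]
        totals[seg["speaker"]] = totals.get(seg["speaker"], 0.0) + length
    return totals
```

For example, `speaking_time(parse_rttm(open("diarization.rttm").read()))` gives a quick check that each detected speaker has enough audio to be useful.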

Important:

Always pass the Hugging Face token via the HF_TOKEN environment variable, never as a CLI argument
MPS first, CPU fallback - the script prefers the Apple Metal (MPS) GPU backend and falls back to CPU if MPS is unavailable
Default output: ./separated/
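
When the script is driven from Python, the token rule above can be kept by putting the token in the child process environment rather than in the argument list, since command-line arguments are visible in process listings and shell history. The following is a minimal sketch: the helper name `build_diarize_invocation` is hypothetical, and only flags shown in the usage examples above are used.

```python
import os


def build_diarize_invocation(input_path, outdir, token, prefix=None,
                             min_speakers=None, max_speakers=None,
                             pad_ms=None, python="python"):
    """Return (argv, env) for running diarize_and_slice_mps.py.

    Hypothetical helper: the token goes into env["HF_TOKEN"] only and
    never into argv.
    """
    argv = [python, "scripts/diarize_and_slice_mps.py",
            "--input", input_path, "--outdir", outdir]
    optional = [("--prefix", prefix), ("--min-speakers", min_speakers),
                ("--max-speakers", max_speakers), ("--pad-ms", pad_ms)]
    for flag, value in optional:
        if value is not None:
            argv += [flag, str(value)]
    env = dict(os.environ, HF_TOKEN=token)  # copy, then add the token
    return argv, env
```

The pair can be passed straight to `subprocess.run(argv, env=env, check=True)`.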

2. Voice Comparison: `compare_voices.py`

Measure similarity between two voice samples using Resemblyzer:

# Basic comparison
python scripts/compare_voices.py \
  --audio1 sample1.wav \
  --audio2 sample2.wav

# JSON output
python scripts/compare_voices.py \
  --audio1 reference.wav \
  --audio2 clone.wav \
  --threshold 0.85 \
  --json

# Exit code: 0 if the similarity score meets --threshold (pass), 1 otherwise (fail)

Scores:

< 0.75 = Different speakers
0.75-0.84 = Likely same speaker
0.85+ = Excellent match (ideal for voice cloning validation)
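
The score bands above, together with the pass/fail exit code, can be expressed as a small sketch. It assumes the score is a cosine similarity between two speaker embeddings, which is the usual way Resemblyzer embeddings are compared. Whether compare_voices.py computes it exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def interpret_score(score):
    """Map a similarity score onto the bands listed above."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"


def exit_code(score, threshold):
    """0 if the score meets the threshold (pass), 1 otherwise (fail)."""
    return 0 if score >= threshold else 1
```

With `--threshold 0.85`, a score of 0.80 would be reported as "likely same speaker" but would still fail the check and return exit code 1.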

Use cases:

Voice clone quality assessment (compare clone vs. original)
Speaker verification (authenticate speaker identity)
Validate speaker separation (confirm separated speakers are distinct)

See references/scoring-guide.md for detailed interpretation.

3. Audio Trimming

Use ffmpeg directly for segment extraction:

# Extract 10-second segment starting at 5 seconds
ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

# Extract vocals only with Demucs (before diarization)
demucs --two-stems vocals --out ./separated input.mp3
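
When many segments need trimming, the ffmpeg call can be assembled from Python. This is a minimal sketch with a hypothetical helper name, `ffmpeg_trim_args`. It reproduces the command shown above. The `copy_codec` switch is added here because stream copy (`-c copy`) cannot cut at arbitrary positions and snaps to packet boundaries, while dropping it makes ffmpeg re-encode for a more exact cut.

```python
def ffmpeg_trim_args(src, dst, start_s, duration_s, copy_codec=True):
    """Build the ffmpeg argument list for extracting one segment.

    Hypothetical helper mirroring:
        ffmpeg -i SRC -ss START -t DURATION -c copy DST
    """
    args = ["ffmpeg", "-i", src, "-ss", str(start_s), "-t", str(duration_s)]
    if copy_codec:
        args += ["-c", "copy"]  # fast, no re-encode, but coarser cut points
    args.append(dst)
    return args
```

The resulting list can be run with `subprocess.run(args, check=True)`.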
