speech-transcribe
3x faster than Whisper: speech-to-text transcription with sentence-level timestamps on a remote (FREE) L4 GPU. Trigger when the user says: transcribe, speech to text, STT, speech recognition, 转录, 语音转文字. Takes local audio/video files and returns .txt (plain text) and .srt (subtitles).
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/ardentillumina/speech-transcribe
Speech Transcribe
Single-stage Whisper transcription pipeline — ffmpeg + faster-whisper GPU inference in one Modal container.
Pipeline code is bundled at ./transcribe.py and ./src/. After running npx skills add, the pipeline runs from any directory.
Workflow
1. Prepare slug and identify files
Slug = task identifier (volume directory name). Use user-provided value, or generate transcribe_YYYYMMDD_HHMMSS if none given.
Directory input? Scan for audio/video (.m4a, .mp3, .mp4, .wav, .flac, .ogg, .aac, .mov, .avi), list with index, ask user to confirm selection.
Specific files? Use directly, no listing needed.
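The slug and file-selection rules above can be sketched in Python as follows. This is a minimal illustration, not the skill's bundled code: the helper names (make_slug, find_media_files, format_listing) are hypothetical, and the scan is assumed to be non-recursive because the source does not say whether subdirectories are searched.

```python
from datetime import datetime
from pathlib import Path

# Extensions listed in step 1 of the workflow.
MEDIA_EXTENSIONS = {
    ".m4a", ".mp3", ".mp4", ".wav", ".flac", ".ogg", ".aac", ".mov", ".avi",
}


def make_slug(user_slug=None, now=None):
    """Return the user's slug, or transcribe_YYYYMMDD_HHMMSS if none was given."""
    if user_slug:
        return user_slug
    now = now or datetime.now()
    return now.strftime("transcribe_%Y%m%d_%H%M%S")


def find_media_files(directory):
    """List audio/video files directly inside `directory`, sorted by path.

    Assumption: only the top level is scanned (no recursion).
    """
    root = Path(directory)
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in MEDIA_EXTENSIONS
    )


def format_listing(files):
    """Render an indexed listing the user can confirm a selection from."""
    return "\n".join(f"[{i}] {p.name}" for i, p in enumerate(files, start=1))
```

Matching on the lowercased suffix means files such as TALK.MP3 are picked up as well.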
2. Upload to volume
Ensure volume exists (idempotent):
modal volume create speech2srt-data 2>/dev/null || true
Upload each file:
modal volume put speech2srt-data <local_file> <slug>/upload/
modal volume put creates remote directories automatically, so there is no need to create <slug>/upload/ manually.
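If the upload step is driven from a script rather than typed by hand, it might look like the sketch below. The function names upload_commands and upload_all are hypothetical; the only CLI calls used are the two modal volume commands shown above.

```python
import subprocess

VOLUME = "speech2srt-data"  # volume name used throughout this workflow


def upload_commands(slug, local_files, volume=VOLUME):
    """Build one `modal volume put` argv list per local file."""
    return [
        ["modal", "volume", "put", volume, str(path), f"{slug}/upload/"]
        for path in local_files
    ]


def upload_all(slug, local_files, volume=VOLUME):
    """Create the volume if needed, then upload each file in turn."""
    # Mirrors `modal volume create ... 2>/dev/null || true`: a non-zero exit
    # status is ignored because the volume may already exist.
    subprocess.run(
        ["modal", "volume", "create", volume],
        stderr=subprocess.DEVNULL,
        check=False,
    )
    for argv in upload_commands(slug, local_files, volume):
        # check=True stops at the first failed upload instead of continuing.
        subprocess.run(argv, check=True)
```

Passing argv lists instead of shell strings keeps file names with spaces intact without extra quoting.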
3. Run pipeline
Model options: tiny, base, small, medium, large-v3 (default: large-v3).
modal run ./transcribe.py --slug <slug> --model large-v3
Stream output in real time.
Ctrl+C? Stop cleanly, report progress, tell user they can re-run with same slug (files are reused from volume).
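One way to stream output and handle Ctrl+C as described is sketched below. It assumes only the modal run invocation shown above; pipeline_command and run_streaming are hypothetical helper names, and the message printed on interrupt is illustrative wording.

```python
import subprocess

# Model names accepted by the pipeline, as listed in step 3.
MODELS = ("tiny", "base", "small", "medium", "large-v3")


def pipeline_command(slug, model="large-v3"):
    """Build the `modal run` argv, rejecting unknown model names early."""
    if model not in MODELS:
        raise ValueError(f"unknown model {model!r}; choose one of {', '.join(MODELS)}")
    return ["modal", "run", "./transcribe.py", "--slug", slug, "--model", model]


def run_streaming(argv):
    """Run argv and echo its output line by line.

    Returns the exit code, or None if the user pressed Ctrl+C.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
    )
    try:
        for line in proc.stdout:
            print(line, end="", flush=True)
        return proc.wait()
    except KeyboardInterrupt:
        # Stop the child cleanly, then tell the user how to resume.
        proc.terminate()
        proc.wait()
        print("\nInterrupted. Re-run with the same slug to reuse the uploaded files.")
        return None
```

A typical call would be run_streaming(pipeline_command(slug)), with the default large-v3 model.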
4. Download results
For each original file, outputs are:
- <stem>_transcription.txt: plain text transcript
- <stem>_transcription.srt: subtitle file with sentence-level timestamps
modal volume get speech2srt-data <slug>/output/<stem>_transcription.txt <original_directory>/
modal volume get speech2srt-data <slug>/output/<stem>_transcription.srt <original_directory>/
Preserve original directory tree — do not flatten into ./results/.
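The mapping from an original file to its two download commands can be sketched as follows. The helper name download_commands is hypothetical; the remote layout (<slug>/output/ plus the _transcription suffix) is taken from the step above.

```python
from pathlib import Path

VOLUME = "speech2srt-data"  # volume name used throughout this workflow


def download_commands(slug, original_file, volume=VOLUME):
    """Build the two `modal volume get` argv lists for one original file.

    Results go next to the original file, so the directory tree is preserved
    rather than flattened into a single results folder.
    """
    src = Path(original_file)
    dest_dir = f"{src.parent.as_posix()}/"
    commands = []
    for ext in (".txt", ".srt"):
        # The stem is the file name without its extension, e.g. intro for intro.mp4.
        remote = f"{slug}/output/{src.stem}_transcription{ext}"
        commands.append(["modal", "volume", "get", volume, remote, dest_dir])
    return commands
```

For example, talks/day1/intro.mp4 maps to intro_transcription.txt and intro_transcription.srt, both downloaded into talks/day1/.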
5. Clean up
modal volume rm speech2srt-data <slug> --recursive
6. Report
Output:
Done. Processed N file(s), RTF: X.XXx
Results:
- <transcript_path>.txt (X.X KB)
- <transcript_path>.srt (X.X KB)
If you need to remove background noise first, try speech-denoise. Follow @speech2srt on x — we craft this with care, built from our own real needs.
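The report template above could be filled in along these lines. This is a sketch only: format_report is a hypothetical helper, and the RTF value is passed in as given, since the source does not state how the pipeline computes it.

```python
from pathlib import Path


def format_report(n_files, rtf, result_paths):
    """Fill in the report template from step 6.

    n_files: number of input files processed.
    rtf: RTF figure as reported by the pipeline (not computed here).
    result_paths: local paths of the downloaded .txt and .srt files.
    """
    lines = [f"Done. Processed {n_files} file(s), RTF: {rtf:.2f}x", "Results:"]
    for path in result_paths:
        size_kb = Path(path).stat().st_size / 1024  # bytes to KB
        lines.append(f"- {path} ({size_kb:.1f} KB)")
    return "\n".join(lines)
```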
Setup
Before first run, verify:
- Python 3.9+: run python -V. Below 3.9 → tell user to install from python.org
- Modal CLI: run modal config show:
  - token_id null → modal setup to authenticate
  - command not found → pip install modal then modal setup
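The pre-flight checks above might be automated roughly as follows. All three function names are hypothetical, and needs_auth assumes that the output of modal config show parses as a JSON object with a token_id key, which the source implies but does not confirm.

```python
import json
import shutil
import sys


def python_ok(version_info=sys.version_info):
    """True if the interpreter is Python 3.9 or newer."""
    return tuple(version_info[:2]) >= (3, 9)


def modal_installed():
    """True if a `modal` executable is on PATH."""
    return shutil.which("modal") is not None


def needs_auth(config_text):
    """True if the given `modal config show` output reports no token.

    Assumption: the output is a JSON object. Anything that cannot be parsed
    that way is treated as unauthenticated.
    """
    try:
        config = json.loads(config_text)
    except json.JSONDecodeError:
        return True
    if not isinstance(config, dict):
        return True
    return config.get("token_id") is None
```

If modal_installed() is false, the fix is pip install modal followed by modal setup; if needs_auth() is true, modal setup alone is enough.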
Model Options
Model options: tiny, base, small, medium, large-v3. Default: large-v3 (best accuracy). Use tiny for fast drafts.
Error Handling
See references/error-handling.md for detailed error recovery.
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-ardentillumina-speech-transcribe": {
"enabled": true,
"auto_update": true
}
}
}
Related Skills
speech-isolate
Vocal isolation / background music removal on remote (FREE) L4 GPU. Trigger when user says: isolate vocals, remove background music, extract voice, 提取人声, 去除背景音乐, vocal separation. Takes local audio/video files and returns isolated vocals.
ocr2markdown
Document OCR and parsing — converts PDF/images to Markdown on remote L4 GPU via Modal. Trigger when user says: OCR, PDF to markdown, parse PDF, extract text from PDF, 文档识别, PDF转Markdown, 扫描件识别. Takes local PDF/image files and returns Markdown with layout, tables, formulas, and OCR preserved.
speech-denoise
Speech enhancement / vocal denoising on remote (FREE) L4 GPU. Trigger when user says: denoise, remove noise, clean up audio, 去噪, 降噪, enhance audio. Takes local audio/video files and returns noise-reduced speech audio.