audio-to-text-and-video-to-text
Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text — including MP3, MP4, WAV, M4A, OGG, WEBM, MOV, AVI, FLAC, and more. Trigger this skill for any request involving: "transcribe", "convert audio to text", "speech to text", "get transcript of", "extract audio from video", "meeting notes from recording", "subtitles", "captions", or similar. Also trigger when the user uploads or references a media file and asks what was said, discussed, or mentioned in it. If unsure whether audio/video transcription is involved, use this skill.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/ahqazi-dev/audio-to-text-and-video-to-text
Transcription Skill
Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.
Overview
This skill handles the full pipeline:
- Media extraction — use ffmpeg to strip audio from video files and convert to a Whisper-compatible format
- Chunking — split large files (>25 MB) into overlapping segments to stay within API limits
- Transcription — send each chunk to OpenAI's Whisper API
- Assembly — merge chunk transcripts, adjusting timestamps, into a single clean output
- Post-processing — optionally clean up with Claude (punctuation, speaker labels, summaries)
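The chunking and assembly steps above can be sketched roughly as follows. This is an illustration only, not the code in scripts/transcribe.py: it assumes chunks are cut by duration (seconds) rather than by byte size, and the names `Segment`, `plan_chunks`, and `merge_segments` are hypothetical. Overlap duplicates are dropped with a simple rule (skip any segment that starts before the previously kept one ends), which a real implementation may well refine.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds, relative to the chunk the segment came from
    end: float
    text: str


def plan_chunks(duration, chunk_len, overlap):
    """Return (start, end) windows in seconds that cover [0, duration] with overlap."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be greater than overlap")
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        windows.append((start, end))
        if end >= duration:
            break
        # Step back by `overlap` so words cut at a boundary appear in both chunks.
        start = end - overlap
    return windows


def merge_segments(chunks):
    """Merge per-chunk segments into one timeline.

    `chunks` is an ordered list of (chunk_start_seconds, [Segment, ...]).
    Each segment is shifted by its chunk's start time; a segment that begins
    before the last kept segment ends is assumed to be an overlap duplicate.
    """
    merged = []
    for chunk_start, segments in chunks:
        for seg in segments:
            abs_start = seg.start + chunk_start
            abs_end = seg.end + chunk_start
            if merged and abs_start < merged[-1].end:
                continue
            merged.append(Segment(abs_start, abs_end, seg.text))
    return merged
```

For a 100-second file with 40-second chunks and a 5-second overlap, `plan_chunks(100, 40, 5)` yields three windows starting at 0, 35, and 70 seconds.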
Requirements
- ffmpeg must be installed (run `which ffmpeg` to verify; it's usually pre-installed in claude.ai's environment)
- OpenAI API key stored in the environment as `OPENAI_API_KEY` (the user must provide this)
- Python packages: `openai`, `pydub` (install via pip if needed)
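A small preflight check along these lines could confirm the requirements before any transcription is attempted. It is a sketch only; the function name `missing_requirements` and its message strings are hypothetical and are not taken from the skill's own script.

```python
import importlib.util
import os
import shutil


def missing_requirements(env=None, packages=("openai", "pydub")):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # shutil.which mirrors the shell's `which`: None means not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for name in packages:
        # find_spec locates a top-level package without importing it.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package missing: {name}")
    return problems


if __name__ == "__main__":
    for problem in missing_requirements():
        print(problem)
```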
Quick Start
When a user provides a media file, run the transcription script:
# Install dependencies if missing
pip install openai pydub --break-system-packages -q
# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
--input "/path/to/media/file" \
--output "/mnt/user-data/outputs/transcript.txt" \
--api-key "$OPENAI_API_KEY"
See scripts/transcribe.py for the full implementation.
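The script itself is not reproduced here. As a rough idea of what a single-chunk request might look like with the `openai` Python package (v1 client), consider the sketch below. The helper name `transcribe_chunk` is hypothetical, and passing the client in as a parameter is a choice made for this example, not necessarily how scripts/transcribe.py is structured.

```python
from pathlib import Path


def transcribe_chunk(client, path, model="whisper-1", language=None, prompt=None):
    """Send one audio file to the transcription endpoint and return its text.

    `client` is expected to be an `openai.OpenAI` instance (or anything that
    exposes the same `audio.transcriptions.create` method).
    """
    kwargs = {"model": model}
    # Only send optional parameters when they were actually provided.
    if language:
        kwargs["language"] = language
    if prompt:
        kwargs["prompt"] = prompt
    with Path(path).open("rb") as fh:
        result = client.audio.transcriptions.create(file=fh, **kwargs)
    # With the default JSON response format the result carries a `.text` field.
    return result.text
```

A caller would typically create the client with `from openai import OpenAI; client = OpenAI()`, which reads `OPENAI_API_KEY` from the environment, and then call `transcribe_chunk(client, "chunk_000.mp3", language="en")`; the file name here is a placeholder.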
Supported Formats
| Category | Formats |
|---|---|
| Audio | mp3, wav, m4a, ogg, flac, aac, opus, wma |
| Video | mp4, mov, avi, mkv, webm, wmv, m4v |
ffmpeg handles extraction from any of these.
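One way the extraction step might be driven from Python is shown below. This is a sketch under assumptions: the target of 16 kHz mono MP3 at 64 kbit/s is a choice made here to keep files small, not a setting documented by the skill, and `build_extract_command` and `extract_audio` are hypothetical names.

```python
import subprocess
from pathlib import Path


def build_extract_command(src, dst, sample_rate=16000, bitrate="64k"):
    """Build an ffmpeg command that drops video and writes mono audio to `dst`."""
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(src),           # input media (audio or video)
        "-vn",                    # discard any video stream
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample the audio
        "-b:a", bitrate,          # target audio bitrate
        str(dst),                 # output container is inferred from the extension
    ]


def extract_audio(src, dst):
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg fails."""
    subprocess.run(build_extract_command(src, dst), check=True, capture_output=True)
    return Path(dst)
```

Keeping the command builder separate from the `subprocess.run` call makes the arguments easy to inspect or log before anything is executed.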
Options & Flags
| Flag | Default | Description |
|---|---|---|
| --model | whisper-1 | Whisper model to use (whisper-1, gpt-4o-transcribe) |
| --language | auto-detect | ISO 639-1 language code (e.g. en, ar, fr) |
| --format | txt | Output format: txt, srt, vtt, json |
| --timestamps | off | Include timestamps in output |
| --chunk-size | 20 | Max chunk size in MB (must be ≤ 25) |
| --prompt | none | Context hint to improve accuracy (e.g. domain vocab) |
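For illustration, the flags in the table could be declared with `argparse` roughly like this. It is a guess at the shape of the interface, not the actual parser in scripts/transcribe.py: treating `--timestamps` as a boolean switch, rejecting chunk sizes below 1 MB, and the helper names `chunk_size_mb` and `build_parser` are all assumptions.

```python
import argparse


def chunk_size_mb(value):
    """argparse type: an integer number of megabytes between 1 and 25."""
    size = int(value)
    if not 1 <= size <= 25:
        raise argparse.ArgumentTypeError("chunk size must be between 1 and 25 MB")
    return size


def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe audio or video to text")
    parser.add_argument("--input", required=True, help="path to the media file")
    parser.add_argument("--output", required=True, help="path for the transcript")
    parser.add_argument("--api-key", default=None, help="OpenAI API key")
    parser.add_argument("--model", default="whisper-1")
    parser.add_argument("--language", default=None,
                        help="ISO 639-1 code; omit to auto-detect")
    parser.add_argument("--format", default="txt",
                        choices=["txt", "srt", "vtt", "json"])
    parser.add_argument("--timestamps", action="store_true")
    parser.add_argument("--chunk-size", type=chunk_size_mb, default=20)
    parser.add_argument("--prompt", default=None)
    return parser
```

Validating `--chunk-size` in a custom `type` function means an out-of-range value is rejected at parse time with a clear message instead of failing later at the API.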
Output Formats
- txt — plain text, ideal for most uses
- srt — SubRip subtitle format (for video players)
- vtt — WebVTT format (for web video)
- json — full Whisper JSON with segments and timestamps
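To make the subtitle formats concrete, here is a minimal sketch of rendering timed segments as SRT. The `(start, end, text)` tuple layout and the function names are assumptions made for this example; the skill's script may represent segments differently.

```python
def srt_timestamp(seconds):
    """Format seconds as HH:MM:SS,mmm (SubRip puts a comma before milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as SubRip cues numbered from 1."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(cues)
```

WebVTT output is very similar: the file begins with a `WEBVTT` header line and timestamps use a period instead of a comma before the milliseconds.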
Step-by-Step Workflow
1. Check for the file
Ask the user to upload the file or provide a local path. Check:
ls /mnt/user-data/uploads/
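The same check can be done from Python, filtering for the extensions listed under Supported Formats. This is a sketch; `find_media` and the two constant names are hypothetical.

```python
from pathlib import Path

AUDIO_EXTS = {"mp3", "wav", "m4a", "ogg", "flac", "aac", "opus", "wma"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "wmv", "m4v"}


def find_media(directory):
    """Return supported media files in `directory`, sorted by path."""
    root = Path(directory)
    if not root.is_dir():
        return []
    supported = AUDIO_EXTS | VIDEO_EXTS
    return sorted(
        p for p in root.iterdir()
        # Compare extensions case-insensitively so "TALK.MP4" also matches.
        if p.is_file() and p.suffix.lower().lstrip(".") in supported
    )


if __name__ == "__main__":
    for path in find_media("/mnt/user-data/uploads"):
        print(path)
```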
2. Check ffmpeg and install deps
which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3
3. Get the API key
If OPENAI_API_KEY is not set in the environment, ask the user:
"Please provide your OpenAI API key (it starts with sk-). You can get one at https://platform.openai.com/api-keys"
4. Run the script
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-ahqazi-dev-audio-to-text-and-video-to-text": {
      "enabled": true,
      "auto_update": true
    }
  }
}