Official Verified

audio-to-text-and-video-to-text

Transcribe audio and video files into text using OpenAI's Whisper API. Use this skill whenever a user wants to convert any audio or video file to text — including MP3, MP4, WAV, M4A, OGG, WEBM, MOV, AVI, FLAC, and more. Trigger this skill for any request involving: "transcribe", "convert audio to text", "speech to text", "get transcript of", "extract audio from video", "meeting notes from recording", "subtitles", "captions", or similar. Also trigger when the user uploads or references a media file and asks what was said, discussed, or mentioned in it. If unsure whether audio/video transcription is involved, use this skill.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ahqazi-dev/audio-to-text-and-video-to-text

Transcription Skill

Converts audio and video files into clean, readable text using OpenAI's Whisper API and ffmpeg for media handling.

Overview

This skill handles the full pipeline:

  1. Media extraction — use ffmpeg to strip audio from video files and convert to a Whisper-compatible format
  2. Chunking — split large files (>25 MB) into overlapping segments to stay within API limits
  3. Transcription — send each chunk to OpenAI's Whisper API
  4. Assembly — merge chunk transcripts, adjusting timestamps, into a single clean output
  5. Post-processing — optionally clean up with Claude (punctuation, speaker labels, summaries)
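Steps 2 and 4 above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's actual implementation: the names plan_chunks and merge_segments, the 2-second overlap, the constant-bitrate assumption, and the rule for dropping duplicated overlap segments are all assumptions made for the example.

```python
def plan_chunks(duration_s, size_bytes, max_chunk_mb=20, overlap_s=2.0):
    """Return (start, end) windows in seconds, each at most max_chunk_mb.

    Assumes a roughly constant bitrate, so size scales with duration.
    """
    max_bytes = max_chunk_mb * 1024 * 1024
    if size_bytes <= max_bytes:
        return [(0.0, float(duration_s))]
    bytes_per_second = size_bytes / duration_s
    window_s = max_bytes / bytes_per_second
    if window_s <= overlap_s:
        raise ValueError("overlap must be shorter than one chunk window")
    step_s = window_s - overlap_s
    windows = []
    start = 0.0
    while True:
        end = min(start + window_s, float(duration_s))
        windows.append((start, end))
        if end >= duration_s:
            break
        start += step_s
    return windows


def merge_segments(chunks):
    """Merge per-chunk segments into one timeline.

    chunks is a list of (chunk_start_s, segments); each segment is a dict
    with "start", "end" (seconds relative to the chunk) and "text".
    Segments that begin inside audio already covered by the previous
    chunk are treated as overlap duplicates and dropped.
    """
    merged = []
    covered_until = 0.0
    for chunk_start, segments in chunks:
        for seg in segments:
            start = seg["start"] + chunk_start
            end = seg["end"] + chunk_start
            if start < covered_until:
                continue
            merged.append({"start": start, "end": end, "text": seg["text"].strip()})
            covered_until = end
    return merged
```

Overlap keeps a word that straddles a chunk boundary intact in at least one chunk; the merge step then shifts each chunk's timestamps by the chunk's start offset so the final transcript shares one clock.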

Requirements

  • ffmpeg must be installed (verify with: which ffmpeg). It is usually pre-installed in claude.ai's environment.
  • OpenAI API key stored in the environment as OPENAI_API_KEY — the user must provide this
  • Python packages: openai, pydub (install via pip if needed)
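The three requirements above can also be checked from Python before any work starts. The sketch below is illustrative only; the function name preflight and its message strings are assumptions, not part of the skill.

```python
import importlib.util
import os
import shutil


def preflight(env=None, required_packages=("openai", "pydub")):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    # ffmpeg is looked up on the PATH of the current process.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    for name in required_packages:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package missing: {name}")
    return problems
```

Returning a list rather than raising on the first failure lets the caller report every missing requirement to the user in one message.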

Quick Start

When a user provides a media file, run the transcription script:

# Install dependencies if missing
pip install openai pydub --break-system-packages -q

# Run transcription
python /home/claude/transcription/scripts/transcribe.py \
  --input "/path/to/media/file" \
  --output "/mnt/user-data/outputs/transcript.txt" \
  --api-key "$OPENAI_API_KEY"

See scripts/transcribe.py for the full implementation.
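For orientation, the core of such a script is likely a single call to the OpenAI audio transcription endpoint. The sketch below is not the contents of scripts/transcribe.py; resolve_api_key and transcribe_file are hypothetical names, and the choice of response format per model is an assumption (verbose_json with segment timestamps for whisper-1, plain json otherwise).

```python
import os


def resolve_api_key(cli_value=None, env=None):
    """Pick the key from an explicit value first, then OPENAI_API_KEY."""
    env = os.environ if env is None else env
    key = (cli_value or env.get("OPENAI_API_KEY") or "").strip()
    if not key:
        raise RuntimeError("No OpenAI API key: set OPENAI_API_KEY or pass one explicitly")
    if not key.startswith("sk-"):
        raise ValueError("OpenAI API keys are expected to start with 'sk-'")
    return key


def transcribe_file(path, api_key, model="whisper-1", language=None, prompt=None):
    """Send one audio file (25 MB or smaller) to the transcription endpoint."""
    from openai import OpenAI  # imported lazily so the helper above works without it

    client = OpenAI(api_key=api_key)
    # Assumption: only whisper-1 is asked for segment-level timestamps.
    response_format = "verbose_json" if model == "whisper-1" else "json"
    kwargs = {"model": model, "response_format": response_format}
    if language:
        kwargs["language"] = language  # ISO 639-1 code, e.g. "en"
    if prompt:
        kwargs["prompt"] = prompt  # context hint such as domain vocabulary
    with open(path, "rb") as fh:
        return client.audio.transcriptions.create(file=fh, **kwargs)
```

Reading the key from the environment inside the script, rather than passing it as a command-line argument, keeps it out of shell history and process listings.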

Supported Formats

Category  Formats
Audio     mp3, wav, m4a, ogg, flac, aac, opus, wma
Video     mp4, mov, avi, mkv, webm, wmv, m4v

ffmpeg handles extraction from any of these.
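As an illustration of the extraction step, the sketch below classifies a file by extension and builds an ffmpeg command that drops the video stream and writes compact mono audio. The helper names, the 16 kHz sample rate, and the 64k bitrate are assumptions chosen to keep uploads small; the skill's own settings may differ.

```python
from pathlib import Path

AUDIO_EXTS = {"mp3", "wav", "m4a", "ogg", "flac", "aac", "opus", "wma"}
VIDEO_EXTS = {"mp4", "mov", "avi", "mkv", "webm", "wmv", "m4v"}


def media_kind(path):
    """Return "audio", "video", or None for an unsupported extension."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in VIDEO_EXTS:
        return "video"
    return None


def ffmpeg_extract_cmd(src, dst):
    """Build an ffmpeg argument list that writes mono 16 kHz audio to dst."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", str(src),  # input media file
        "-vn",           # drop any video stream
        "-ac", "1",      # downmix to mono
        "-ar", "16000",  # resample to 16 kHz
        "-b:a", "64k",   # audio bitrate (assumed value)
        str(dst),
    ]
```

The list can be passed directly to subprocess.run(cmd, check=True); using an argument list instead of a shell string avoids quoting problems with file names that contain spaces.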

Options & Flags

Flag          Default      Description
--model       whisper-1    Transcription model to use (whisper-1, gpt-4o-transcribe)
--language    auto-detect  ISO 639-1 language code (e.g. en, ar, fr)
--format      txt          Output format: txt, srt, vtt, json
--timestamps  off          Include timestamps in output
--chunk-size  20           Max chunk size in MB (must be ≤ 25)
--prompt      none         Context hint to improve accuracy (e.g. domain vocab)
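One way these flags could be declared is with argparse, as sketched below. This is an assumed reconstruction from the table and the Quick Start command, not the script's real parser; in particular, treating --timestamps as a boolean switch and rejecting chunk sizes above 25 at parse time are guesses.

```python
import argparse


def chunk_size_mb(value):
    """Parse --chunk-size and enforce the 25 MB upper bound."""
    size = int(value)
    if not 0 < size <= 25:
        raise argparse.ArgumentTypeError("chunk size must be between 1 and 25 MB")
    return size


def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe audio or video to text")
    parser.add_argument("--input", required=True, help="path to the media file")
    parser.add_argument("--output", required=True, help="path for the transcript")
    parser.add_argument("--api-key", default=None, help="OpenAI API key")
    parser.add_argument("--model", default="whisper-1")
    parser.add_argument("--language", default=None, help="ISO 639-1 code; omit to auto-detect")
    parser.add_argument("--format", choices=["txt", "srt", "vtt", "json"], default="txt")
    parser.add_argument("--timestamps", action="store_true", help="include timestamps")
    parser.add_argument("--chunk-size", type=chunk_size_mb, default=20, help="max chunk size in MB")
    parser.add_argument("--prompt", default=None, help="context hint, e.g. domain vocabulary")
    return parser
```

For example, build_parser().parse_args(["--input", "talk.mp4", "--output", "talk.srt", "--format", "srt"]) yields a namespace with model set to whisper-1 and chunk_size set to 20.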

Output Formats

  • txt — plain text, ideal for most uses
  • srt — SubRip subtitle format (for video players)
  • vtt — WebVTT format (for web video)
  • json — full Whisper JSON with segments and timestamps

Step-by-Step Workflow

1. Check for the file

Ask the user to upload the file or provide a local path. Check:

ls /mnt/user-data/uploads/

2. Check ffmpeg and install deps

which ffmpeg && ffmpeg -version 2>&1 | head -1
pip install openai pydub --break-system-packages -q 2>&1 | tail -3

3. Get the API key

If OPENAI_API_KEY is not set in the environment, ask the user:

"Please provide your OpenAI API key — it starts with sk-. You can get one at https://platform.openai.com/api-keys"

4. Run the script

Metadata

Stars: 4473
Views: 0
Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ahqazi-dev-audio-to-text-and-video-to-text": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.