Video Captions
Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/ivangdavila/video-captionsWhen to Use
User needs captions or subtitles for video content. Agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.
Quick Reference
| Topic | File |
|---|---|
| Transcription engines | engines.md |
| Output formats | formats.md |
| Styling presets | styling.md |
| Platform requirements | platforms.md |
Core Rules
1. Engine Selection by Context
| Scenario | Engine | Why |
|---|---|---|
| Default (recommended) | Whisper local | 100% offline, no data leaves machine |
| Apple Silicon | MLX Whisper | Native acceleration, still local |
| Word timestamps | whisper-timestamped | DTW alignment, still local |
Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.
2. Format Selection by Platform
| Platform | Format | Notes |
|---|---|---|
| YouTube | VTT or SRT | VTT preferred |
| Netflix/Pro | TTML | Strict timing rules |
| Social (TikTok, IG) | Burn-in (ASS) | Embedded in video |
| General | SRT | Universal compatibility |
| Karaoke/effects | ASS | Advanced styling |
Ask user's target platform if not specified.
3. Professional Timing Standards
Netflix-compliant (default):
- Min duration: 5/6 second (0.833s)
- Max duration: 7 seconds
- Max chars/line: 42
- Max lines: 2
- Gap between subtitles: 2+ frames
Social media:
- Shorter segments (2-4 words)
- More frequent breaks
- Centered or dynamic positioning
4. Segmentation Rules
Break lines:
- After punctuation marks
- Before conjunctions (and, but, or)
- Before prepositions
Never separate:
- Article from noun
- Adjective from noun
- First name from last name
- Verb from subject pronoun
- Auxiliary from verb
5. Word-Level Timestamps
Use word timestamps for:
- Karaoke-style highlighting
- Precise sync verification
- TikTok/Instagram animated captions
- Quality checking transcript accuracy
Enable with --word-timestamps flag.
6. Speaker Identification
For multi-speaker content:
- Use diarization (pyannote local, or cloud APIs if configured)
- Format:
[Speaker 1]or[Name]if known - SDH format:
JOHN: What do you think?
7. Quality Verification
Before delivering:
- Check sync at start, middle, end
- Verify character limits per line
- Confirm speaker labels if multi-speaker
- Test burn-in render quality
Workflow
Basic Transcription
# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt
# Specify language
whisper video.mp4 --model turbo --language es --output_format srt
# Multiple formats
whisper video.mp4 --model turbo --output_format all
Word-Level Timestamps
# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt
# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate
Metadata
Not sure this is the right skill?
Describe what you want to build — we'll match you to the best skill from 16,000+ options.
Find the right skillPaste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-ivangdavila-video-captions": {
"enabled": true,
"auto_update": true
}
}
}Related Skills
Animations
Create performant web animations with proper accessibility and timing.
Arduino
Develop Arduino projects avoiding common wiring, power, and code pitfalls.
Bulgarian
Write Bulgarian that sounds human. Not formal, not robotic, not AI-generated.
Arabic
Write Arabic that sounds human. Not formal, not robotic, not AI-generated.
Assistant
Manage tasks, communications, and scheduling with proactive and organized support.