Video Captions

Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ivangdavila/video-captions

When to Use

Use this skill when the user needs captions or subtitles for video content. The agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.

Quick Reference

Topic	File
Transcription engines	`engines.md`
Output formats	`formats.md`
Styling presets	`styling.md`
Platform requirements	`platforms.md`

Core Rules

1. Engine Selection by Context

Scenario	Engine	Why
Default (recommended)	Whisper local	100% offline, no data leaves machine
Apple Silicon	MLX Whisper	Native acceleration, still local
Word timestamps	whisper-timestamped	DTW alignment, still local

Default: Whisper local (turbo model). See `engines.md` for optional cloud alternatives.

2. Format Selection by Platform

Platform	Format	Notes
YouTube	VTT or SRT	VTT preferred
Netflix/Pro	TTML	Strict timing rules
Social (TikTok, IG)	Burn-in (ASS)	Embedded in video
General	SRT	Universal compatibility
Karaoke/effects	ASS	Advanced styling

Ask the user for the target platform if it is not specified.
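The table above can be expressed as a small lookup. This is a minimal sketch, not part of the skill: the dictionary keys and the `pick_format` name are illustrative assumptions, and SRT is used as the fallback because the table lists it as the general-purpose format.

```python
# Platform-to-format lookup mirroring the table above.
# Keys and function name are hypothetical, chosen for illustration.
PLATFORM_FORMATS = {
    "youtube": "vtt",      # VTT preferred; SRT is also accepted
    "netflix": "ttml",     # strict timing rules apply
    "tiktok": "ass",       # burned into the video
    "instagram": "ass",    # burned into the video
    "karaoke": "ass",      # advanced styling and effects
}


def pick_format(platform=None):
    """Return a subtitle format for a platform; SRT is the general fallback."""
    if not platform:
        # The rule above says to ask the user rather than guess.
        raise ValueError("target platform not specified; ask the user")
    return PLATFORM_FORMATS.get(platform.strip().lower(), "srt")
```

Raising on a missing platform keeps the "ask the user" rule explicit instead of silently defaulting.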

3. Professional Timing Standards

Netflix-compliant (default):

Min duration: 5/6 of a second (about 0.833 s)
Max duration: 7 seconds
Max chars/line: 42
Max lines: 2
Gap between subtitles: 2+ frames
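The limits above lend themselves to an automated check. The following is a rough sketch under stated assumptions: the `Cue` class and `check_cues` function are hypothetical helpers, and the 24 fps default used to convert the 2-frame gap into seconds is an assumption that should be replaced with the video's real frame rate.

```python
from dataclasses import dataclass

MIN_DURATION = 5 / 6       # seconds
MAX_DURATION = 7.0         # seconds
MAX_CHARS_PER_LINE = 42
MAX_LINES = 2
MIN_GAP_FRAMES = 2


@dataclass
class Cue:
    start: float  # seconds
    end: float    # seconds
    text: str     # lines separated by "\n"


def check_cues(cues, fps=24.0):
    """Return (cue_index, problem) tuples; an empty list means every cue passes."""
    problems = []
    min_gap = MIN_GAP_FRAMES / fps  # 2 frames expressed in seconds
    eps = 1e-6                      # tolerance for floating-point rounding
    for i, cue in enumerate(cues):
        duration = cue.end - cue.start
        if duration < MIN_DURATION - eps:
            problems.append((i, "too short"))
        if duration > MAX_DURATION + eps:
            problems.append((i, "too long"))
        lines = cue.text.split("\n")
        if len(lines) > MAX_LINES:
            problems.append((i, "too many lines"))
        if any(len(line) > MAX_CHARS_PER_LINE for line in lines):
            problems.append((i, "line too long"))
        # Compare against the previous cue to enforce the minimum gap.
        if i > 0 and cue.start - cues[i - 1].end < min_gap - eps:
            problems.append((i, "gap too small"))
    return problems
```

For example, `check_cues([Cue(0.0, 2.0, "Hello there."), Cue(2.5, 4.0, "Fine.")])` returns an empty list.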

Social media:

Shorter segments (2-4 words)
More frequent breaks
Centered or dynamic positioning

4. Segmentation Rules

Break lines:

After punctuation marks
Before conjunctions (and, but, or)
Before prepositions

Never separate:

Article from noun
Adjective from noun
First name from last name
Verb from subject pronoun
Auxiliary from verb
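A simplified sketch of these rules as a two-line breaker follows. It is illustrative only: the word lists are short samples, `break_line` is a hypothetical helper, and only the article rule from the "never separate" list is enforced, since the other rules need part-of-speech tagging that this sketch does not attempt.

```python
CONJUNCTIONS = {"and", "but", "or"}
PREPOSITIONS = {"in", "on", "at", "to", "for", "with", "from", "by", "of", "about"}
ARTICLES = {"a", "an", "the"}
PUNCTUATION = ".,;:!?"


def break_line(text, max_chars=42):
    """Split text into at most two lines, choosing the best-ranked break point."""
    text = " ".join(text.split())  # normalize whitespace
    words = text.split(" ")
    if len(text) <= max_chars or len(words) < 2:
        return [text]
    best = None
    for i in range(1, len(words)):
        first, second = " ".join(words[:i]), " ".join(words[i:])
        if len(first) > max_chars or len(second) > max_chars:
            continue
        prev_word = words[i - 1]
        next_word = words[i].lower().strip(PUNCTUATION)
        if prev_word.lower() in ARTICLES:
            continue  # never separate an article from its noun
        if prev_word[-1] in PUNCTUATION:
            rank = 0  # break after punctuation
        elif next_word in CONJUNCTIONS:
            rank = 1  # break before a conjunction
        elif next_word in PREPOSITIONS:
            rank = 2  # break before a preposition
        else:
            rank = 3  # no linguistic cue; last resort
        # Ties within a rank go to the most balanced pair of lines.
        key = (rank, abs(len(first) - len(second)))
        if best is None or key < best[0]:
            best = (key, [first, second])
    # If no valid two-line split exists, return the text unchanged.
    return best[1] if best else [text]
```

Text that cannot fit in two lines of `max_chars` is returned unchanged, so the caller would need to split it into separate cues first.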

5. Word-Level Timestamps

Use word timestamps for:

Karaoke-style highlighting
Precise sync verification
TikTok/Instagram animated captions
Quality checking transcript accuracy

In the Whisper CLI, enable with `--word_timestamps True` (underscore, with an explicit boolean value).
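Once word timings are available, they can drive both short social-style chunks and karaoke highlighting. The sketch below assumes each word is a dict with `text`, `start`, and `end` keys (times in seconds); real engine output may use different field names, so adapt accordingly. `group_words` and `karaoke_text` are hypothetical helper names. The karaoke helper emits ASS `\k` tags, whose durations are in centiseconds, and ignores silent gaps between words for simplicity.

```python
def group_words(words, max_words=3):
    """Group word timings into caption chunks of at most max_words words."""
    chunks = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        chunks.append({
            "text": " ".join(w["text"] for w in group),
            "start": group[0]["start"],  # chunk starts with its first word
            "end": group[-1]["end"],     # and ends with its last word
        })
    return chunks


def karaoke_text(words):
    """Build ASS dialogue text with one karaoke tag per word."""
    parts = []
    for w in words:
        centis = round((w["end"] - w["start"]) * 100)  # seconds to centiseconds
        parts.append("{\\k%d}%s" % (centis, w["text"]))
    return " ".join(parts)
```

A `max_words` value between 2 and 4 matches the social media segment length suggested in the timing standards.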

6. Speaker Identification

For multi-speaker content:

Use diarization (pyannote local, or cloud APIs if configured)
Format: [Speaker 1] or [Name] if known
SDH format: JOHN: What do you think?
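The labeling conventions above could be applied with a helper along these lines. This is a sketch, and `label_line` and its parameters are hypothetical; it assumes diarization has already assigned a speaker index or a known name to each line.

```python
def label_line(text, speaker_index=None, name=None, sdh=False):
    """Prefix a caption line with a speaker label in one of the styles above."""
    if sdh and name:
        return f"{name.upper()}: {text}"          # SDH style: JOHN: ...
    if name:
        return f"[{name}] {text}"                 # known name
    if speaker_index is not None:
        return f"[Speaker {speaker_index}] {text}"  # anonymous diarized speaker
    return text                                   # single-speaker content
```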

7. Quality Verification

Before delivering:

Check sync at start, middle, end
Verify character limits per line
Confirm speaker labels if multi-speaker
Test burn-in render quality

Workflow

Basic Transcription

# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt

# Specify language
whisper video.mp4 --model turbo --language es --output_format srt

# Multiple formats
whisper video.mp4 --model turbo --output_format all
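When post-processing segments before writing them out (for example after re-timing or re-segmenting), the SRT layout is simple to produce by hand. The sketch below assumes segments are dicts with `start`, `end` (seconds), and `text` keys; `srt_timestamp` and `to_srt` are hypothetical helper names, not part of any CLI shown here.

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render segments as SRT text: index, time range, text, blank line."""
    blocks = []
    for n, seg in enumerate(segments, start=1):  # SRT cues are numbered from 1
        time_range = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        blocks.append(f"{n}\n{time_range}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

Note that SRT uses a comma before the milliseconds, whereas VTT uses a period.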

Word-Level Timestamps

# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt

# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate

Metadata

Author: @ivangdavila

Stars: 2102

Updated: 2026-03-06

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ivangdavila-video-captions": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.
