ClawKit Logo
ClawKitReliability Toolkit
Back to Registry
Official Verified

Video Captions

Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.

skill-install — Terminal

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ivangdavila/video-captions
Or

When to Use

User needs captions or subtitles for video content. Agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.

Quick Reference

TopicFile
Transcription enginesengines.md
Output formatsformats.md
Styling presetsstyling.md
Platform requirementsplatforms.md

Core Rules

1. Engine Selection by Context

ScenarioEngineWhy
Default (recommended)Whisper local100% offline, no data leaves machine
Apple SiliconMLX WhisperNative acceleration, still local
Word timestampswhisper-timestampedDTW alignment, still local

Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.

2. Format Selection by Platform

PlatformFormatNotes
YouTubeVTT or SRTVTT preferred
Netflix/ProTTMLStrict timing rules
Social (TikTok, IG)Burn-in (ASS)Embedded in video
GeneralSRTUniversal compatibility
Karaoke/effectsASSAdvanced styling

Ask user's target platform if not specified.

3. Professional Timing Standards

Netflix-compliant (default):

  • Min duration: 5/6 second (0.833s)
  • Max duration: 7 seconds
  • Max chars/line: 42
  • Max lines: 2
  • Gap between subtitles: 2+ frames

Social media:

  • Shorter segments (2-4 words)
  • More frequent breaks
  • Centered or dynamic positioning

4. Segmentation Rules

Break lines:

  • After punctuation marks
  • Before conjunctions (and, but, or)
  • Before prepositions

Never separate:

  • Article from noun
  • Adjective from noun
  • First name from last name
  • Verb from subject pronoun
  • Auxiliary from verb

5. Word-Level Timestamps

Use word timestamps for:

  • Karaoke-style highlighting
  • Precise sync verification
  • TikTok/Instagram animated captions
  • Quality checking transcript accuracy

Enable with --word-timestamps flag.

6. Speaker Identification

For multi-speaker content:

  • Use diarization (pyannote local, or cloud APIs if configured)
  • Format: [Speaker 1] or [Name] if known
  • SDH format: JOHN: What do you think?

7. Quality Verification

Before delivering:

  • Check sync at start, middle, end
  • Verify character limits per line
  • Confirm speaker labels if multi-speaker
  • Test burn-in render quality

Workflow

Basic Transcription

# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt

# Specify language
whisper video.mp4 --model turbo --language es --output_format srt

# Multiple formats
whisper video.mp4 --model turbo --output_format all

Word-Level Timestamps

# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt

# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate

Metadata

Stars2102
Views1
Updated2026-03-06
View Author Profile
AI Skill Finder

Not sure this is the right skill?

Describe what you want to build — we'll match you to the best skill from 16,000+ options.

Find the right skill
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ivangdavila-video-captions": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety NoteClawKit audits metadata but not runtime behavior. Use with caution.