
YouTube Model Feeder

Food for your model — extract transcripts, key frames, OCR, slides, and LLM summaries from YouTube videos into structured AI-ready knowledge.


Install via CLI (Recommended)

```shell
clawhub install openclaw/skills/skills/celstnblacc/youtube-model-feeder
```

Or add the plugin entry to your `clawhub.json` (see "Add to Configuration" below).

YouTube Model Feeder

Food for your model.

Stop pausing videos every 30 seconds to take a screenshot, paste it into Obsidian, and caption it. A 20-minute tutorial shouldn't take an hour to document.

YouTube Model Feeder extracts everything from a YouTube video — timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries — and packages it into structured knowledge your AI assistant can search, reference, and reason about.

Why This Exists

The problem isn't transcription — ten tools do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.

YouTube Model Feeder captures all of that. The output isn't just text — it's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.
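The "knowledge bundle" described above could be modeled roughly as follows. This is a minimal sketch, not the skill's actual schema: the class and field names (`TranscriptSegment`, `Frame`, `KnowledgeBundle`, `segments_at`) are assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class TranscriptSegment:
    start: float          # seconds
    end: float
    text: str

@dataclass
class Frame:
    timestamp: float      # seconds into the video
    image_path: str       # saved JPEG snapshot
    ocr_text: str = ""    # Tesseract output; empty if nothing readable

@dataclass
class KnowledgeBundle:
    title: str
    transcript: list[TranscriptSegment] = field(default_factory=list)
    frames: list[Frame] = field(default_factory=list)
    summary_markdown: str = ""

    def segments_at(self, t: float) -> list[TranscriptSegment]:
        """Transcript segments covering time t — aligns speech to a frame."""
        return [s for s in self.transcript if s.start <= t < s.end]
```

The key design point is the timestamp alignment: given any captured frame, the bundle can answer "what was being said when this was on screen."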

Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.

What It Extracts

Full Pipeline

| Step | Tool | What it produces |
| --- | --- | --- |
| Download | yt-dlp | Video + audio + metadata (title, duration, thumbnail) |
| Transcribe | Whisper (Ollama) or YouTube captions | Timestamped transcript segments |
| Frame extraction | FFmpeg | Key frame snapshots every 5s (configurable) |
| Slide detection | SSIM analysis (OpenCV) | Identifies presentation slides via structural similarity between frames |
| OCR | Tesseract | Reads code, terminal output, and text from captured frames |
| LLM summary | Ollama / OpenAI / Anthropic | Structured markdown with sections, code blocks, and key takeaways |
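The frame-extraction step in the table above can be sketched with FFmpeg's `fps` filter, which samples one frame per interval. This is an illustrative sketch, not the skill's actual internals; the function name and output naming scheme are assumptions.

```python
def frame_extract_cmd(video: str, out_dir: str, interval_s: int = 5) -> list[str]:
    """Build an ffmpeg command that saves one JPEG every `interval_s` seconds."""
    return [
        "ffmpeg", "-i", video,
        "-vf", f"fps=1/{interval_s}",   # sample one frame per interval
        "-q:v", "2",                    # high JPEG quality
        f"{out_dir}/frame_%05d.jpg",    # zero-padded sequential filenames
    ]
```

Run with `subprocess.run(frame_extract_cmd("talk.mp4", "frames"))` after creating the output directory; the resulting filenames map back to timestamps as `index * interval_s`.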

Slide Detection (Deep)

Not just frame captures — intelligent slide boundary detection:

  1. Layout detection — classifies video as full-frame, picture-in-picture, or split panel
  2. SSIM transition scan — compares consecutive frames for structural changes (threshold: SSIM < 0.85)
  3. LLM disambiguation — borderline transitions (0.85–0.93 SSIM) sent to LLM for classification
  4. Slide grouping — merges transitions into slides with enforced minimum duration (3s)
  5. Final-state capture — saves the last frame of each slide as JPEG
  6. OCR extraction — runs Tesseract on each slide image
  7. Transcript alignment — maps transcript segments to slide time ranges
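Steps 2–4 above can be sketched in a few lines using the stated thresholds (SSIM < 0.85 for a clear transition, 0.85–0.93 borderline, 3s minimum slide duration). This is one plausible reading of the pipeline, not the skill's actual code: here "enforced minimum duration" is interpreted as dropping spans shorter than the minimum, and the function names are assumptions.

```python
def classify_transition(ssim: float, hard: float = 0.85, soft: float = 0.93) -> str:
    """Classify a consecutive-frame SSIM score per the documented thresholds."""
    if ssim < hard:
        return "transition"   # clear structural change -> new slide
    if ssim < soft:
        return "borderline"   # would be sent to the LLM for disambiguation
    return "stable"           # same slide

def group_slides(transitions: list[float], video_end: float,
                 min_dur: float = 3.0) -> list[tuple[float, float]]:
    """Merge transition timestamps into (start, end) slide spans,
    discarding spans shorter than `min_dur` seconds."""
    bounds = [0.0] + sorted(transitions) + [video_end]
    return [(start, end)
            for start, end in zip(bounds, bounds[1:])
            if end - start >= min_dur]
```

Each surviving `(start, end)` span then feeds steps 5–7: capture the frame just before `end`, OCR it, and attach the transcript segments falling inside the span.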

Output Formats

Metadata

- Stars: 3875
- Views: 1
- Updated: 2026-04-07
Add to Configuration

Paste this into your `clawhub.json` to enable this plugin.

```json
{
  "plugins": {
    "official-celstnblacc-youtube-model-feeder": {
      "enabled": true,
      "auto_update": true
    }
  }
}
```
**Safety Note:** ClawKit audits metadata but not runtime behavior. Use with caution.