YouTube Model Feeder
Food for your model — extract transcripts, key frames, OCR, slides, and LLM summaries from YouTube videos into structured AI-ready knowledge.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/celstnblacc/youtube-model-feeder
Stop pausing videos every 30 seconds to screenshot, paste into Obsidian, and write captions. A 20-minute tutorial shouldn't take an hour to document.
YouTube Model Feeder extracts everything from a YouTube video — timestamped transcript, key frame snapshots, OCR of code and slides, presentation slide detection, and LLM-generated summaries — and packages it into structured knowledge your AI assistant can search, reference, and reason about.
Why This Exists
The problem isn't transcription — ten tools do that. The problem is structured context. When you feed a raw transcript to a model, it has no visual context. It doesn't know what was on screen when the speaker said "as you can see here." It can't read the code in the terminal, the diagram on the slide, or the config file being edited.
YouTube Model Feeder captures all of that. The output isn't just text — it's a knowledge bundle: transcript segments aligned to timestamps, screenshots of every key moment, OCR text from code snippets and slides, and an LLM summary that ties it all together.
Combined with obsidian-semantic-search (also on ClawHub), every video you watch becomes permanently searchable by meaning in your Obsidian vault.
What It Extracts
Full Pipeline
| Step | Tool | What it produces |
|---|---|---|
| Download | yt-dlp | Video + audio + metadata (title, duration, thumbnail) |
| Transcribe | Whisper (Ollama) or YouTube captions | Timestamped transcript segments |
| Frame Extraction | FFmpeg | Key frame snapshots every 5s (configurable) |
| Slide Detection | SSIM analysis (OpenCV) | Identifies presentation slides via structural similarity between frames |
| OCR | Tesseract | Reads code, terminal output, and text from captured frames |
| LLM Summary | Ollama / OpenAI / Anthropic | Structured markdown with sections, code blocks, and key takeaways |
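The download and frame-extraction steps in the table can be sketched as command construction in Python. The exact flags the skill passes are not documented here, so the `yt-dlp` and `ffmpeg` invocations below are illustrative assumptions; the 5-second snapshot interval maps naturally onto an `fps=1/5` FFmpeg filter:

```python
# Illustrative sketch of the download + frame-extraction steps.
# Flags are assumptions, not the skill's actual invocations.

def ytdlp_command(url: str, out_dir: str) -> list[str]:
    """Download video plus metadata (title, duration, thumbnail) with yt-dlp."""
    return [
        "yt-dlp",
        "--write-info-json",   # sidecar JSON with title, duration, etc.
        "--write-thumbnail",
        "-o", f"{out_dir}/%(id)s.%(ext)s",
        url,
    ]

def ffmpeg_frames_command(video: str, out_dir: str, interval_s: int = 5) -> list[str]:
    """Extract one key-frame snapshot every `interval_s` seconds."""
    return [
        "ffmpeg",
        "-i", video,
        "-vf", f"fps=1/{interval_s}",  # one output frame per interval
        f"{out_dir}/frame_%05d.jpg",
    ]
```

Building the argument lists (rather than shell strings) keeps the commands safe to pass to `subprocess.run` without quoting issues.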
Slide Detection (Deep)
Not just frame captures — intelligent slide boundary detection:
- Layout detection — classifies video as full-frame, picture-in-picture, or split panel
- SSIM transition scan — compares consecutive frames for structural changes (threshold: SSIM < 0.85)
- LLM disambiguation — borderline transitions (0.85–0.93 SSIM) sent to LLM for classification
- Slide grouping — merges transitions into slides with enforced minimum duration (3s)
- Final-state capture — saves the last frame of each slide as JPEG
- OCR extraction — runs Tesseract on each slide image
- Transcript alignment — maps transcript segments to slide time ranges
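The steps above can be sketched in pure Python using the thresholds stated in the list (SSIM < 0.85 is a transition, 0.85–0.93 is borderline and deferred to the LLM, 3 s minimum slide duration). The LLM call is stubbed and the actual OpenCV frame scoring is omitted, so this is a sketch of the decision logic only, not the skill's implementation:

```python
# Sketch of the slide-boundary and alignment logic. Thresholds come from
# the description above; the LLM disambiguation step is a stub.

MIN_SLIDE_S = 3.0  # enforced minimum slide duration

def classify(ssim: float, llm_says_transition=lambda: False) -> bool:
    """Return True if a consecutive-frame pair marks a slide transition."""
    if ssim < 0.85:
        return True
    if ssim < 0.93:                    # borderline band: defer to the LLM
        return llm_says_transition()
    return False

def group_slides(frame_times, ssim_scores):
    """Merge transitions into (start, end) slides; merge slides under MIN_SLIDE_S."""
    boundaries = [frame_times[0]]
    for t, s in zip(frame_times[1:], ssim_scores):
        if classify(s):
            boundaries.append(t)
    boundaries.append(frame_times[-1])
    slides = []
    for start, end in zip(boundaries, boundaries[1:]):
        if end - start >= MIN_SLIDE_S:
            slides.append((start, end))
        elif slides:                   # too short: fold into previous slide
            slides[-1] = (slides[-1][0], end)
    return slides

def align_transcript(segments, slides):
    """Map (start, end, text) transcript segments to overlapping slide ranges."""
    return {
        slide: [txt for s, e, txt in segments if s < slide[1] and e > slide[0]]
        for slide in slides
    }
```

For example, frames at 0/5/10/15/20 s with a single low SSIM score at the 5→10 s pair yield two slides, `(0, 10)` and `(10, 20)`, and transcript segments are assigned to whichever slide range they overlap.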
Output Formats
Metadata
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-celstnblacc-youtube-model-feeder": {
"enabled": true,
"auto_update": true
}
}
}
Related Skills
Obsidian Semantic Search
Semantic search across your Obsidian vaults using local embeddings (Ollama + pgvector). 10 MCP tools: hybrid/semantic/keyword search, file CRUD, batch reads, live re-indexing, and a monitoring dashboard. Fully local — no API keys, no cloud, zero cost.
Git Security Scanner
Unified security scanner that catches leaked secrets, credentials, and code vulnerabilities before they reach your remote. Wraps gitleaks (400+ secret patterns) and shipguard (48+ SAST rules) into a single tool with pre-commit hooks, on-demand scans, and full git history audits.