video-understanding
Analyze videos with Google Gemini multimodal AI. Download from any URL (Loom, YouTube, TikTok, Vimeo, Twitter/X, Instagram, 1000+ sites) and get transcripts, descriptions, and answers to questions. Use when asked to watch, analyze, summarize, or transcribe a video, or answer questions about video content. Triggers on video URLs or requests involving video understanding.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/bill492/video-understanding
What This Skill Does
The video-understanding skill uses Google Gemini's multimodal AI to analyze video content from across the web, whether that is a long-form YouTube tutorial, a short TikTok clip, or a product demonstration on Loom. It automates the pipeline of downloading, processing, and interpreting video data, and returns a structured JSON response that includes a timestamped transcript, a visual description, a concise summary, and speaker identification.
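As an illustration of consuming that JSON response, here is a minimal Python sketch. The field names (`summary`, `description`, `speakers`, `transcript`, and the per-entry `timestamp`, `speaker`, `text`) and the sample payload are hypothetical stand-ins; the skill's actual schema is not documented here, so adjust the keys to match the real output.

```python
import json

# Hypothetical sample payload; the real field names may differ.
SAMPLE = """
{
  "summary": "A short walkthrough of the settings menu.",
  "description": "Screen recording of a web app.",
  "speakers": ["Speaker 1"],
  "transcript": [
    {"timestamp": "00:00", "speaker": "Speaker 1", "text": "Open settings."},
    {"timestamp": "00:12", "speaker": "Speaker 1", "text": "Click Billing."}
  ]
}
"""


def transcript_lines(raw: str) -> list[str]:
    """Render each transcript entry as '[timestamp] speaker: text'."""
    data = json.loads(raw)
    lines = []
    # A missing "transcript" key is treated as an empty transcript.
    for entry in data.get("transcript", []):
        lines.append(f'[{entry["timestamp"]}] {entry["speaker"]}: {entry["text"]}')
    return lines


if __name__ == "__main__":
    for line in transcript_lines(SAMPLE):
        print(line)
```

Keeping the parsing in one small function makes it easy to swap in the real key names once you have inspected an actual response.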
Installation
To integrate this skill into your OpenClaw agent, execute the following command in your terminal:
clawhub install openclaw/skills/skills/bill492/video-understanding
Ensure that you have yt-dlp and ffmpeg installed on your system (e.g., via brew install yt-dlp ffmpeg). Additionally, you must provide a valid GEMINI_API_KEY as an environment variable to authorize the connection to Google's multimodal AI models.
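The prerequisites above can be checked before the first run. The following is a minimal Python sketch, not part of the skill itself: the function name `missing_prerequisites` is hypothetical, and it only verifies that the two tools are on PATH and that the environment variable is non-empty, not that the API key is valid.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def missing_prerequisites(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return human-readable problems; an empty list means ready to run."""
    problems = []
    # Both command-line tools must be discoverable on PATH.
    for tool in ("yt-dlp", "ffmpeg"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # An unset or empty key counts as missing.
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print(problem)
```

The `env` and `which` parameters exist only so the check can be exercised without touching the real environment.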
Use Cases
- Content Repurposing: Generate written blog posts or social media copy from recorded video meetings.
- Learning & Research: Quickly extract key takeaways or answers to specific questions from educational videos without watching the entire duration.
- Content Moderation/Compliance: Identify visual elements, UI patterns, or speakers within a video library.
- Accessibility: Create automated transcripts and visual descriptions for archived media that lacks metadata.
Example Prompts
- "Watch this YouTube tutorial on Python decorators and give me a 3-sentence summary of the main takeaway."
- "Can you watch this Loom video and list every step the user took in the settings menu?"
- "Transcribe this video from Twitter and identify all the speakers mentioned in the conversation."
Tips & Limitations
- YouTube Efficiency: For YouTube links, the skill skips the download step and passes the URL directly to Gemini, so no local download or upload is needed.
- Handling Large Files: The Gemini File API supports large video files, but upload time depends on your connection speed, so expect very large files to take a while to upload on slow links.
- Customization: If the default JSON output is too verbose, use the -p flag to override the prompt and get concise, raw text tailored to your specific needs.
- Cost Considerations: Since this relies on Gemini's API, ensure your account has sufficient quota if you intend to process a high volume of long-duration videos.
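The YouTube tip above implies a routing decision: pass YouTube URLs straight through and download everything else first. A minimal Python sketch of that decision follows. The host list and the function name `needs_download` are assumptions for illustration; the skill's actual detection logic is not shown in this document.

```python
from urllib.parse import urlparse

# Assumed set of YouTube hostnames; the skill may recognize others.
YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be"}


def needs_download(url: str) -> bool:
    """Return False for YouTube URLs (passed directly), True otherwise."""
    host = (urlparse(url).hostname or "").lower()
    return host not in YOUTUBE_HOSTS


if __name__ == "__main__":
    for url in (
        "https://www.youtube.com/watch?v=abc",
        "https://www.loom.com/share/abc",
    ):
        route = "download first" if needs_download(url) else "pass URL directly"
        print(f"{url}: {route}")
```

Matching on the parsed hostname rather than a substring avoids misrouting URLs that merely contain "youtube" in their path or query string.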
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-bill492-video-understanding": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: network-access, file-write, file-read, external-api
Related Skills
skill-audit
Audit all installed skills for quality, duplicates, structural issues, and best-practice compliance. Use when asked to review, audit, lint, or check skills for problems. Triggers on "audit skills", "skill quality", "check my skills", "skill duplicates", "skill hygiene".
browser-read-x
Extract the main X/Twitter post or article content from a page that is already open in the browser (using browser act evaluate).
cf-crawl
Crawl websites using Cloudflare Browser Rendering /crawl API. Async multi-page crawl with markdown/HTML/JSON output, link following, pattern filtering, and AI-powered structured data extraction. Use when crawling entire sites or multiple pages, building knowledge bases, extracting structured data from websites, or when web_fetch is insufficient (JS rendering, multi-page, authenticated crawls).
sub-agents
Spawn and coordinate sub-agent sessions for parallel work. Use when delegating tasks (research, code, analysis), routing to appropriate models, or managing multi-agent workflows. Trigger on "spawn", "sub-agent", "delegate", "parallel tasks", or when a task would benefit from a different model.
browser-read
Extract readable content from browser pages as markdown. Use when web_fetch fails (bot protection, auth-required pages, Twitter/X, LinkedIn) and you already have the page open in the browser.