google-gemini-media
Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
Why use this skill?
Master image generation, video creation, and multimodal analysis using the Google Gemini API with the OpenClaw media skill integration.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/xsir0/google-gemini-media
What This Skill Does
The google-gemini-media skill provides a unified interface to Google's multimodal Gemini API. It encapsulates six core capabilities: Nano Banana image generation, Veo 3.1 video generation, Gemini TTS speech generation, and understanding modules for images, video, and audio. By abstracting the complexity of the Google Gen AI SDK, the skill lets users perform tasks ranging from text-to-image synthesis and high-fidelity video generation to file analysis, transcription, and time-stamped video evidence gathering. It serves as a central hub for developers and creators who build end-to-end media pipelines within their agents.
Installation
To integrate this skill into your environment, use the OpenClaw CLI tool. Run the following command in your terminal:
clawhub install openclaw/skills/skills/xsir0/google-gemini-media
Ensure your Node.js environment is version 18 or higher. You must set your GEMINI_API_KEY in your environment variables to authenticate requests. The skill relies on the @google/genai SDK, which will be resolved during the installation process.
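The two prerequisites above (Node.js 18 or higher and a GEMINI_API_KEY environment variable) can be checked before the skill is used. The following is a minimal sketch, not part of the skill or the SDK; the function name checkEnvironment is hypothetical and exists only for illustration.

```javascript
// Hypothetical preflight helper (an illustration, not shipped with the skill).
// Returns a list of human-readable problems; an empty list means both
// prerequisites named in the Installation section are satisfied.
function checkEnvironment(nodeVersion = process.versions.node, env = process.env) {
  const problems = [];

  // process.versions.node looks like "18.19.0"; only the major part matters here.
  const major = Number.parseInt(String(nodeVersion).split(".")[0], 10);
  if (Number.isNaN(major) || major < 18) {
    problems.push(`Node.js 18 or higher is required (found ${nodeVersion}).`);
  }

  // The key must be present and non-blank to authenticate requests.
  const key = env.GEMINI_API_KEY;
  if (typeof key !== "string" || key.trim() === "") {
    problems.push("GEMINI_API_KEY is not set.");
  }

  return problems;
}

// Report any problems for the current process.
const environmentProblems = checkEnvironment();
if (environmentProblems.length > 0) {
  console.error(environmentProblems.join("\n"));
}
```

Passing the version string and environment as parameters keeps the check easy to exercise with sample values, for example checkEnvironment("16.20.2", {}).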
Use Cases
- Creative Production: Generate custom marketing visuals or promotional short-form video content using Veo 3.1.
- Accessibility & Transcription: Automatically convert meeting recordings or lecture audio into searchable, time-stamped text logs.
- Content Moderation/Analysis: Upload batch images or long-form videos for automated Q&A, scene classification, and summary generation.
- Synthetic Media: Create high-quality, multi-speaker narration for audiobooks or video voiceovers.
Example Prompts
- "Generate a 4K, 8-second video of a sunset over a neon-lit cyberpunk city using the Veo 3.1 model."
- "Analyze this image and list all objects detected, then provide a short caption describing the emotional tone of the scene."
- "Transcribe this audio file and provide a summary of the key discussion points with timestamps for when each topic starts."
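For small files, a prompt like the ones above is typically sent together with the media as inline base64 data. The sketch below shows one way such a request body might be assembled. The helper name buildInlineContents is hypothetical, the sample bytes are placeholders, and the part shape (text plus inlineData with mimeType and data) follows the @google/genai SDK's documented inline-data format as far as I know; treat it as an assumption and verify it against the SDK version you install.

```javascript
// Hypothetical helper (illustration only): pairs a text prompt with one
// inline media file, base64-encoding the raw bytes.
function buildInlineContents(promptText, mediaBuffer, mimeType) {
  return [
    { text: promptText },
    { inlineData: { mimeType, data: mediaBuffer.toString("base64") } },
  ];
}

// Placeholder bytes stand in for a real recording read from disk.
const sampleContents = buildInlineContents(
  "Transcribe this audio file and provide a summary of the key discussion points.",
  Buffer.from("placeholder audio bytes"),
  "audio/mp3"
);
console.log(JSON.stringify(sampleContents, null, 2));
```

An array like this would be passed as the contents field of a generate-content request, for example ai.models.generateContent({ model, contents }) in @google/genai; the model name is left out here because the source does not specify one for understanding tasks.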
Tips & Limitations
When using inline inputs, remember the 20 MB request size limit. For larger files, upload assets through the Files API before processing. Check your API usage quotas before running high-volume multimodal batch jobs to avoid unexpected service interruptions.
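The choice between inline input and the Files API can be made programmatically. The sketch below is one conservative way to do it, under stated assumptions: the limit is taken as 20 * 1024 * 1024 bytes, the base64-encoded size of the file is assumed to count toward the request, and a small fixed allowance is reserved for JSON overhead. The function names and the overhead value are hypothetical.

```javascript
// Assumption: "20 MB" is interpreted as 20 * 1024 * 1024 bytes.
const INLINE_REQUEST_LIMIT_BYTES = 20 * 1024 * 1024;

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(byteLength) {
  return Math.ceil(byteLength / 3) * 4;
}

// Hypothetical helper: returns "inline" when the estimated request fits
// under the limit, otherwise "files-api".
function chooseUploadStrategy(fileBytes, promptText = "", overheadBytes = 1024) {
  const promptBytes = Buffer.byteLength(promptText, "utf8");
  const estimatedRequestBytes = base64Length(fileBytes) + promptBytes + overheadBytes;
  return estimatedRequestBytes <= INLINE_REQUEST_LIMIT_BYTES ? "inline" : "files-api";
}

console.log(chooseUploadStrategy(1 * 1024 * 1024, "Describe this image."));
console.log(chooseUploadStrategy(19 * 1024 * 1024, "Summarize this video."));
```

Because base64 inflates data by roughly a third, a file well under 20 MB on disk can still exceed the limit once encoded, which is why the estimate uses the encoded length rather than the raw file size.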
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-xsir0-google-gemini-media": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags: AI
Flags: external-api, file-read