google-gemini-media

Use the Gemini API (Nano Banana image generation, Veo video generation, Gemini TTS speech generation, and image, video, and audio understanding) to build end-to-end multimodal media workflows, with code templates that cover both generation and understanding.

Why use this skill?

Master image generation, video creation, and multimodal analysis using the Google Gemini API with the OpenClaw media skill integration.

What This Skill Does

The google-gemini-media skill provides a unified interface to Google's multimodal Gemini API. It covers six core capabilities: image generation (Nano Banana), image understanding, video generation (Veo 3.1), video understanding, speech generation (Gemini TTS), and audio understanding. By wrapping the Google Gen AI SDK, the skill lets users perform tasks ranging from text-to-image synthesis and high-fidelity video generation to file analysis, transcription, and time-stamped video evidence gathering. It gives developers and creators a single place to build end-to-end media pipelines within their agents.

Installation

To integrate this skill into your environment, use the OpenClaw CLI tool. Run the following command in your terminal:

clawhub install openclaw/skills/skills/xsir0/google-gemini-media

The skill requires Node.js 18 or higher. Set GEMINI_API_KEY as an environment variable so that requests can be authenticated. The skill depends on the @google/genai SDK, which is resolved during installation.
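The two prerequisites above can be checked before the skill runs. The following is a minimal sketch, not part of the skill or of the @google/genai SDK: the `checkEnvironment` function and the `PreflightResult` type are hypothetical names introduced here for illustration.

```typescript
// Hypothetical preflight helper (illustrative only; not shipped with the skill).
interface PreflightResult {
  ok: boolean;
  problems: string[];
}

// Minimum Node.js major version stated in the installation notes.
const MIN_NODE_MAJOR = 18;

function checkEnvironment(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): PreflightResult {
  const problems: string[] = [];

  // Accept both "18.17.0" and "v18.17.0".
  const majorText = nodeVersion.replace(/^v/, "").split(".")[0] ?? "";
  const major = Number.parseInt(majorText, 10);
  if (Number.isNaN(major) || major < MIN_NODE_MAJOR) {
    problems.push(`Node.js ${MIN_NODE_MAJOR}+ is required, found "${nodeVersion}"`);
  }

  // Treat a missing or whitespace-only key as unset.
  const key = env["GEMINI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    problems.push("GEMINI_API_KEY is not set");
  }

  return { ok: problems.length === 0, problems };
}
```

In a Node.js script this could be called as `checkEnvironment(process.versions.node, process.env)`, printing `problems` and exiting early when `ok` is false.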

Use Cases

Creative Production: Generate custom marketing visuals or promotional short-form video content using Veo 3.1.
Accessibility & Transcription: Automatically convert meeting recordings or lecture audio into searchable, time-stamped text logs.
Content Moderation/Analysis: Upload batch images or long-form videos for automated Q&A, scene classification, and summary generation.
Synthetic Media: Create high-quality, multi-speaker narration for audiobooks or video voiceovers.

Example Prompts

"Generate a 4K, 8-second video of a sunset over a neon-lit cyberpunk city using the Veo 3.1 model."
"Analyze this image and list all objects detected, then provide a short caption describing the emotional tone of the scene."
"Transcribe this audio file and provide a summary of the key discussion points with timestamps for when each topic starts."

Tips & Limitations

Inline inputs count toward the 20MB request size limit. For larger files, upload the assets with the Files API before processing them. Check your API usage quotas before running high-volume multimodal batch jobs to avoid unexpected service interruptions.
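The inline-versus-upload decision can be made from the file size alone. The sketch below rests on stated assumptions: `chooseUploadStrategy` and `UploadStrategy` are hypothetical names, 20MB is read as 20 × 1024 × 1024 bytes, and the limit is conservatively assumed to apply to the base64-encoded payload (base64 grows data by roughly 4/3). It does not call the Gemini API.

```typescript
// Hypothetical helper (illustrative only); the thresholds are assumptions, see above.
type UploadStrategy = "inline" | "files-api";

// 20MB request size limit from the tip above, read as binary megabytes (assumption).
const INLINE_REQUEST_LIMIT_BYTES = 20 * 1024 * 1024;

function chooseUploadStrategy(
  mediaBytes: number,
  promptOverheadBytes = 0,
): UploadStrategy {
  if (!Number.isFinite(mediaBytes) || mediaBytes < 0) {
    throw new RangeError("mediaBytes must be a non-negative finite number");
  }

  // Base64 encodes every 3 input bytes as 4 output characters (with padding).
  const encodedBytes = Math.ceil(mediaBytes / 3) * 4;

  // Prompt text and other request fields also count toward the total request size.
  return encodedBytes + promptOverheadBytes <= INLINE_REQUEST_LIMIT_BYTES
    ? "inline"
    : "files-api";
}
```

Under these assumptions, a raw file of about 15MB is the largest that still fits inline once encoded, so anything larger would go through the Files API first.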

Metadata

Author: @xsir0

Stars: 879

Updated: 2026-02-11

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-xsir0-google-gemini-media": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#gemini #multimodal #generative-media #veo #tts

Safety Score: 4/5

Flags: external-api, file-read
