Official Verified media Safety 4/5

google-gemini-media

Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".

Why use this skill?

Master image generation, video creation, and multimodal analysis using the Google Gemini API with the OpenClaw media skill integration.

What This Skill Does

The google-gemini-media skill provides a unified interface to Google's multimodal Gemini API. It covers six core capabilities: Nano Banana image generation, image understanding, Veo 3.1 video generation, video understanding, speech generation (TTS), and audio understanding. By wrapping the Google Gen AI SDK, the skill lets users perform tasks ranging from text-to-image synthesis and high-fidelity video generation to file analysis, transcription, and gathering time-stamped evidence from video. It is intended as a central hub for developers and creators building end-to-end media pipelines within their agents.

Installation

To add this skill to your environment, use the OpenClaw CLI (clawhub). Run the following command in your terminal:

clawhub install openclaw/skills/skills/xsir0/google-gemini-media

Node.js 18 or higher is required. Set GEMINI_API_KEY in your environment variables to authenticate requests. The skill depends on the @google/genai SDK, which is resolved during installation.
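
The two prerequisites above (Node.js 18+ and a GEMINI_API_KEY) can be checked before any request is made. The following is a minimal sketch, not part of the skill or the SDK: checkEnvironment and MIN_NODE_MAJOR are hypothetical names introduced here for illustration.

```typescript
// Hypothetical preflight helper; not provided by the skill or by @google/genai.
const MIN_NODE_MAJOR = 18; // minimum Node.js major version stated in the listing

// Returns a list of human-readable problems; an empty list means the
// environment satisfies both documented prerequisites.
function checkEnvironment(
  nodeVersion: string,
  env: Record<string, string | undefined>,
): string[] {
  const problems: string[] = [];

  // Accept both "v20.11.0" and "20.11.0" and read the major version.
  const match = /^v?(\d+)/.exec(nodeVersion);
  const major = match ? Number(match[1]) : Number.NaN;
  if (Number.isNaN(major) || major < MIN_NODE_MAJOR) {
    problems.push(`Node.js ${MIN_NODE_MAJOR}+ required, found "${nodeVersion}"`);
  }

  // The key must be present and non-blank.
  const key = env["GEMINI_API_KEY"];
  if (key === undefined || key.trim() === "") {
    problems.push("GEMINI_API_KEY is not set");
  }

  return problems;
}
```

In a Node.js script this could be called as checkEnvironment(process.version, process.env), stopping early if the returned list is not empty.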

Use Cases

  • Creative Production: Generate custom marketing visuals or promotional short-form video content using Veo 3.1.
  • Accessibility & Transcription: Automatically convert meeting recordings or lecture audio into searchable, time-stamped text logs.
  • Content Moderation/Analysis: Upload batch images or long-form videos for automated Q&A, scene classification, and summary generation.
  • Synthetic Media: Create high-quality, multi-speaker narration for audiobooks or video voiceovers.

Example Prompts

  1. "Generate a 4K, 8-second video of a sunset over a neon-lit cyberpunk city using the Veo 3.1 model."
  2. "Analyze this image and list all objects detected, then provide a short caption describing the emotional tone of the scene."
  3. "Transcribe this audio file and provide a summary of the key discussion points with timestamps for when each topic starts."

Tips & Limitations

Inline inputs count toward a 20MB limit on total request size. For larger files, upload the assets through the Files API before processing. Check your API usage quotas before running high-volume multimodal batch jobs to avoid unexpected service interruptions.
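
The inline-versus-Files-API decision above can be sketched as a size check. This is an illustration under two assumptions: that "20MB" means 20 * 1024 * 1024 bytes, and that inline media is base64-encoded in the request body, which inflates it by about one third. The names chooseUploadStrategy, base64Length, and INLINE_LIMIT_BYTES are hypothetical and not part of the skill or the SDK.

```typescript
// Assumption: "20MB" is read as 20 * 1024 * 1024 bytes; adjust if the API documents otherwise.
const INLINE_LIMIT_BYTES = 20 * 1024 * 1024;

type UploadStrategy = "inline" | "files-api";

// Base64 encodes every 3 input bytes as 4 output characters (with padding).
function base64Length(rawBytes: number): number {
  return Math.ceil(rawBytes / 3) * 4;
}

// Estimates the request size as encoded media plus prompt text and picks
// the Files API whenever the estimate exceeds the inline limit.
function chooseUploadStrategy(fileBytes: number, promptBytes = 0): UploadStrategy {
  const estimatedRequestBytes = base64Length(fileBytes) + promptBytes;
  return estimatedRequestBytes <= INLINE_LIMIT_BYTES ? "inline" : "files-api";
}
```

Under these assumptions a 16 MiB file already exceeds the limit once encoded, so comparing the raw file size against 20MB would be too optimistic.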

Metadata

Author: @xsir0
Stars: 879
Views: 1
Updated: 2026-02-11

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-xsir0-google-gemini-media": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#gemini #multimodal #generative-media #veo #tts

Safety Score: 4/5

Flags: external-api, file-read