ClawKit Logo
ClawKitReliability Toolkit
Back to Registry
Official Verified media Safety 5/5

gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

skill-install — Terminal

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/araa47/gemini-stt
Or

What This Skill Does

The gemini-stt skill is a robust audio-to-text transcription engine designed for the OpenClaw ecosystem. It leverages Google’s powerful Gemini multimodal models to convert speech from various file formats—including ogg, mp3, wav, and m4a—into high-quality text transcripts. By supporting both direct API keys and Google Cloud's Vertex AI (via Application Default Credentials), it provides a flexible architecture for everything from local hobbyist scripts to enterprise-grade production pipelines. It is optimized for the 'gemini-2.0-flash-lite' model by default, ensuring that users receive rapid, cost-effective transcriptions without sacrificing accuracy.

Installation

To integrate this skill into your environment, use the OpenClaw package manager. Simply execute the following command in your terminal:

clawhub install openclaw/skills/skills/araa47/gemini-stt

Ensure you have Python 3.10 or higher installed. After installation, you must configure your authentication by either setting the GEMINI_API_KEY environment variable or authenticating your environment via gcloud auth application-default login for Vertex AI support.

Use Cases

This skill is perfect for automating media processing workflows within Clawdbot. Common use cases include:

  • Automating the transcription of Telegram voice notes for searchable archives.
  • Creating text summaries from recorded meeting audio files.
  • Developing voice-activated command interfaces where local audio input is transcribed before being passed to an LLM.
  • Batch processing long-form audio files or podcast episodes into structured text reports.

Example Prompts

  • "Transcribe the voice message located at ~/.clawdbot/media/inbound/user_voice_note.ogg and save the output to a text file."
  • "Use the gemini-2.5-pro model to transcribe the meeting file in ~/downloads/meeting.mp3 for maximum accuracy."
  • "Process the audio file ./audio_input.wav using my Vertex AI configuration and output the transcript to the terminal."

Tips & Limitations

To get the best results, ensure your audio files are clear and free of extreme background noise. While the skill supports multiple models, note that 'pro' models are significantly slower and more costly; reserve them for complex scenarios where transcription quality is paramount. If you encounter issues, verify your network connectivity to Google's APIs, as this skill requires outbound connectivity to function. Always ensure your environment variables are correctly loaded in your shell profile to avoid recurring configuration steps.

Metadata

Author@araa47
Stars4473
Views0
Updated2026-05-01
View Author Profile
AI Skill Finder

Not sure this is the right skill?

Describe what you want to build — we'll match you to the best skill from 16,000+ options.

Find the right skill
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-araa47-gemini-stt": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags(AI)

#transcription#audio#gemini#speech-to-text#voice
Safety Score: 5/5

Flags: file-read, external-api