Official Verified media Safety 5/5

gemini-stt

Transcribe audio files using Google's Gemini API or Vertex AI

What This Skill Does

The gemini-stt skill transcribes audio to text within the OpenClaw ecosystem. It uses Google's Gemini multimodal models to convert speech from several file formats, including ogg, mp3, wav, and m4a, into text transcripts. It supports two authentication paths: a direct API key, or Google Cloud's Vertex AI via Application Default Credentials. The same skill therefore works for local hobbyist scripts and for production pipelines. The default model is 'gemini-2.0-flash-lite', which favors speed and low cost.
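As a rough illustration of what a transcription request with an API key might look like, the sketch below builds (but does not send) a Gemini generateContent request carrying inline audio. It is not taken from the skill's source. The helper name build_transcription_request, the prompt text, and the extension-to-MIME mapping (in particular the value used for m4a) are assumptions made for this example, and the endpoint and inline_data body shape reflect the public Gemini REST API as commonly documented.

```python
import base64
from pathlib import Path

# Extension-to-MIME mapping. These values are assumptions for illustration;
# check the Gemini audio documentation for the MIME types it accepts.
MIME_TYPES = {
    ".ogg": "audio/ogg",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
}

DEFAULT_MODEL = "gemini-2.0-flash-lite"  # default model named in this listing


def build_transcription_request(audio_path, model=DEFAULT_MODEL):
    """Return (url, body) for a generateContent call with inline audio.

    Hypothetical helper: it only assembles the request and sends nothing.
    """
    path = Path(audio_path)
    mime_type = MIME_TYPES.get(path.suffix.lower())
    if mime_type is None:
        raise ValueError(f"unsupported audio format: {path.suffix}")

    # Inline audio is sent as base64-encoded bytes.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")

    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )
    body = {
        "contents": [
            {
                "parts": [
                    {"text": "Transcribe this audio verbatim."},
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }
    return url, body
```

To send the request, the body would be serialized as JSON and posted with the API key in a request header. Inline payloads are only practical for short clips, so longer recordings would likely need a file-upload path instead.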

Installation

To add this skill to your environment, run the following OpenClaw package manager command in your terminal:

clawhub install openclaw/skills/skills/araa47/gemini-stt

The skill requires Python 3.10 or higher. After installation, configure authentication in one of two ways: set the GEMINI_API_KEY environment variable, or run gcloud auth application-default login to use Vertex AI.
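The choice between the two authentication paths can be sketched as a small check on the environment. This is a minimal sketch, not the skill's actual logic: the helper name choose_auth_mode, the returned labels, and the rule that an API key takes precedence over Application Default Credentials are all assumptions.

```python
import os


def choose_auth_mode(env=None):
    """Pick an auth path from environment variables.

    Hypothetical helper. Assumes a non-empty GEMINI_API_KEY wins and that
    Vertex AI (Application Default Credentials) is the fallback.
    """
    env = os.environ if env is None else env
    if env.get("GEMINI_API_KEY", "").strip():
        return "api_key"
    return "vertex_adc"
```

Calling choose_auth_mode() with no arguments inspects the current process environment, which is a quick way to confirm that a key exported in your shell profile is actually visible to Python.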

Use Cases

This skill suits automated media processing workflows within Clawdbot. Common use cases include:

Automating the transcription of Telegram voice notes for searchable archives.
Creating text summaries from recorded meeting audio files.
Developing voice-activated command interfaces where local audio input is transcribed before being passed to an LLM.
Batch processing long-form audio files or podcast episodes into structured text reports.
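For the batch processing case above, one plausible first step is collecting the audio files to transcribe and deciding where each transcript goes. The sketch below assumes the four extensions named in this listing and writes each transcript next to its source file with a .txt suffix; the helper names and that output convention are hypothetical.

```python
from pathlib import Path

# Extensions named in this listing; the skill may accept others.
AUDIO_EXTENSIONS = {".ogg", ".mp3", ".wav", ".m4a"}


def find_audio_files(root):
    """Return all audio files under root, sorted for a stable order."""
    return sorted(
        p
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )


def transcript_path(audio_path):
    """Map an audio file to a sibling .txt transcript path (assumed layout)."""
    return Path(audio_path).with_suffix(".txt")
```

A batch job could then loop over find_audio_files(...), skip any file whose transcript_path(...) already exists, and hand the rest to the skill one at a time.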

Example Prompts

"Transcribe the voice message located at ~/.clawdbot/media/inbound/user_voice_note.ogg and save the output to a text file."
"Use the gemini-2.5-pro model to transcribe the meeting file in ~/downloads/meeting.mp3 for maximum accuracy."
"Process the audio file ./audio_input.wav using my Vertex AI configuration and output the transcript to the terminal."

Tips & Limitations

For best results, use clear audio without heavy background noise. The skill supports multiple models, but 'pro' models are significantly slower and more costly; reserve them for complex cases where transcription quality matters most. The skill needs outbound network access to Google's APIs, so check connectivity first if a request fails. Set your environment variables in your shell profile so you do not have to reconfigure them in every session.

Metadata

Author: @araa47

Stars: 4473

Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-araa47-gemini-stt": {
      "enabled": true,
      "auto_update": true
    }
  }
}
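If clawhub.json already contains other plugins, pasting the snippet over it would drop them. A minimal sketch of merging the entry into an existing configuration follows. It assumes the file is plain JSON with the top-level "plugins" object shown above, and the helper name enable_plugin is hypothetical.

```python
PLUGIN_ID = "official-araa47-gemini-stt"  # plugin key from the snippet above


def enable_plugin(config):
    """Return a copy of config with the gemini-stt plugin entry enabled.

    Other top-level keys and other plugin entries are kept as they are.
    """
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged
```

In practice the configuration would be read with json.load, passed through enable_plugin, and written back with json.dump.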

Tags (AI)

#transcription #audio #gemini #speech-to-text #voice

Safety Score: 5/5

Flags: file-read, external-api

Related Skills

local-whisper

Local speech-to-text using OpenAI Whisper. Runs fully offline after model download. High quality transcription with multiple model sizes.

araa47 4473

ez-unifi

Use when asked to manage UniFi network - list/restart/upgrade devices, block/unblock clients, manage WiFi networks, control PoE ports, manage traffic rules, create guest vouchers, or any UniFi controller task. Works with UDM Pro/SE, Dream Machine, Cloud Key Gen2+, or self-hosted controllers.

araa47 4473

ez-google

Use when asked to send email, check inbox, read emails, check calendar, schedule meetings, create events, search Google Drive, create Google Docs, read or write spreadsheets, find contacts, or any task involving Gmail, Google Calendar, Drive, Docs, Sheets, Slides, or Contacts. Agent-friendly with hosted OAuth - no API keys needed.

araa47 4473

local-stt

Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).

araa47 4473

md-to-pdf

Convert markdown files to clean, formatted PDFs using reportlab

araa47 4473