gemini-stt
Transcribe audio files using Google's Gemini API or Vertex AI
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/araa47/gemini-stt
What This Skill Does
The gemini-stt skill is an audio-to-text transcription tool for the OpenClaw ecosystem. It uses Google's Gemini multimodal models to convert speech in ogg, mp3, wav, and m4a files into text transcripts. It authenticates either with a direct API key or through Google Cloud's Vertex AI (via Application Default Credentials), so the same skill works for local hobbyist scripts and for production pipelines. The default model is 'gemini-2.0-flash-lite', chosen for fast, low-cost transcription.
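As a rough illustration of what sending audio to Gemini involves, the sketch below builds (but does not send) a request for the public Gemini REST endpoint with the audio inlined as base64. It is not the skill's actual implementation. The function name, the prompt text, and the MIME type mapping (in particular the m4a entry) are assumptions made for this example.

```python
import base64
from pathlib import Path

# Assumed MIME types for the formats the skill lists; verify against the
# Gemini audio documentation before relying on them.
AUDIO_MIME_TYPES = {
    ".ogg": "audio/ogg",
    ".mp3": "audio/mp3",
    ".wav": "audio/wav",
    ".m4a": "audio/mp4",
}


def build_transcription_request(audio_path, model="gemini-2.0-flash-lite"):
    """Return (url, json_body) for a generateContent call; nothing is sent."""
    path = Path(audio_path)
    mime_type = AUDIO_MIME_TYPES.get(path.suffix.lower())
    if mime_type is None:
        raise ValueError(f"Unsupported audio format: {path.suffix!r}")

    # Inline audio must be base64-encoded inside the JSON body.
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    url = (
        "https://generativelanguage.googleapis.com/v1beta/"
        f"models/{model}:generateContent"
    )
    body = {
        "contents": [
            {
                "parts": [
                    # Hypothetical prompt; the skill's real prompt is not documented here.
                    {"text": "Transcribe this audio verbatim."},
                    {"inline_data": {"mime_type": mime_type, "data": encoded}},
                ]
            }
        ]
    }
    return url, body
```

The returned body could then be posted as JSON with the API key in the `x-goog-api-key` header. Inlining suits short clips; long recordings would likely need an upload-based approach instead.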
Installation
To add this skill to your environment, use the OpenClaw package manager. Run the following command in your terminal:
clawhub install openclaw/skills/skills/araa47/gemini-stt
Ensure you have Python 3.10 or higher installed. After installation, configure authentication in one of two ways: set the GEMINI_API_KEY environment variable, or run gcloud auth application-default login to use Vertex AI.
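A quick pre-flight check can confirm which of the two authentication paths is available before you run a transcription. The following is a minimal sketch, not part of the skill: the function name, the return values, and the choice to prefer the API key over Vertex AI are assumptions. The credentials file path is the default location gcloud uses on Linux and macOS; Windows stores it elsewhere.

```python
import os
from pathlib import Path


def detect_auth_mode(env=None, home=None):
    """Return 'api_key', 'vertex_adc', or 'unconfigured' (hypothetical labels)."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Assumption: a direct API key takes precedence when both are present.
    if env.get("GEMINI_API_KEY"):
        return "api_key"

    # Application Default Credentials can be pointed at explicitly...
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit and Path(explicit).is_file():
        return "vertex_adc"

    # ...or written by `gcloud auth application-default login`.
    adc_file = home / ".config" / "gcloud" / "application_default_credentials.json"
    if adc_file.is_file():
        return "vertex_adc"

    return "unconfigured"
```

Calling `detect_auth_mode()` with no arguments inspects the current process environment and home directory.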
Use Cases
This skill is suited to automating media processing workflows within Clawdbot. Common use cases include:
- Automating the transcription of Telegram voice notes for searchable archives.
- Creating text summaries from recorded meeting audio files.
- Developing voice-activated command interfaces where local audio input is transcribed before being passed to an LLM.
- Batch processing long-form audio files or podcast episodes into structured text reports.
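For the batch-processing case, the first step is usually to collect the audio files the skill can handle. The sketch below assumes a plain directory walk and a transcript written next to each audio file; both helper names and the ".txt" naming convention are hypothetical, not behavior the skill documents.

```python
from pathlib import Path

# The formats named in the skill description.
SUPPORTED_EXTENSIONS = {".ogg", ".mp3", ".wav", ".m4a"}


def find_audio_files(directory):
    """Recursively list supported audio files under directory, sorted by path."""
    root = Path(directory)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )


def transcript_path_for(audio_path):
    """Assumed convention: the transcript sits beside the audio with a .txt suffix."""
    return Path(audio_path).with_suffix(".txt")
```

Each path returned by `find_audio_files` can then be handed to the skill one at a time, with `transcript_path_for` deciding where the text output goes.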
Example Prompts
- "Transcribe the voice message located at ~/.clawdbot/media/inbound/user_voice_note.ogg and save the output to a text file."
- "Use the gemini-2.5-pro model to transcribe the meeting file in ~/downloads/meeting.mp3 for maximum accuracy."
- "Process the audio file ./audio_input.wav using my Vertex AI configuration and output the transcript to the terminal."
Tips & Limitations
For the best results, use clear recordings with minimal background noise. The skill supports multiple models, but 'pro' models are significantly slower and more costly; reserve them for complex audio where transcription quality is paramount. If transcription fails, first check network connectivity to Google's APIs, since the skill requires outbound access. Set your environment variables in your shell profile so they persist across sessions and you do not have to reconfigure them each time.
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-araa47-gemini-stt": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags(AI)
Flags: file-read, external-api
Related Skills
local-whisper
Local speech-to-text using OpenAI Whisper. Runs fully offline after model download. High quality transcription with multiple model sizes.
ez-unifi
Use when asked to manage UniFi network - list/restart/upgrade devices, block/unblock clients, manage WiFi networks, control PoE ports, manage traffic rules, create guest vouchers, or any UniFi controller task. Works with UDM Pro/SE, Dream Machine, Cloud Key Gen2+, or self-hosted controllers.
ez-google
Use when asked to send email, check inbox, read emails, check calendar, schedule meetings, create events, search Google Drive, create Google Docs, read or write spreadsheets, find contacts, or any task involving Gmail, Google Calendar, Drive, Docs, Sheets, Slides, or Contacts. Agent-friendly with hosted OAuth - no API keys needed.
local-stt
Local STT with selectable backends - Parakeet (best accuracy) or Whisper (fastest, multilingual).
md-to-pdf
Convert markdown files to clean, formatted PDFs using reportlab