Speech to Text Transcription
Transcribe audio and video files to text with speaker detection, timestamps, and format conversion.
Why use this skill?
Convert audio and video to text with this OpenClaw skill. Features include speaker diarization, multi-format support, and privacy-focused local Whisper transcription.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/ivangdavila/speech-to-text-transcription
What This Skill Does
The Speech to Text Transcription skill converts audio and video content into readable text. It handles a wide range of inputs, including voice memos, lectures, interviews, and meeting recordings. Beyond plain transcription, it supports speaker diarization, timestamp generation, and export to multiple formats. It sits between raw audio files and structured text data, so you can manage, store, and analyze spoken content directly within your OpenClaw environment.
Installation
To integrate this skill into your workflow, use the standard OpenClaw installation command via your terminal:
clawhub install openclaw/skills/skills/ivangdavila/speech-to-text-transcription
Make sure ffmpeg is installed on your system; the skill depends on it for audio processing, splitting, and conversion. If you plan to use cloud providers for features such as diarization or higher-accuracy models, store your API keys for OpenAI, AssemblyAI, or Deepgram in environment variables.
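The prerequisites above can be checked before a first run. The following is a minimal Python sketch, not part of the skill itself. The environment variable names (OPENAI_API_KEY, ASSEMBLYAI_API_KEY, DEEPGRAM_API_KEY) are assumptions based on common provider conventions; the skill's documentation does not state which names it reads.

```python
import os
import shutil

# Assumed variable names; the skill's docs do not specify them.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "assemblyai": "ASSEMBLYAI_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
}


def check_prerequisites(env=None):
    """Report whether ffmpeg is on PATH and which provider keys are set."""
    env = os.environ if env is None else env
    return {
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "providers": {
            name: bool(env.get(var)) for name, var in PROVIDER_KEYS.items()
        },
    }


if __name__ == "__main__":
    report = check_prerequisites()
    print("ffmpeg found:", report["ffmpeg"])
    for name, present in report["providers"].items():
        print(f"{name} key set:", present)
```

Cloud keys are only needed for the cloud providers; local Whisper transcription needs none of them.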
Use Cases
This skill is perfect for professionals and students who manage significant amounts of spoken data. Use it to transcribe:
- Long-form meeting recordings for searchable archives.
- Podcasts and interviews for show notes and blog content.
- Voice memos for personal productivity and note-taking.
- Educational lectures to create study guides and summaries.
Example Prompts
- "Transcribe the interview file named 'ceo_interview.mp3' and make sure to identify who is speaking so I can distinguish between the CEO and the interviewer."
- "I have a two-hour lecture recording at '/home/user/downloads/physics_101.mp4'. Please process this, generate an SRT file for subtitles, and extract the key action items at the end."
- "Transcribe this voice memo from the URL [link] using the local Whisper model to keep it private and offline."
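For the SRT request in the second prompt, it can help to see what the subtitle output looks like. The sketch below is illustrative only and does not show the skill's internal implementation. It assumes the transcript is available as (start_seconds, end_seconds, text) tuples, which is a hypothetical shape chosen for this example, and formats them according to the SubRip convention of numbered cues with HH:MM:SS,mmm timestamps.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, text) tuples as numbered SRT cues."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [(0.0, 2.5, "Welcome to Physics 101."), (2.5, 6.0, "Let's begin.")]
    print(to_srt(demo))
```

Speaker labels from diarization could be prepended to each cue's text in the same way, for example "Speaker 1: ...".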
Tips & Limitations
- Pre-processing: For best results with noisy audio, clean the file with ffmpeg before transcription.
- File Size Management: Do not attempt to process files larger than 25MB or 2 hours in a single step; allow the agent to chunk the file to prevent timeouts.
- Provider Selection: Match the provider to the task. Use local Whisper for private, free transcription. Use AssemblyAI when you need precise speaker labeling. Use the OpenAI Whisper API when accuracy on complex audio matters most.
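The chunking advice above can be sketched as follows. This is a minimal illustration under stated assumptions, not the agent's actual chunking logic: the 10-minute chunk length, the function names, and the choice of 16 kHz mono WAV output are all hypothetical. The ffmpeg options used (-ss, -t, -i, -vn, -ac, -ar, -y) are standard ffmpeg flags.

```python
def plan_chunks(duration_s: float, chunk_s: float = 600.0):
    """Split a recording into (start, length) pairs of at most chunk_s seconds."""
    if duration_s <= 0 or chunk_s <= 0:
        raise ValueError("duration_s and chunk_s must be positive")
    chunks = []
    start = 0.0
    while start < duration_s:
        length = min(chunk_s, duration_s - start)
        chunks.append((start, length))
        start += chunk_s
    return chunks


def ffmpeg_chunk_command(src: str, start: float, length: float, dst: str):
    """Build an ffmpeg argument list that extracts one chunk as mono 16 kHz audio."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",   # seek to the chunk start
        "-t", f"{length:.3f}",   # read only this many seconds
        "-i", src,
        "-vn",                   # drop any video stream
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16 kHz
        dst,
    ]


if __name__ == "__main__":
    # A two-hour recording split into 10-minute pieces.
    for i, (start, length) in enumerate(plan_chunks(7200.0)):
        print(" ".join(ffmpeg_chunk_command("lecture.mp4", start, length, f"chunk_{i:03d}.wav")))
```

Each argument list can be passed to subprocess.run. With these assumed settings, a 10-minute chunk of 16-bit mono audio at 16 kHz is about 19.2 MB, which stays under the 25MB limit mentioned above.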
Metadata
Paste this into your clawhub.json to enable this plugin:
{
  "plugins": {
    "official-ivangdavila-speech-to-text-transcription": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: file-write, file-read, external-api, code-execution