Official Verified media Safety 4/5

Speech to Text Transcription

Transcribe audio and video files to text with speaker detection, timestamps, and format conversion.

Why use this skill?

Convert audio and video to text with this OpenClaw skill. Features include speaker diarization, multi-format support, and privacy-focused local Whisper transcription.

What This Skill Does

The Speech to Text Transcription skill converts audio and video content into readable text. It handles a wide variety of inputs, including voice memos, lectures, professional interviews, and meeting recordings. Beyond plain transcription, it supports speaker diarization, timestamp generation, and export to multiple formats. The result is structured text that you can store, search, and analyze directly within your OpenClaw environment.

Installation

To integrate this skill into your workflow, use the standard OpenClaw installation command via your terminal:

clawhub install openclaw/skills/skills/ivangdavila/speech-to-text-transcription

Make sure ffmpeg is installed on your system; the skill depends on it for audio processing, splitting, and conversion. If you plan to use a cloud provider for features such as diarization or higher-accuracy models, store your API key for OpenAI, AssemblyAI, or Deepgram in an environment variable.
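
Before the first run, a quick check can confirm both prerequisites. The sketch below is illustrative only: OPENAI_API_KEY is the variable name OpenAI's own SDK reads, but ASSEMBLYAI_API_KEY and DEEPGRAM_API_KEY are assumed conventions, and the skill may expect different names.

```python
import os
import shutil

# OPENAI_API_KEY is the name OpenAI's SDK reads by default. The other two
# names are assumptions made for this sketch; check the skill's own docs.
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "assemblyai": "ASSEMBLYAI_API_KEY",
    "deepgram": "DEEPGRAM_API_KEY",
}


def check_prerequisites(env=None, which=shutil.which):
    """Return (ffmpeg_found, providers_with_a_key) for the given environment."""
    env = os.environ if env is None else env
    ffmpeg_found = which("ffmpeg") is not None
    # A provider counts as configured only if its variable is set and non-empty.
    providers = sorted(name for name, var in PROVIDER_KEYS.items() if env.get(var))
    return ffmpeg_found, providers


if __name__ == "__main__":
    found, providers = check_prerequisites()
    print("ffmpeg on PATH:", found)
    print("cloud providers with a key set:", providers or "none (local only)")
```

If no provider key is set, local Whisper remains available, since it needs no API key.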

Use Cases

This skill suits professionals and students who work with large amounts of recorded speech. Use it to transcribe:

  • Long-form meeting recordings for searchable archives.
  • Podcasts and interviews for show notes and blog content.
  • Voice memos for personal productivity and note-taking.
  • Educational lectures to create study guides and summaries.

Example Prompts

  1. "Transcribe the interview file named 'ceo_interview.mp3' and make sure to identify who is speaking so I can distinguish between the CEO and the interviewer."
  2. "I have a two-hour lecture recording at '/home/user/downloads/physics_101.mp4'. Please process this, generate an SRT file for subtitles, and extract the key action items at the end."
  3. "Transcribe this voice memo from the URL [link] using the local Whisper model to keep it private and offline."
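
The second prompt asks for an SRT subtitle file. As a rough illustration of what that output involves, the sketch below renders timestamped segments in the SRT layout (a running index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` line, then the text). The `(start, end, speaker, text)` tuple shape is an assumption for this example, not the skill's actual internal format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm (the SRT timestamp layout)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments):
    """Render (start, end, speaker, text) tuples as an SRT document.

    The tuple shape is hypothetical; speaker may be None when no
    diarization was performed.
    """
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        line = f"{speaker}: {text}" if speaker else text
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{line}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([
        (0.0, 2.5, "Speaker 1", "Welcome to Physics 101."),
        (2.5, 4.0, None, "Thank you."),
    ]))
```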

Tips & Limitations

  • Pre-processing: For noisy audio, clean the file with ffmpeg before transcription.
  • File Size Management: Do not process files larger than 25 MB or longer than 2 hours in a single step; let the agent split the file into chunks to prevent timeouts.
  • Provider Selection: Use local Whisper for private, free transcription, AssemblyAI when you need precise speaker labeling, and the OpenAI Whisper API for the highest accuracy on complex audio.
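
The chunking mentioned in the file size tip can be sketched as follows. This is not the skill's own implementation: it assumes a roughly constant bitrate, treats 25 MB as 25 × 1024 × 1024 bytes, and uses ffmpeg's segment muxer with stream copy, which keeps the source container and cuts only at points the codec allows. The helper names are hypothetical.

```python
import math
from pathlib import Path

MAX_BYTES = 25 * 1024 * 1024   # size limit from the tip (assumed to mean MiB)
MAX_SECONDS = 2 * 60 * 60      # duration limit from the tip


def needs_chunking(size_bytes, duration_s):
    """True if the file exceeds either the size or the duration limit."""
    return size_bytes > MAX_BYTES or duration_s > MAX_SECONDS


def chunk_seconds(size_bytes, duration_s, safety=0.9):
    """Pick a segment length that keeps each chunk under both limits.

    Assumes a roughly constant bitrate; the safety factor leaves headroom
    because stream-copy cuts are not exact.
    """
    bytes_per_second = size_bytes / duration_s
    by_size = (MAX_BYTES * safety) / bytes_per_second
    return max(1, math.floor(min(by_size, MAX_SECONDS)))


def ffmpeg_split_command(src, seconds):
    """Build an ffmpeg command that splits src without re-encoding."""
    path = Path(src)
    pattern = f"{path.stem}_%03d{path.suffix}"  # e.g. talk_000.mp3
    return ["ffmpeg", "-i", str(src), "-f", "segment",
            "-segment_time", str(seconds), "-c", "copy", pattern]


if __name__ == "__main__":
    size, duration = 100 * 1024 * 1024, 4000  # a 100 MiB, ~67 minute file
    if needs_chunking(size, duration):
        print(" ".join(ffmpeg_split_command("talk.mp3", chunk_seconds(size, duration))))
```

The command is returned as a list so it can be passed straight to `subprocess.run` without shell quoting issues.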

Metadata

Stars: 2102
Views: 0
Updated: 2026-03-06

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ivangdavila-speech-to-text-transcription": {
      "enabled": true,
      "auto_update": true
    }
  }
}
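
If clawhub.json already lists other plugins, pasting the snippet by hand risks overwriting them. One possible approach is to merge the entry programmatically, as in the sketch below. The plugin ID and the two settings come from the snippet above; the helper names and the assumption that clawhub.json sits in the current directory are illustrative.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-ivangdavila-speech-to-text-transcription"


def enable_plugin(config, plugin_id=PLUGIN_ID):
    """Return a copy of config with the plugin entry added or updated."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    # Keep any existing settings for this plugin, then apply the two
    # values shown in the configuration snippet.
    plugins[plugin_id] = {
        **plugins.get(plugin_id, {}),
        "enabled": True,
        "auto_update": True,
    }
    merged["plugins"] = plugins
    return merged


def update_file(path="clawhub.json"):
    """Merge the plugin entry into the file at path, creating it if absent."""
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(enable_plugin(config), indent=2) + "\n")


if __name__ == "__main__":
    update_file()
```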

Tags (AI)

#transcription #audio #whisper #diarization #productivity

Safety Score: 4/5

Flags: file-write, file-read, external-api, code-execution