Official Verified media Safety 5/5

elevenlabs-stt

Speech-to-text using ElevenLabs Scribe V2. Use this skill when the user wants speech recognition, audio transcription, or speech-to-text, or mentions elevenlabs or scribe.

Why use this skill?

Integrate ElevenLabs Scribe V2 into OpenClaw for high-speed speech-to-text, speaker diarization, and accurate multilingual audio transcription.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/hexiaochun/sutui-elevenlabs-stt

What This Skill Does

The elevenlabs-stt skill uses the ElevenLabs Scribe V2 model to provide speech-to-text transcription within your AI environment. It converts audio files (including mp3, ogg, wav, m4a, and aac) into structured text. Beyond plain transcription, Scribe V2 offers automatic language detection, speaker diarization (distinguishing between different speakers in a recording), and audio event tagging (e.g., laughter or applause). This makes it well suited to transcribing meetings, interviews, or multimedia content where context and speaker identification are critical.
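To make diarization and audio event tagging concrete, the following minimal sketch collapses word-level output into one line per speaker turn. The field names (`text`, `type`, `speaker_id`) and the sample data are assumptions for illustration only; the structure this skill actually returns may differ, so check the real output before relying on it.

```python
from itertools import groupby


def speaker_turns(words):
    """Collapse consecutive entries from the same speaker into (speaker, text) turns.

    `words` is assumed to be a list of dicts with "text" and "speaker_id" keys.
    """
    turns = []
    # groupby only merges adjacent entries, which preserves the order of turns.
    for speaker, group in groupby(words, key=lambda w: w.get("speaker_id")):
        text = "".join(w["text"] for w in group).strip()
        if text:
            turns.append((speaker, text))
    return turns


# Hypothetical sample, not real output from the skill.
sample_words = [
    {"text": "Hello", "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_0"},
    {"text": "there.", "type": "word", "speaker_id": "speaker_0"},
    {"text": "(laughter)", "type": "audio_event", "speaker_id": "speaker_1"},
    {"text": " ", "type": "spacing", "speaker_id": "speaker_1"},
    {"text": "Hi!", "type": "word", "speaker_id": "speaker_1"},
]

for speaker, text in speaker_turns(sample_words):
    print(f"{speaker}: {text}")
```

Grouping only adjacent entries keeps the conversation order intact, so a speaker who talks twice produces two separate turns rather than one merged block.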

Installation

To add this skill to your OpenClaw agent, run the following command in your terminal:

clawhub install openclaw/skills/skills/hexiaochun/sutui-elevenlabs-stt

Use Cases

This skill is highly versatile and fits into various professional and personal workflows:

Meeting Transcription: Automatically generate meeting minutes by separating different speakers in long audio recordings.
Content Creation: Quickly transcribe podcasts or video interviews to create blog posts or accessibility subtitles.
Professional Research: Process qualitative data from research interviews, using the keyterms feature so that specialized industry vocabulary is recognized correctly.
Multilingual Support: Process content across various languages including English, Mandarin, Japanese, Korean, Spanish, French, and German with high accuracy.

Example Prompts

"Could you transcribe the audio from this meeting recording at https://example.com/meeting.mp3 and make sure to separate the speakers?"
"I have an interview file here: https://example.com/interview.wav. Please use ElevenLabs Scribe to convert it to text and ensure all specialized medical terms are recognized."
"Please perform a speech-to-text conversion on this Japanese audio file: https://example.com/jpn_session.m4a. Detect the language automatically and include tags for non-speech audio events."

Tips & Limitations

Precision: While automatic language detection works well, explicitly specifying the language_code often results in higher transcription accuracy, especially for audio with background noise or diverse accents.
Cost Efficiency: Using the keyterms parameter improves accuracy for technical jargon but increases the processing cost by 30%. Use it selectively for files where precision in specialized vocabulary is paramount.
Diarization: Always enable diarize: true for interviews or multi-party conversations to ensure the output identifies unique speakers.
Limits: The keyterms feature supports up to 100 terms, with each term limited to 50 characters. Ensure your list is optimized for the specific context of your audio file.
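The tips above can be sketched as a small helper that assembles transcription options and checks the keyterms limits before a request is made. The parameter names `language_code`, `diarize`, and `keyterms` come from the tips; the function names, the `audio_url` key, and the shape of the options dictionary are hypothetical and are not part of the skill's documented interface.

```python
MAX_KEYTERMS = 100           # maximum number of terms, per the limits above
MAX_KEYTERM_LENGTH = 50      # maximum characters per term, per the limits above
KEYTERMS_COST_FACTOR = 1.30  # keyterms raise the processing cost by 30%


def build_options(audio_url, language_code=None, diarize=False, keyterms=None):
    """Return a dict of transcription options (hypothetical shape)."""
    options = {"audio_url": audio_url, "diarize": diarize}
    if language_code:
        # An explicit language often transcribes more accurately than auto-detection.
        options["language_code"] = language_code
    if keyterms:
        # Trim whitespace, drop empty entries, and de-duplicate while keeping order.
        terms = list(dict.fromkeys(t.strip() for t in keyterms if t.strip()))
        if len(terms) > MAX_KEYTERMS:
            raise ValueError(f"{len(terms)} keyterms given; the limit is {MAX_KEYTERMS}")
        too_long = [t for t in terms if len(t) > MAX_KEYTERM_LENGTH]
        if too_long:
            raise ValueError(f"keyterms longer than {MAX_KEYTERM_LENGTH} characters: {too_long}")
        options["keyterms"] = terms
    return options


def relative_cost(options):
    """Cost multiplier relative to a request without keyterms."""
    return KEYTERMS_COST_FACTOR if options.get("keyterms") else 1.0


opts = build_options(
    "https://example.com/interview.wav",
    language_code="en",  # illustrative value
    diarize=True,
    keyterms=["tachycardia", " tachycardia ", "stent"],
)
print(opts["keyterms"], relative_cost(opts))
```

Validating the list locally keeps an oversized or overlong keyterms list from reaching the service, and the cost multiplier makes the trade-off visible before keyterms are switched on for a batch of files.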

Metadata

Author: @hexiaochun

Stars: 2387

Updated: 2026-03-09

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-hexiaochun-sutui-elevenlabs-stt": {
      "enabled": true,
      "auto_update": true
    }
  }
}
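
If clawhub.json already contains other plugins, pasting the snippet by hand risks overwriting them. The sketch below merges the entry into an existing file instead. It assumes only what the snippet above shows (a top-level "plugins" object keyed by plugin ID); the `enable_plugin` helper and the file location are hypothetical.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-hexiaochun-sutui-elevenlabs-stt"


def enable_plugin(config_path):
    """Add or update the plugin entry while leaving other settings untouched."""
    path = Path(config_path)
    # Start from the existing config if there is one, otherwise from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling `enable_plugin("clawhub.json")` from the directory that holds the file would add the entry and keep any plugins already listed there.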

Related Skills

audio-transcriber

Transform audio recordings into professional Markdown documentation with intelligent summaries using LLM integration

bingze00000 4473

youtube-summarizer

Automatically fetch YouTube video transcripts, generate structured summaries, and send full transcripts to messaging platforms. Detects YouTube URLs and provides metadata, key insights, and downloadable transcripts.

abe238 4473

elevenlabs-twilio-memory-bridge

FastAPI personalization webhook that adds persistent caller memory and dynamic context injection to ElevenLabs Conversational AI agents on Twilio. No audio proxying, file-based persistence, OpenClaw compatible.

britrik 4190

voice-note-to-midi

Convert voice notes, humming, and melodic audio recordings to quantized MIDI files using ML-based pitch detection and intelligent post-processing

danbennettuk 3376

spaces-listener

Record, transcribe, and summarize X/Twitter Spaces — live or replays. Auto-downloads audio via yt-dlp, transcribes with Whisper, and generates AI summaries.

jamesalmeida 2032