Official Verified media Safety 4/5

sergei-mikhailov-stt

Speech recognition from voice messages using Yandex SpeechKit (with an extensible architecture for other providers). Use when you need to convert a voice message to text.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/bzsega/sergei-mikhailov-stt

What This Skill Does

The sergei-mikhailov-stt skill is a speech-to-text integration for the OpenClaw ecosystem. It picks up incoming voice messages from connected messengers and transcribes them into text. It uses the Yandex SpeechKit API by default, and its architecture is extensible, so other recognition providers can be added as modules. The skill handles the full lifecycle of an audio file: it verifies that the file exists locally, validates its format, converts it with ffmpeg where needed, and sends it for transcription.
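The listing does not show the skill's internals, so the following is only a minimal sketch of how an extensible provider layer and the validation step could fit together. Every name in it (`SttProvider`, `EchoProvider`, `PROVIDERS`, `validate_audio`, `transcribe_file`) is hypothetical, and `EchoProvider` is a stand-in that makes no network call; a real Yandex SpeechKit provider would implement the same interface.

```python
from abc import ABC, abstractmethod
from pathlib import Path

# Formats named in this listing; the skill's real allow-list may differ.
SUPPORTED_SUFFIXES = {".ogg", ".wav", ".mp3"}


class SttProvider(ABC):
    """Hypothetical provider interface (not the skill's actual class)."""

    @abstractmethod
    def transcribe(self, audio_path: Path) -> str:
        """Return the recognized text for one audio file."""


class EchoProvider(SttProvider):
    """Offline stand-in used only to exercise the pipeline."""

    def transcribe(self, audio_path: Path) -> str:
        return f"<transcript of {audio_path.name}>"


# New providers are added by registering another class under a new key.
PROVIDERS = {"echo": EchoProvider}


def validate_audio(audio_path: Path) -> Path:
    """Check path form, file type, and existence before any processing."""
    if not audio_path.is_absolute():
        raise ValueError(f"expected an absolute path, got {audio_path}")
    if audio_path.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(f"unsupported format: {audio_path.suffix}")
    if not audio_path.is_file():
        raise FileNotFoundError(audio_path)
    return audio_path


def transcribe_file(audio_path: Path, provider_name: str = "echo") -> str:
    """Validate the file, then hand it to the selected provider."""
    provider = PROVIDERS[provider_name]()
    return provider.transcribe(validate_audio(audio_path))
```

Keeping providers behind one small interface means the validation and conversion steps stay the same regardless of which recognition service is selected.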

Installation

To get started, make sure the prerequisites are installed, in particular the python3-venv package. Run 'clawhub install sergei-mikhailov-stt' to download the skill. Then change to the skill's workspace directory (~/.openclaw/workspace/skills/sergei-mikhailov-stt) and run 'bash setup.sh'. The script creates a virtual environment, installs the Python dependencies, and prepares the configuration files. When setup finishes, run 'bash check.sh' to validate the installation. This diagnostic script checks that your environment, permissions, and API keys are configured correctly before live use.
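The contents of check.sh are not shown in this listing. As a rough illustration of the kind of pre-flight checks described above, here is a small sketch using only the Python standard library. The function name `check_prerequisites` and the choice of checks are assumptions, not a reproduction of the skill's script; the directory path is the one given in the installation text.

```python
import importlib.util
import shutil
from pathlib import Path

# Workspace path as stated in the installation instructions.
SKILL_DIR = Path.home() / ".openclaw" / "workspace" / "skills" / "sergei-mikhailov-stt"


def check_prerequisites(skill_dir: Path = SKILL_DIR) -> dict:
    """Return a name -> bool map of basic environment checks (illustrative only)."""
    return {
        # On Debian-style systems, python3-venv is what makes ensurepip available.
        "venv module": importlib.util.find_spec("venv") is not None,
        "ensurepip module": importlib.util.find_spec("ensurepip") is not None,
        # ffmpeg is used by the skill for audio conversion.
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
        "skill directory": skill_dir.is_dir(),
        "setup.sh present": (skill_dir / "setup.sh").is_file(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'ok' if ok else 'MISSING':8}{name}")
```

Running it before 'bash setup.sh' gives a quick view of what is missing; it does not replace 'bash check.sh', which also covers permissions and API keys.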

Use Cases

This skill suits users who frequently receive voice notes but prefer text for accessibility and searchability. Typical uses include environments where listening to audio is disruptive, archiving voice conversations as searchable text, and passing long voice messages to downstream AI analysis for a quick summary. It connects asynchronous audio communication with structured data management in OpenClaw.

Example Prompts

"Transcribe the voice message I just received in my last Telegram chat."
"Convert the audio file located at /home/user/.openclaw/media/inbound/note_001.ogg to text."
"Show me the text content of the latest voice memo from my professional channel."

Tips & Limitations

Keep audio files under the 1 MB limit of the Yandex SpeechKit v1 synchronous API. The skill accepts several formats, including OGG, WAV, and MP3; clear, high-quality recordings give the best recognition confidence. Never share your API keys, and make configuration changes manually in the protected .env or config files. If recognition quality drops, check that your ffmpeg installation is up to date, because the skill relies on it to pre-process non-standard audio files. Finally, the skill is invoked with absolute paths so that execution complies with OpenClaw's security approval prompts.
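The size limit and the ffmpeg pre-processing step can be checked by hand before handing a file to the skill. The sketch below is an assumption-laden illustration: `fits_sync_limit` and `build_ffmpeg_command` are hypothetical helpers, "1 MB" is read as 1,048,576 bytes, and the ffmpeg options (mono, Opus in an Ogg container, 32 kbit/s) are one reasonable choice rather than the flags the skill itself uses.

```python
from pathlib import Path

# Assumption: "1 MB" is interpreted as 1,048,576 bytes.
MAX_SYNC_BYTES = 1024 * 1024


def fits_sync_limit(audio_path: Path, limit: int = MAX_SYNC_BYTES) -> bool:
    """True if the file is small enough for a single synchronous request."""
    return audio_path.stat().st_size <= limit


def build_ffmpeg_command(src: Path, dst: Path) -> list:
    """Build (but do not run) an ffmpeg call that re-encodes src to mono Opus."""
    if not (src.is_absolute() and dst.is_absolute()):
        raise ValueError("use absolute paths for both input and output")
    return [
        "ffmpeg",
        "-y",              # overwrite the output file if it exists
        "-i", str(src),    # input file
        "-ac", "1",        # downmix to one channel
        "-c:a", "libopus", # encode audio with the Opus encoder
        "-b:a", "32k",     # low bitrate keeps speech files small
        str(dst),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Re-encoding a long recording at a low bitrate is also a practical way to bring it under the size limit.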

Metadata

Author: @bzsega

Stars: 4097

Updated: 2026-04-14

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-bzsega-sergei-mikhailov-stt": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags (AI)

#speech-to-text #audio-transcription #yandex #messenger #voice-processing

Safety Score: 4/5

Flags: file-read, external-api, code-execution

Related Skills

smtools-image-generation

Generate images from text prompts using AI models via OpenRouter, Kie.ai, or YandexART. Use when the user asks to generate, create, draw, or illustrate an image.

bzsega 4097

sergei-mikhailov-tg-channel-reader

Read posts and comments from Telegram channels via MTProto (Pyrogram or Telethon). Fetch recent messages and discussion replies from public or private channels by time window.

bzsega 4097