Official Verified communication Safety 4/5

voice-assistant

Real-time voice assistant for OpenClaw. Streams mic audio through configurable STT (Deepgram or ElevenLabs) into your OpenClaw agent, then speaks the response via configurable TTS (Deepgram Aura or ElevenLabs). Sub-2s time-to-first-audio with full streaming at every stage.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/charantejmandali18/voice-assistant

What This Skill Does

The voice-assistant skill adds real-time, voice-to-voice conversation to your OpenClaw agent. It bridges microphone audio from your browser to an OpenAI-compatible gateway and streams continuously at every stage: from the browser microphone to the STT processor, through the LLM, and back out via the TTS engine. This end-to-end streaming is what brings time-to-first-audio under 2 seconds. Deepgram and ElevenLabs are supported for both STT and TTS, so you can trade off cost, speed, and voice realism.
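The benefit of streaming at every stage can be illustrated with chained async generators. This is a minimal sketch, not the skill's actual code: `mic_chunks`, `stt`, `agent`, and `tts` are hypothetical stand-ins for the real stages, and a real agent would normally wait for the end of an utterance rather than reply word by word. The point it shows is only that output can start before all input has arrived.

```python
import asyncio
from typing import AsyncIterator, List

events: List[str] = []  # records the order in which things happen

# All four stages below are hypothetical stubs, not real provider calls.
async def mic_chunks() -> AsyncIterator[bytes]:
    for i in range(3):
        await asyncio.sleep(0)  # placeholder for waiting on the microphone
        events.append(f"mic:{i}")
        yield f"audio-{i}".encode()

async def stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    async for chunk in audio:
        yield f"word{chunk.decode()[-1]}"  # pretend each chunk is one word

async def agent(words: AsyncIterator[str]) -> AsyncIterator[str]:
    async for word in words:
        yield word.upper()  # pretend the agent streams one token per word

async def tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    async for token in tokens:
        yield f"<speech:{token}>".encode()

async def run_pipeline() -> List[bytes]:
    spoken = []
    # Each stage pulls from the previous one lazily, one item at a time.
    async for audio in tts(agent(stt(mic_chunks()))):
        events.append(f"out:{len(spoken)}")
        spoken.append(audio)
    return spoken

result = asyncio.run(run_pipeline())
print(events)
```

In the recorded event order, the first piece of synthesized audio appears before the last microphone chunk is read, which is the property that keeps time-to-first-audio low in a streaming design.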

Installation

First make sure your OpenClaw environment is configured, then install the skill from your terminal:

clawhub install openclaw/skills/skills/charantejmandali18/voice-assistant

Navigate to the skill's base directory, copy the environment template (cp .env.example .env), and fill in the API keys for your chosen providers. Then launch the server with uv run scripts/server.py and open http://localhost:7860 in your browser to start talking to the agent.
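Before launching the server, it can help to check that no key in .env was left blank after copying the template. The sketch below is a generic helper, not part of the skill; the key names in the sample are placeholders for illustration, since the actual names come from the skill's .env.example.

```python
from typing import List

def blank_env_keys(env_text: str) -> List[str]:
    """Return the keys in KEY=VALUE text whose value is empty."""
    blank = []
    for line in env_text.splitlines():
        line = line.strip()
        # Skip empty lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Treat empty quotes ("" or '') the same as no value at all.
        if not value.strip().strip("'\""):
            blank.append(key.strip())
    return blank

# Placeholder key names, for illustration only.
sample = """\
# provider credentials
EXAMPLE_STT_API_KEY=
EXAMPLE_TTS_API_KEY="abc123"
"""
missing = blank_env_keys(sample)
print(missing)
```

To check a real file, pass its contents instead of the sample, for example `blank_env_keys(open(".env").read())`.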

Use Cases

This skill suits scenarios that call for hands-free agent interaction: voice-controlled home automation, mock interviews where the agent gives instant feedback, or real-time brainstorming that captures your spoken thoughts without manual transcription. It also fits accessibility-focused workflows where typing is not the primary input method.

Example Prompts

  1. "Hey, I'm working on a Python script for data processing. Can you walk me through the best way to optimize a loop that handles large datasets?"
  2. "Draft a summary of our meeting notes. I want you to focus on the action items for the marketing team and the deadlines we discussed."
  3. "Summarize the last three messages in this thread and propose a professional response that acknowledges the client's concern regarding the budget."

Tips & Limitations

For the best experience, use a stable network connection, since high jitter can disrupt the WebSocket streaming. If you experience lag, try lowering your audio sample rate or switching to a faster STT provider. The VAD (Voice Activity Detection) silence threshold is tunable via VOICE_VAD_SILENCE_MS: increase it if the agent cuts you off mid-sentence, or decrease it if the agent takes too long to respond after you stop speaking.
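The trade-off behind VOICE_VAD_SILENCE_MS can be sketched with a simple silence-based end-of-utterance detector. This is an assumed model of how such a threshold typically behaves, not the skill's actual VAD; the function name, the frame size, and the per-frame speech flags are all hypothetical.

```python
from typing import Optional, Sequence

def utterance_end_ms(speech_flags: Sequence[bool], frame_ms: int,
                     silence_ms: int) -> Optional[int]:
    """Return the time (ms) at which the utterance is declared finished.

    The utterance ends once speech has been heard and is followed by at
    least `silence_ms` of continuous silence. Returns None if that never
    happens within the given frames.
    """
    silent_run = 0
    heard_speech = False
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += frame_ms
            if silent_run >= silence_ms:
                return (i + 1) * frame_ms
    return None

# 20 ms frames: 200 ms of speech, a 300 ms pause, 200 ms of speech,
# then 1000 ms of trailing silence.
frames = [True] * 10 + [False] * 15 + [True] * 10 + [False] * 50

eager_end = utterance_end_ms(frames, frame_ms=20, silence_ms=200)
patient_end = utterance_end_ms(frames, frame_ms=20, silence_ms=600)
print(eager_end, patient_end)  # 400 1300
```

With a 200 ms threshold the detector fires at 400 ms, in the middle of the speaker's pause, which corresponds to being cut off mid-sentence. With 600 ms it waits through the pause and fires at 1300 ms, 600 ms after the speech actually ended at 700 ms, which is the extra response delay a larger value costs.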

Metadata

Stars: 3875
Views: 0
Updated: 2026-04-07

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-charantejmandali18-voice-assistant": {
      "enabled": true,
      "auto_update": true
    }
  }
}
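If your clawhub.json already lists other plugins, pasting the fragment by hand risks overwriting them. The sketch below merges the entry programmatically instead. It assumes clawhub.json is plain JSON with a top-level "plugins" object, as the fragment above suggests; the `enable_plugin` helper and the `some-other-plugin` entry are hypothetical.

```python
import json
import tempfile
from pathlib import Path

PLUGIN_ID = "official-charantejmandali18-voice-assistant"

def enable_plugin(config_path: Path) -> dict:
    """Add the voice-assistant entry while keeping existing plugins intact."""
    config = json.loads(config_path.read_text()) if config_path.exists() else {}
    plugins = config.setdefault("plugins", {})
    plugins[PLUGIN_ID] = {"enabled": True, "auto_update": True}
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config

# Demonstration against a temporary file with one made-up existing plugin.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "clawhub.json"
    path.write_text('{"plugins": {"some-other-plugin": {"enabled": true}}}')
    merged = enable_plugin(path)
    on_disk = json.loads(path.read_text())

print(sorted(merged["plugins"]))
```

Point `enable_plugin` at the real location of your clawhub.json, which this page does not specify.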

Tags (AI)

#voice #streaming #tts #stt #real-time
Safety Score: 4/5

Flags: network-access, external-api