Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.
- STT (Speech-to-Text): transcribe voice messages via faster-whisper
- TTS (Text-to-Speech): voice replies via Edge TTS

Result: voice → text → reply with voice
On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```
Install packages in the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

What gets installed:
- faster-whisper: Python library for transcription
- Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
- Size: ~250 MB
File: `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small",
               lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(
        seg.text.strip() for seg in segments if seg.text and seg.text.strip()
    ).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

What the script does:
- Accepts an audio file path (`--audio`)
- Loads a Whisper model (`--model`): `small` by default
- Sets the language (`--lang`): `en` for English
- Transcribes with a VAD (Voice Activity Detection) filter
- Prints clean text to stdout
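The transcript-cleanup step in `transcribe` can be exercised without downloading a model. A minimal sketch, where the `Segment` class is a hypothetical stand-in for faster-whisper's segment objects (only `.text` is used):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical stand-in for a faster-whisper segment; only .text matters here.
    text: str


def join_segments(segments) -> str:
    # Same cleanup as transcribe.py: strip each segment's text,
    # drop empty segments, join with single spaces.
    return " ".join(
        seg.text.strip() for seg in segments if seg.text and seg.text.strip()
    ).strip()


segments = [Segment(" Hello. "), Segment(""), Segment(" How are you? ")]
print(join_segments(segments))  # -> Hello. How are you?
```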
```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```
Add to `~/.openclaw/openclaw.json`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in the venv |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
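Conceptually, the gateway substitutes the incoming file's path for the `{{MediaPath}}` placeholder before running the command. OpenClaw's actual substitution logic is not shown here; this sketch only illustrates what the config's `args` array expresses:

```python
# Sketch of placeholder expansion (assumed behavior, not OpenClaw's code):
# every occurrence of {{MediaPath}} in the args array is replaced with
# the path of the audio file that just arrived.
def expand_args(args, media_path):
    return [a.replace("{{MediaPath}}", media_path) for a in args]


args = ["transcribe.py", "--audio", "{{MediaPath}}", "--lang", "en"]
print(expand_args(args, "/tmp/voice-123.ogg"))
# -> ['transcribe.py', '--audio', '/tmp/voice-123.ogg', '--lang', 'en']
```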
Add to `~/.openclaw/openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `auto` | `"inbound"` | Key mode: reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (en-US for US English) |
```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
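A quick way to catch syntax slips (stray commas, mismatched braces) before restarting the gateway is to parse the file with Python's stdlib `json` module. A minimal check against a trimmed copy of the structure above:

```python
import json

# Parse a trimmed version of the merged config and confirm both the
# STT (tools.media.audio) and TTS (messages.tts) sections survive.
# json.loads raises ValueError on any syntax error, e.g. a trailing comma.
config = json.loads("""
{
  "tools": {"media": {"audio": {"enabled": true, "maxBytes": 20971520}}},
  "messages": {"tts": {"auto": "inbound", "provider": "edge"}}
}
""")
assert config["tools"]["media"]["audio"]["enabled"] is True
assert config["messages"]["tts"]["provider"] == "edge"
print("config OK")
```

On a live system, `python3 -m json.tool ~/.openclaw/openclaw.json` performs the same syntax check from the shell.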
```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```
Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio>
Transcript: Hello. How are you?
```
Action: after a successful transcription, the bot should send a voice reply.

Expected result: a voice file arrives in Telegram as a voice note (round bubble).

Expected behavior:
- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)
| Voice | ID | Notes |
| --- | --- | --- |
| Jenny | en-US-JennyNeural | current |
| Ana | en-US-AnaNeural | Softer |
| Roger | en-US-RogerNeural | More bass |

How to change the voice:

```bash
cat ~/.openclaw/openclaw.json | \
  jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
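If jq is not installed, the same edit can be made with Python's stdlib. A sketch that writes to a temporary file first (so a failure mid-write cannot truncate the real config), demonstrated here against a throwaway file rather than `~/.openclaw/openclaw.json`:

```python
import json
import pathlib
import tempfile


def set_voice(path: pathlib.Path, voice: str) -> None:
    # Read, modify, and atomically replace the config file.
    config = json.loads(path.read_text())
    config.setdefault("messages", {}).setdefault("tts", {}) \
          .setdefault("edge", {})["voice"] = voice
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2))
    tmp.replace(path)  # atomic rename on POSIX


# Demo on a throwaway copy instead of the live config:
demo = pathlib.Path(tempfile.mkdtemp()) / "openclaw.json"
demo.write_text('{"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}')
set_voice(demo, "en-US-MichelleNeural")
print(json.loads(demo.read_text())["messages"]["tts"]["edge"]["voice"])
# -> en-US-MichelleNeural
```

Remember to restart the gateway afterwards, as with the jq variant.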
```json
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}
```

- `rate`: speed, -50% to +100%
- `pitch`: pitch, -50% to +50%
- `volume`: volume, -100% to +100%
Logs show: `[ERROR] Transcription failed`

Possible causes:
- File too large (> 20 MB). Solution: increase `maxBytes` in the config, e.g. `"maxBytes": 52428800` (50 MB).
- Timeout: transcription took more than 2 minutes. Solution: increase `timeoutSeconds`, e.g. `"timeoutSeconds": 180` (3 minutes).
- Model not downloaded (first run). Solution: wait while it downloads (1-2 minutes); models are cached in `~/.cache/huggingface/`.
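The `maxBytes` values in the config are plain byte counts; the arithmetic behind the two limits above:

```python
# maxBytes is measured in raw bytes: megabytes * 1024 * 1024.
MB = 1024 * 1024
print(20 * MB)  # 20971520 (default 20 MB limit)
print(50 * MB)  # 52428800 (suggested larger 50 MB limit)
```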
Possible causes:
- Reply too short (< 10 characters): TTS skips very short replies. This is expected behavior.
- `auto: "inbound"` but the message was text: in inbound mode, TTS replies with voice only to voice messages; text messages get text replies. This is correct!
- Edge TTS unavailable:

```bash
# Check
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If error: temporarily unavailable
```
| Whisper model | Est. time | Quality |
| --- | --- | --- |
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High (current) |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: on a Raspberry Pi use `small` or `base`; `medium` and `large` will be very slow.
Models are cached in `~/.cache/huggingface/` and download automatically on first run.
After completing these steps:
- faster-whisper installed in the venv
- transcribe.py script created
- OpenClaw configured (STT + TTS)
- Gateway restarted
- Voice messages working

Now your Telegram bot:
- Accepts voice → transcribes via faster-whisper
- Replies with voice → generated via Edge TTS
- Accepts text → replies with text (as usual)

Useful links:
- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

Created: 2026-03-01 for OpenClaw 2026.2.26