Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.
- STT (Speech-to-Text): transcribe voice messages via faster-whisper
- TTS (Text-to-Speech): voice replies via Edge TTS

Result: voice → text → reply with voice
On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```
Install packages in the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

What gets installed:
- faster-whisper: Python library for transcription
- Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
- Size: ~250 MB
File: `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small",
               lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(
        seg.text.strip() for seg in segments if seg.text and seg.text.strip()
    ).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

What the script does:
- Accepts an audio file path (`--audio`)
- Loads a Whisper model (`--model`): `small` by default
- Sets the language (`--lang`): `en` for English
- Transcribes with a VAD (Voice Activity Detection) filter
- Prints clean text to stdout
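The transcript-cleanup step in `transcribe` can be exercised without downloading a model. A minimal sketch, where the `Segment` class is a hypothetical stand-in for faster-whisper's segment objects (only `.text` is used):

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical stand-in for a faster-whisper segment; only .text matters here.
    text: str


def join_segments(segments) -> str:
    # Same cleanup as transcribe.py: strip each segment's text,
    # drop empty segments, join with single spaces.
    return " ".join(
        seg.text.strip() for seg in segments if seg.text and seg.text.strip()
    ).strip()


segments = [Segment(" Hello. "), Segment(""), Segment(" How are you? ")]
print(join_segments(segments))  # -> Hello. How are you?
```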
```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```
Add to `~/.openclaw/openclaw.json`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in the venv |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |
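Conceptually, the gateway substitutes the incoming file's path for the `{{MediaPath}}` placeholder before running the command. OpenClaw's actual substitution logic is not shown here; this sketch only illustrates what the config's `args` array expresses:

```python
# Sketch of placeholder expansion (assumed behavior, not OpenClaw's code):
# every occurrence of {{MediaPath}} in the args array is replaced with
# the path of the audio file that just arrived.
def expand_args(args, media_path):
    return [a.replace("{{MediaPath}}", media_path) for a in args]


args = ["transcribe.py", "--audio", "{{MediaPath}}", "--lang", "en"]
print(expand_args(args, "/tmp/voice-123.ogg"))
# -> ['transcribe.py', '--audio', '/tmp/voice-123.ogg', '--lang', 'en']
```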
Add to `~/.openclaw/openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `auto` | `"inbound"` | Key mode: reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see available voices below) |
| `lang` | `"en-US"` | Locale (en-US for US English) |
```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
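A quick way to catch syntax slips (stray commas, mismatched braces) before restarting the gateway is to parse the file with Python's stdlib `json` module. A minimal check against a trimmed copy of the structure above:

```python
import json

# Parse a trimmed version of the merged config and confirm both the
# STT (tools.media.audio) and TTS (messages.tts) sections survive.
# json.loads raises ValueError on any syntax error, e.g. a trailing comma.
config = json.loads("""
{
  "tools": {"media": {"audio": {"enabled": true, "maxBytes": 20971520}}},
  "messages": {"tts": {"auto": "inbound", "provider": "edge"}}
}
""")
assert config["tools"]["media"]["audio"]["enabled"] is True
assert config["messages"]["tts"]["provider"] == "edge"
print("config OK")
```

On a live system, `python3 -m json.tool ~/.openclaw/openclaw.json` performs the same syntax check from the shell.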
```bash
# Method 1: via openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```
Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio>
Transcript: Hello. How are you?
```
Action: after a successful transcription, the bot should send a voice reply.

Expected result: a voice file arrives in Telegram as a voice note (round bubble).

Expected behavior:
- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)
| Voice | ID | Notes |
| --- | --- | --- |
| Jenny | en-US-JennyNeural | current |
| Ana | en-US-AnaNeural | Softer |
| Roger | en-US-RogerNeural | More bass |

How to change the voice:

```bash
cat ~/.openclaw/openclaw.json | \
  jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
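If jq is not installed, the same edit can be made with Python's stdlib. A sketch that writes to a temporary file first (so a failure mid-write cannot truncate the real config), demonstrated here against a throwaway file rather than `~/.openclaw/openclaw.json`:

```python
import json
import pathlib
import tempfile


def set_voice(path: pathlib.Path, voice: str) -> None:
    # Read, modify, and atomically replace the config file.
    config = json.loads(path.read_text())
    config.setdefault("messages", {}).setdefault("tts", {}) \
          .setdefault("edge", {})["voice"] = voice
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(config, indent=2))
    tmp.replace(path)  # atomic rename on POSIX


# Demo on a throwaway copy instead of the live config:
demo = pathlib.Path(tempfile.mkdtemp()) / "openclaw.json"
demo.write_text('{"messages": {"tts": {"edge": {"voice": "en-US-JennyNeural"}}}}')
set_voice(demo, "en-US-MichelleNeural")
print(json.loads(demo.read_text())["messages"]["tts"]["edge"]["voice"])
# -> en-US-MichelleNeural
```

Remember to restart the gateway afterwards, as with the jq variant.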
```json
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}
```

- `rate`: speed, -50% to +100%
- `pitch`: pitch, -50% to +50%
- `volume`: volume, -100% to +100%
Logs show: `[ERROR] Transcription failed`

Possible causes:
- File too large (> 20 MB). Solution: increase `maxBytes` in the config, e.g. `"maxBytes": 52428800` (50 MB).
- Timeout: transcription took more than 2 minutes. Solution: increase `timeoutSeconds`, e.g. `"timeoutSeconds": 180` (3 minutes).
- Model not downloaded (first run). Solution: wait while it downloads (1-2 minutes); models are cached in `~/.cache/huggingface/`.
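The `maxBytes` values in the config are plain byte counts; the arithmetic behind the two limits above:

```python
# maxBytes is measured in raw bytes: megabytes * 1024 * 1024.
MB = 1024 * 1024
print(20 * MB)  # 20971520 (default 20 MB limit)
print(50 * MB)  # 52428800 (suggested larger 50 MB limit)
```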
Possible causes:
- Reply too short (< 10 characters): TTS skips very short replies. This is expected behavior.
- `auto: "inbound"` but the message was text: in inbound mode, TTS replies with voice only to voice messages; text messages get text replies. This is correct!
- Edge TTS unavailable:

```bash
# Check
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If error: temporarily unavailable
```
| Whisper model | Est. time | Quality |
| --- | --- | --- |
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High (current) |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: on a Raspberry Pi use `small` or `base`; `medium` and `large` will be very slow.
Models are cached in `~/.cache/huggingface/` and download automatically on first run.
After completing these steps:
- faster-whisper installed in the venv
- transcribe.py script created
- OpenClaw configured (STT + TTS)
- Gateway restarted
- Voice messages working

Now your Telegram bot:
- Accepts voice → transcribes via faster-whisper
- Replies with voice → generated via Edge TTS
- Accepts text → replies with text (as usual)

Useful links:
- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

Created: 2026-03-01 for OpenClaw 2026.2.26