Tencent SkillHub · Communication & Collaboration

Voice messaging setup

Full voice message setup (STT + TTS) for OpenClaw using faster-whisper and Edge TTS



Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

  • Target platform: OpenClaw
  • Install method: Manual import
  • Extraction: Extract archive
  • Prerequisites: OpenClaw
  • Primary doc: SKILL.md

Package facts

  • Download mode: Yavira redirect
  • Package format: ZIP package
  • Source platform: Tencent SkillHub
  • What's included: SKILL.md

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.3

Documentation

Primary doc: SKILL.md (20 sections)

Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

What we configure

✅ STT (Speech-to-Text) — transcribe voice messages via faster-whisper
✅ TTS (Text-to-Speech) — voice replies via Edge TTS
🎯 Result: voice → text → reply with voice

1. Create virtual environment (venv)

For Ubuntu, create an isolated venv:

```shell
python3 -m venv ~/.openclaw/workspace/voice-messages
```
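If you prefer scripting the setup, the same environment can be created with Python's stdlib `venv` module. A minimal sketch — `create_voice_venv` is a hypothetical helper, and the path is just the one used throughout this guide:

```python
# A scripted alternative to `python3 -m venv ...` using the stdlib venv module.
# Nothing here is OpenClaw-specific; the path is an assumption from this guide.
import venv
from pathlib import Path


def create_voice_venv(base: Path, with_pip: bool = True) -> Path:
    """Create an isolated virtual environment and return its python executable."""
    venv.EnvBuilder(with_pip=with_pip).create(base)
    return base / "bin" / "python"


# Usage (same location as the shell command above):
# create_voice_venv(Path.home() / ".openclaw" / "workspace" / "voice-messages")
```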

2. Install faster-whisper

Install the package into the venv:

```shell
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

What gets installed:

  • faster-whisper — Python library for transcription
  • Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
  • Size: ~250 MB

3. Create the transcription script

File: ~/.openclaw/workspace/voice-messages/transcribe.py

```python
#!/usr/bin/env python3
import argparse

from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()
    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

What the script does:

  • Accepts an audio file path (--audio)
  • Loads a Whisper model (--model), small by default
  • Sets the language (--lang), en for English
  • Transcribes with a VAD (Voice Activity Detection) filter
  • Outputs clean text to stdout

Make file executable:

```shell
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```

4. Configure STT (tools.media.audio)

Add to ~/.openclaw/openclaw.json:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| enabled | true | Enable audio transcription |
| maxBytes | 20971520 | Max file size (20 MB) |
| type | "cli" | Model type: CLI command |
| command | Python path | Path to python in the venv |
| args | argument array | Arguments for the script |
| {{MediaPath}} | placeholder | Replaced with the audio file path |
| timeoutSeconds | 120 | Transcription timeout (2 minutes) |
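As a quick arithmetic check on the byte values used here and in the troubleshooting section — `mb` is a hypothetical helper for illustration, not part of OpenClaw:

```python
def mb(n: int) -> int:
    """Megabytes -> bytes, as expected by the maxBytes setting."""
    return n * 1024 * 1024


assert mb(20) == 20971520   # the default above (20 MB)
assert mb(50) == 52428800   # the larger limit suggested under troubleshooting
```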

5. Configure TTS (messages.tts)

Add to ~/.openclaw/openclaw.json:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| auto | "inbound" | Key mode! — reply with voice only to incoming voice messages |
| provider | "edge" | TTS provider (free, no API key) |
| voice | "en-US-JennyNeural" | Voice (see the list below) |
| lang | "en-US" | Locale (en-US for US English) |

6. Full configuration example

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio", "{{MediaPath}}",
              "--lang", "en",
              "--model", "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```
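Before pasting the combined snippet into openclaw.json, you can confirm it is valid JSON by round-tripping it through Python's json module. A sketch — the key checks simply mirror the settings used in this guide (note that JSON allows no trailing commas):

```python
import json

# The combined configuration from this guide, as a string.
config_text = """
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": ["~/.openclaw/workspace/voice-messages/transcribe.py",
                     "--audio", "{{MediaPath}}", "--lang", "en", "--model", "small"],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {"voice": "en-US-JennyNeural", "lang": "en-US"}
    },
    "ackReactionScope": "group-mentions"
  }
}
"""

cfg = json.loads(config_text)  # raises ValueError if the JSON is malformed
assert cfg["tools"]["media"]["audio"]["enabled"] is True
assert "{{MediaPath}}" in cfg["tools"]["media"]["audio"]["models"][0]["args"]
assert cfg["messages"]["tts"]["auto"] == "inbound"
```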

Restart Gateway

```shell
# Method 1: via the openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```

Test STT (transcription)

Action: send a voice message to your Telegram bot.

Expected result:

```
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```

Test TTS (voice replies)

Action: after a successful transcription, the bot should send a voice reply.

Expected result: a voice file arrives in Telegram as a voice note (round bubble).

Expected behavior:

  • Incoming voice → bot replies with voice
  • Text messages → bot replies with text (this is normal!)

Female voices

| Voice | ID | Usage example |
| --- | --- | --- |
| Jenny | en-US-JennyNeural | ← current |
| Ana | en-US-AnaNeural | Softer |

Male voices

| Voice | ID | Usage example |
| --- | --- | --- |
| Roger | en-US-RogerNeural | More bass |

How to change the voice:

```shell
cat ~/.openclaw/openclaw.json | \
  jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```
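If jq is not available, the same edit can be done from Python with an atomic replace. A sketch — `set_tts_voice` is a hypothetical helper, and you still need to restart the gateway afterwards:

```python
import json
import os
import tempfile
from pathlib import Path


def set_tts_voice(config_path: Path, voice: str) -> None:
    """Set messages.tts.edge.voice in openclaw.json, writing atomically."""
    cfg = json.loads(config_path.read_text())
    cfg.setdefault("messages", {}).setdefault("tts", {}).setdefault("edge", {})["voice"] = voice
    fd, tmp = tempfile.mkstemp(dir=config_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(cfg, f, indent=2)
    os.replace(tmp, config_path)  # atomic rename, like the mv in the jq recipe


# Usage:
# set_tts_voice(Path.home() / ".openclaw" / "openclaw.json", "en-US-MichelleNeural")
```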

Adjusting speed, pitch, volume

```jsonc
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",    // Speed: -50% to +100%
        "pitch": "-5%",    // Pitch: -50% to +50%
        "volume": "+5%"    // Volume: -100% to +100%
      }
    }
  }
}
```

Remove the // comments before saving — plain JSON does not allow them.
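A small validator can catch malformed prosody values before a restart. This is a sketch: `check_prosody` is a hypothetical helper, the signed-percent format comes from the example above, and the ranges are the ones noted in its comments:

```python
import re

# Ranges from the snippet above: rate -50%..+100%, pitch -50%..+50%, volume -100%..+100%.
_RANGES = {"rate": (-50, 100), "pitch": (-50, 50), "volume": (-100, 100)}


def check_prosody(param: str, value: str) -> int:
    """Validate a signed-percent string like '+10%' and return its integer value."""
    m = re.fullmatch(r"([+-]\d+)%", value)
    if not m:
        raise ValueError(f"{param}: expected a signed percentage like '+10%', got {value!r}")
    n = int(m.group(1))
    lo, hi = _RANGES[param]
    if not lo <= n <= hi:
        raise ValueError(f"{param}: {n}% outside allowed range {lo}%..{hi}%")
    return n


assert check_prosody("rate", "+10%") == 10
assert check_prosody("pitch", "-5%") == -5
```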

Problem: Voice not transcribed

Logs show: [ERROR] Transcription failed

Possible causes:

  • File too large (> 20 MB). Solution: increase maxBytes in the config, e.g. "maxBytes": 52428800 (50 MB).
  • Timeout — transcription took longer than 2 minutes. Solution: increase timeoutSeconds, e.g. "timeoutSeconds": 180 (3 minutes).
  • Model not downloaded (first run). Solution: wait 1-2 minutes while it downloads; models are cached in ~/.cache/huggingface/.
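A quick pre-check can tell you whether a recording will hit the size limit before you send it. A sketch — `audio_within_limit` is a hypothetical helper, and the constant should match your actual config:

```python
import os

MAX_BYTES = 20971520  # keep in sync with tools.media.audio.maxBytes


def audio_within_limit(path: str, max_bytes: int = MAX_BYTES) -> bool:
    """True if the audio file is small enough for the configured limit."""
    return os.path.getsize(path) <= max_bytes
```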

Problem: No voice reply

Possible causes:

  • Reply too short (< 10 characters) — TTS skips very short replies. This is expected behavior.
  • auto: "inbound" with a text message — in inbound mode, TTS replies with voice only to voice messages; text messages get text replies. This is correct!
  • Edge TTS unavailable — check the endpoint:

```shell
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If this errors, the service is temporarily unavailable
```
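The first two rules above can be modeled in a few lines. This is only a sketch of the documented behavior, not OpenClaw's actual implementation; the 10-character cutoff and the inbound-only rule are the ones described here:

```python
MIN_TTS_CHARS = 10  # per the troubleshooting note: very short replies are skipped


def should_reply_with_voice(reply_text: str, inbound_was_voice: bool, auto_mode: str = "inbound") -> bool:
    """Model the documented behavior: voice replies only for inbound voice,
    and only for replies of at least 10 characters."""
    if auto_mode != "inbound":
        return False  # assumption: only the 'inbound' mode described here is modeled
    return inbound_was_voice and len(reply_text.strip()) >= MIN_TTS_CHARS


assert should_reply_with_voice("Hello, how are you?", True)
assert not should_reply_with_voice("ok", True)                     # too short -> text reply
assert not should_reply_with_voice("Hello, how are you?", False)   # text in -> text out
```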

Transcription time (Raspberry Pi 4/ARM)

| Whisper model | Est. time | Quality |
| --- | --- | --- |
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High ← current |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: on a Raspberry Pi, use small or base; medium/large will be very slow.
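Using the worst-case estimates from the table, a simple picker shows why small is the sweet spot on a Pi. A sketch — `fastest_model_within` is a hypothetical helper, and the numbers are the table's rough estimates, not benchmarks:

```python
# Time ranges (seconds) from the table above, per model (Raspberry Pi 4 / ARM).
EST_SECONDS = {"tiny": (5, 10), "base": (10, 20), "small": (20, 40),
               "medium": (40, 80), "large": (80, 160)}


def fastest_model_within(budget_seconds: int) -> str:
    """Highest-quality model whose worst-case estimate fits the time budget."""
    for model in ("large", "medium", "small", "base", "tiny"):
        if EST_SECONDS[model][1] <= budget_seconds:
            return model
    return "tiny"  # nothing fits: fall back to the fastest model


assert fastest_model_within(45) == "small"    # matches the recommendation above
assert fastest_model_within(200) == "large"
```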

Where Whisper models are stored

Models are stored in ~/.cache/huggingface/ and download automatically on first run.

Done! 🎉

After completing these steps:

  ✅ faster-whisper installed in a venv
  ✅ transcribe.py script created
  ✅ OpenClaw configured (STT + TTS)
  ✅ Gateway restarted
  ✅ Voice messages working

Now your Telegram bot:

  🎙️ Accepts voice → transcribes via faster-whisper
  🎤 Replies with voice → generates speech via Edge TTS
  💬 Accepts text → replies with text (as usual)

Useful links:

  • OpenClaw docs: https://docs.openclaw.ai
  • TTS docs: https://docs.openclaw.ai/tts
  • Audio docs: https://docs.openclaw.ai/nodes/audio
  • Install skills: npx clawhub search voice

Created: 2026-03-01 for OpenClaw 2026.2.26

Category context

Messaging, meetings, inboxes, CRM, and teammate communication surfaces.


Package contents

Included in package (1 doc):
  • SKILL.md — Primary doc