# Send Voice messaging setup to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "voice-stt-tts",
    "name": "Voice messaging setup",
    "source": "tencent",
    "type": "skill",
    "category": "通讯协作",
    "sourceUrl": "https://clawhub.ai/aksenkin/voice-stt-tts",
    "canonicalUrl": "https://clawhub.ai/aksenkin/voice-stt-tts",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/voice-stt-tts",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=voice-stt-tts",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "SKILL.md"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/voice-stt-tts"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/voice-stt-tts",
    "downloadUrl": "https://openagent3.xyz/downloads/voice-stt-tts",
    "agentUrl": "https://openagent3.xyz/skills/voice-stt-tts/agent",
    "manifestUrl": "https://openagent3.xyz/skills/voice-stt-tts/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/voice-stt-tts/agent.md"
  }
}
```
## Documentation

### Voice Messages (STT + TTS) for OpenClaw 🎙️

Complete voice message setup using faster-whisper for transcription and Edge TTS for voice replies.

### What we configure

- ✅ STT (Speech-to-Text) — transcribe voice messages via faster-whisper
- ✅ TTS (Text-to-Speech) — voice replies via Edge TTS
- 🎯 Result: voice → text → reply with voice

### 1. Create virtual environment (venv)

On Ubuntu, create an isolated venv:

```bash
python3 -m venv ~/.openclaw/workspace/voice-messages
```

### 2. Install faster-whisper

Install the package into the venv:

```bash
~/.openclaw/workspace/voice-messages/bin/pip install faster-whisper
```

What gets installed:

- faster-whisper — Python library for transcription
- Dependencies: ctranslate2, onnxruntime, huggingface-hub, av, numpy, and others
- Size: ~250 MB

### Create the transcription script

File: `~/.openclaw/workspace/voice-messages/transcribe.py`

```python
#!/usr/bin/env python3
import argparse
from faster_whisper import WhisperModel


def transcribe(audio_path: str, model_name: str = "small", lang: str = "en", device: str = "cpu") -> str:
    model = WhisperModel(
        model_name,
        device=device,
        compute_type="int8" if device == "cpu" else "float16",
    )
    segments, _ = model.transcribe(audio_path, language=lang, vad_filter=True)
    text = " ".join(seg.text.strip() for seg in segments if seg.text and seg.text.strip()).strip()
    return text


def main():
    p = argparse.ArgumentParser()
    p.add_argument("--audio", required=True)
    p.add_argument("--model", default="small")
    p.add_argument("--lang", default="en")
    p.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    args = p.parse_args()

    text = transcribe(args.audio, args.model, args.lang, args.device)
    print(text if text else "")


if __name__ == "__main__":
    main()
```

What the script does:

- Accepts an audio file path (`--audio`)
- Loads a Whisper model (`--model`): `small` by default
- Sets the language (`--lang`): `en` for English
- Transcribes with a VAD filter (Voice Activity Detection)
- Outputs clean text to stdout

### Make the file executable

```bash
chmod +x ~/.openclaw/workspace/voice-messages/transcribe.py
```

### 1. Configure STT (tools.media.audio)

Add to `~/.openclaw/openclaw.json`:

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `enabled` | `true` | Enable audio transcription |
| `maxBytes` | `20971520` | Max file size (20 MB) |
| `type` | `"cli"` | Model type: CLI command |
| `command` | Python path | Path to python in the venv |
| `args` | argument array | Arguments for the script |
| `{{MediaPath}}` | placeholder | Replaced with the audio file path |
| `timeoutSeconds` | `120` | Transcription timeout (2 minutes) |

### 2. Configure TTS (messages.tts)

Add to `~/.openclaw/openclaw.json`:

```json
{
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    }
  }
}
```

Parameters:

| Parameter | Value | Description |
| --- | --- | --- |
| `auto` | `"inbound"` | Key mode! Reply with voice only to incoming voice messages |
| `provider` | `"edge"` | TTS provider (free, no API key) |
| `voice` | `"en-US-JennyNeural"` | Voice (see the options below) |
| `lang` | `"en-US"` | Locale (en-US for US English) |

### 3. Full configuration example

```json
{
  "tools": {
    "media": {
      "audio": {
        "enabled": true,
        "maxBytes": 20971520,
        "models": [
          {
            "type": "cli",
            "command": "~/.openclaw/workspace/voice-messages/bin/python",
            "args": [
              "~/.openclaw/workspace/voice-messages/transcribe.py",
              "--audio",
              "{{MediaPath}}",
              "--lang",
              "en",
              "--model",
              "small"
            ],
            "timeoutSeconds": 120
          }
        ]
      }
    }
  },
  "messages": {
    "tts": {
      "auto": "inbound",
      "provider": "edge",
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US"
      }
    },
    "ackReactionScope": "group-mentions"
  }
}
```

### Restart Gateway

```bash
# Method 1: via the openclaw CLI
openclaw gateway restart

# Method 2: via systemd
systemctl --user restart openclaw-gateway

# Check status
systemctl --user status openclaw-gateway
# Should show: active (running)
```

### Test STT (transcription)

Action: Send a voice message to your Telegram bot.

Expected result:

```text
[Audio] User text: [Telegram ...] <media:audio> Transcript: <transcribed text>
```

Example response:

```text
[Audio] User text: [Telegram kd (@someuser) id:12345678 +5s ...] <media:audio> Transcript: Hello. How are you?
```

### Test TTS (voice replies)

Action: After a successful transcription, the bot should send a voice reply.

Expected result:

- A voice file arrives in Telegram
- It appears as a voice note (round bubble)

Expected behavior:

- Incoming voice → bot replies with voice
- Text messages → bot replies with text (this is normal!)

### Female voices

| Voice | ID | Notes |
| --- | --- | --- |
| Jenny | en-US-JennyNeural | ← current |
| Ana | en-US-AnaNeural | Softer |

### Male voices

| Voice | ID | Notes |
| --- | --- | --- |
| Roger | en-US-RogerNeural | More bass |

How to change the voice:

```bash
jq '.messages.tts.edge.voice = "en-US-MichelleNeural"' \
  ~/.openclaw/openclaw.json > ~/.openclaw/openclaw.json.tmp
mv ~/.openclaw/openclaw.json.tmp ~/.openclaw/openclaw.json
systemctl --user restart openclaw-gateway
```

### Adjusting speed, pitch, volume

```json
{
  "messages": {
    "tts": {
      "edge": {
        "voice": "en-US-JennyNeural",
        "lang": "en-US",
        "rate": "+10%",
        "pitch": "-5%",
        "volume": "+5%"
      }
    }
  }
}
```

Allowed ranges: `rate` from -50% to +100%, `pitch` from -50% to +50%, `volume` from -100% to +100%.

### Problem: Voice not transcribed

Logs show:

```text
[ERROR] Transcription failed
```

Possible causes:

1. File too large (over 20 MB). Increase `maxBytes` in the config, e.g. `"maxBytes": 52428800` (50 MB).
2. Timeout: transcription took longer than 2 minutes. Increase `timeoutSeconds`, e.g. `"timeoutSeconds": 180` (3 minutes).
3. Model not downloaded (first run). Wait while it downloads (1-2 minutes); models are cached in `~/.cache/huggingface/`.

### Problem: No voice reply

Possible causes:

1. Reply too short (under 10 characters). TTS skips very short replies; this is expected behavior.
2. `auto: "inbound"` but the message was text. In `inbound` mode TTS replies with voice only to incoming voice messages; text messages get text replies, which is correct.
3. Edge TTS unavailable. Check it:

```bash
curl -s "https://speech.platform.bing.com/consumer/api/v1/tts" | head -c 100
# If this errors, the service is temporarily unavailable
```

### Transcription time (Raspberry Pi 4/ARM)

| Whisper model | Est. time | Quality |
| --- | --- | --- |
| tiny | ~5-10 sec | Low |
| base | ~10-20 sec | Medium |
| small | ~20-40 sec | High ← current |
| medium | ~40-80 sec | Very high |
| large | ~80-160 sec | Maximum |

Recommendation: for Raspberry Pi use `small` or `base`; `medium`/`large` will be very slow.

### Where Whisper models are stored

```text
~/.cache/huggingface/
```

Models download automatically on first run.

### Done! 🎉

After completing these steps:

- ✅ faster-whisper installed in a venv
- ✅ transcribe.py script created
- ✅ OpenClaw configured (STT + TTS)
- ✅ Gateway restarted
- ✅ Voice messages working

Now your Telegram bot:

- 🎙️ Accepts voice → transcribes via faster-whisper
- 🎤 Replies with voice → generates via Edge TTS
- 💬 Accepts text → replies with text (as usual)

Useful links:

- OpenClaw docs: https://docs.openclaw.ai
- TTS docs: https://docs.openclaw.ai/tts
- Audio docs: https://docs.openclaw.ai/nodes/audio
- Install skills: `npx clawhub search voice`

Created: 2026-03-01 for OpenClaw 2026.2.26
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: aksenkin
- Version: 1.0.3
## Source health
- Status: healthy
- Source download looks usable.
- Yavira can redirect you to the upstream package for this source.
- Health scope: source
- Reason: direct_download_ok
- Checked at: 2026-04-30T16:55:25.780Z
- Expires at: 2026-05-07T16:55:25.780Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/voice-stt-tts)
- [Send to Agent page](https://openagent3.xyz/skills/voice-stt-tts/agent)
- [JSON manifest](https://openagent3.xyz/skills/voice-stt-tts/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/voice-stt-tts/agent.md)
- [Download page](https://openagent3.xyz/downloads/voice-stt-tts)