Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
Install brief:

> I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.

Upgrade brief:

> I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 11 languages, 9 built-in voices, voice cloning, and voice design from text descriptions.
- Generate speech fully offline on a Mac
- Produce narration, audiobooks, podcasts, or video voiceovers
- Create multilingual TTS with controllable style and emotion
- Clone any voice from a short audio sample
- Design custom voices from text descriptions
```bash
pip install mlx-audio
brew install ffmpeg
```
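Before running anything, it can help to confirm both prerequisites are actually in place. A minimal sketch (the `check_*` helpers are throwaway illustrations, not part of the skill):

```python
import importlib.util
import shutil

def check_command(cmd: str) -> str:
    """Report whether an executable is on PATH."""
    return f"{cmd}: {'ok' if shutil.which(cmd) else 'missing'}"

def check_module(name: str) -> str:
    """Report whether a Python package is importable, without importing it."""
    return f"{name}: {'ok' if importlib.util.find_spec(name) else 'missing'}"

print(check_command("ffmpeg"))    # installed via Homebrew
print(check_module("mlx_audio"))  # installed via pip
```

If either line reports `missing`, rerun the corresponding install command above.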
```bash
python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav
```
```bash
python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav
```
| Variant | Model | Size | Memory | Use Case |
|---|---|---|---|---|
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1GB | ~4GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2GB | ~5GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1GB | ~4GB | Voice cloning from reference audio |
| Language | Code | Notes |
|---|---|---|
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |
| Voice | Language | Character |
|---|---|---|
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice selection guide:

| Scenario | Recommended Voice |
|---|---|
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
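For scripted batch work, the selection guide above can be mirrored as a small lookup table. A sketch (the scenario keys are my own shorthand; only the voice names come from this document):

```python
# Scenario -> recommended voice, mirroring the selection guide above.
RECOMMENDED_VOICE = {
    "chinese_news": "Uncle_Fu",
    "chinese_casual": "Eric",
    "chinese_female_professional": "Vivian",
    "chinese_storytelling": "Serena",
    "english_energetic": "Ryan",
    "english_neutral": "Aiden",
    "japanese": "Ono_Anna",
    "korean": "Sohee",
}

def pick_voice(scenario: str, default: str = "Vivian") -> str:
    """Fall back to the CLI's default voice for unknown scenarios."""
    return RECOMMENDED_VOICE.get(scenario, default)

print(pick_voice("english_neutral"))  # → Aiden
```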
Use built-in voices with optional emotion/style control via `--instruct`.

```bash
python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav
```

Style instruction examples:
- "calm and warm" - soft, friendly delivery
- "news anchor, authoritative" - professional broadcast style
- "excited and energetic" - high energy, enthusiastic
- "sad and melancholic" - emotional, somber tone
- "whispering, intimate" - quiet, close-mic feel
Create a completely new voice by describing it in natural language.

```bash
python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav
```

Voice description examples:
- "young cheerful female with high pitch"
- "elderly wise male with deep resonant voice"
- "professional female news anchor, clear articulation"
- "friendly young male, casual and relaxed"
Clone any voice from a reference audio sample (5-10 seconds recommended).

```bash
python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav
```

Tips for voice cloning:
- Use clean audio without background noise
- 5-10 seconds of speech works best
- Provide an accurate transcript of the reference
- Reference and output language should match
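Since clip length matters for cloning quality, you can verify a reference sits in the recommended 5-10 second window before synthesis. A stdlib-only sketch (the `wave` module handles plain PCM WAV only, and `reference.wav` is a hypothetical path):

```python
import wave

def wav_duration_seconds(path: str) -> float:
    """Duration of a PCM WAV file, via the stdlib wave module."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

# Example usage (hypothetical file):
# dur = wav_duration_seconds("reference.wav")
# if not 5.0 <= dur <= 10.0:
#     print(f"reference is {dur:.1f}s; 5-10s of clean speech works best")
```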
| Parameter | Required | Default | Description |
|---|---|---|---|
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |
```python
from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)
```
```python
from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery",
):
    if hasattr(chunk, "audio") and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)
```
```python
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)
```
```python
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)
```
Use `scripts/batch_dubbing.py` for processing multiple lines:

```bash
python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs
```

See `references/dubbing_format.md` for the JSON format.
| Metric | Value |
|---|---|
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2 GB) |
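The real-time factor above lets you budget generation time: RTF is generation time divided by audio duration, so an RTF of ~0.7 means a clip renders in roughly 70% of its own length. A quick sketch using the table's figure (actual speed varies with model variant, quantization, and hardware):

```python
def estimated_generation_seconds(audio_seconds: float, rtf: float = 0.7) -> float:
    """Generation time ≈ audio duration × real-time factor."""
    return audio_seconds * rtf

# A 10-minute audiobook chapter (600 s) at RTF ~0.7 takes about 7 minutes.
print(estimated_generation_seconds(600))
```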
| Issue | Solution |
|---|---|
| Slow generation | Use the 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use a 4-bit model |