Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Enables voice synthesis, voice cloning, voice design, and audio post-processing using MiniMax Voice API and FFmpeg. Use when converting text to speech, creat...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg.
| Area | Features |
|---|---|
| TTS | Sync (HTTP/WebSocket), async (long text), streaming |
| Segment-based | Multi-voice, multi-emotion synthesis from segments.json, auto merge |
| Voice | Cloning (10s–5min), design (text prompt), management |
| Audio | Format conversion, merge, normalize, trim, remove silence (FFmpeg) |
```
mmVoice_Maker/
├── SKILL.md                 # This overview
├── mmvoice.py               # CLI tool (recommended for Agents)
├── check_environment.py     # Environment verification
├── requirements.txt
├── scripts/                 # Entry: scripts/__init__.py
│   ├── utils.py             # Config, data classes
│   ├── sync_tts.py          # HTTP/WebSocket TTS
│   ├── async_tts.py         # Long text TTS
│   ├── segment_tts.py       # Segment-based TTS (multi-voice, multi-emotion)
│   ├── voice_clone.py       # Voice cloning
│   ├── voice_design.py      # Voice design
│   ├── voice_management.py  # List/delete voices
│   └── audio_processing.py  # FFmpeg audio tools
└── reference/               # Load as needed
    ├── cli-guide.md         # CLI usage guide
    ├── getting-started.md   # Setup and quick test
    ├── tts-guide.md         # Sync/async TTS workflows
    ├── voice-guide.md       # Clone/design/manage
    ├── audio-guide.md       # Audio processing
    ├── script-examples.md   # Runnable code snippets
    ├── troubleshooting.md   # Common issues
    ├── api_documentation.md # Complete API reference
    └── voice_catalog.md     # Voice selection guide
```
6-step workflow:

1. Verify environment
2. Process text into a script → `<cwd>/audio/segments.json`
   - ⚠️ NOTE: Before processing the text, you must read voice_catalog.md for voice selection.
   - Note: Step 2.4 is especially important; check it twice before sending the script to the user.
   - Step 2.5: ⚠️ Generate a preview for user confirmation (highly recommended for multi-voice content)
3. Present the plan to the user for confirmation
4. Validate segments.json
5. Generate and merge audio → intermediate files in `<cwd>/audio/tmp/`, final output in `<cwd>/audio/output.mp3`
6. ⚠️ CRITICAL: the user confirms audio quality FIRST, THEN clean up temp files (only after the user is satisfied)

`<cwd>` is Claude's current working directory (not the skill directory). Audio files are saved relative to where Claude is running commands.
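As a rough illustration of Step 2's output, a script file might be assembled like this. The authoritative schema is whatever `python mmvoice.py validate` accepts; the field names below (`text`, `voice_id`, `emotion`) are assumptions based on the features described in this document:

```python
import json

# Hypothetical segments.json content -- field names are assumptions,
# not the skill's documented schema. Each entry pairs a piece of text
# with a voice and an emotion for segment-based synthesis.
segments = [
    {"text": "Welcome to the show.", "voice_id": "host_voice", "emotion": "happy"},
    {"text": "Thanks for having me.", "voice_id": "guest_voice", "emotion": "calm"},
]

# Write the script where Step 2 expects it (shown here as ./segments.json).
with open("segments.json", "w", encoding="utf-8") as f:
    json.dump(segments, f, ensure_ascii=False, indent=2)
```

The real file would live at `<cwd>/audio/segments.json` and then go through Steps 3–4 before any audio is generated.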
```bash
python check_environment.py
```

Checks:
- Python 3.8+
- Required packages (requests, websockets)
- FFmpeg installation
- MINIMAX_VOICE_API_KEY environment variable

If the API key is not set, ask the user for it and set it:

```bash
export MINIMAX_VOICE_API_KEY="your-api-key-here"
```
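The checks above can be sketched in a few lines of Python. This is a minimal sketch, not the bundled `check_environment.py`, which is authoritative and may verify more:

```python
import os
import shutil
import sys

def check_environment_sketch():
    """Return a list of problems; an empty list means the environment looks OK."""
    problems = []
    # Python version check
    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ required")
    # Required packages
    for pkg in ("requests", "websockets"):
        try:
            __import__(pkg)
        except ImportError:
            problems.append("missing package: " + pkg)
    # FFmpeg on PATH
    if shutil.which("ffmpeg") is None:
        problems.append("FFmpeg not found on PATH")
    # API key
    if not os.environ.get("MINIMAX_VOICE_API_KEY"):
        problems.append("MINIMAX_VOICE_API_KEY is not set")
    return problems
```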
Before generating audio, validate the segments file:

```bash
# Default: speech-2.8-hd (auto emotion matching)
python mmvoice.py validate <cwd>/audio/segments.json

# Specify model for context-specific validation
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd

# Validate voice_ids against available voices (slower, requires API call)
python mmvoice.py validate <cwd>/audio/segments.json --validate-voices

# Combined options (recommended)
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices

# Use --verbose to see segment details
python mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices --verbose
```

Emotion validation checks:

| Model | Emotion validation |
|---|---|
| speech-2.8-hd/turbo | Emotion can be empty (auto emotion matching) |
| speech-2.6-hd/turbo | All 9 emotions supported |
| Older models | happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions) |

Voice ID validation with `--validate-voices`:
- Calls the API once to get all available voices
- Validates each voice_id against the list
- Shows errors for invalid voice_ids (blocks validation)
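The per-model emotion rules in the table above amount to a small lookup. The sketch below encodes them for illustration; matching models by name prefix is an assumption, and the skill's validator is authoritative:

```python
# The 9 emotions listed in this document; older models support the first 7.
EMOTIONS_9 = {"happy", "sad", "angry", "fearful", "disgusted",
              "surprised", "calm", "fluent", "whisper"}
EMOTIONS_7 = EMOTIONS_9 - {"fluent", "whisper"}

def emotion_ok(model, emotion):
    """Return True if `emotion` is acceptable for `model` per the table above.

    Assumption: speech-2.8 models also accept explicit emotions in
    addition to the empty string (auto matching).
    """
    if model.startswith("speech-2.8"):
        return emotion == "" or emotion in EMOTIONS_9
    if model.startswith("speech-2.6"):
        return emotion in EMOTIONS_9
    return emotion in EMOTIONS_7
```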
Generate audio for all segments and merge into the final output.

File placement (default behavior if the user doesn't specify):

```
<cwd>/                            # Claude's current working directory
└── audio/                        # Created automatically
    ├── tmp/                      # Intermediate segment files
    │   ├── segment_0000.mp3
    │   ├── segment_0001.mp3
    │   └── ...
    └── <custom_audio_name>.mp3   # Final merged audio, name can be customized
```

Where `<cwd>` is Claude's current working directory (where commands are executed):
- If -o is not specified, output goes to <cwd>/audio/output.mp3
- Intermediate files go to <cwd>/audio/tmp/
- After the user confirms the final audio, ask whether to delete <cwd>/audio/tmp/

Basic usage:

```bash
# Default: speech-2.8-hd, output to <cwd>/audio/output.mp3
python mmvoice.py generate <cwd>/audio/segments.json

# Specify output path
python mmvoice.py generate <cwd>/audio/segments.json -o <cwd>/audio/<custom_audio_name>.mp3

# Specify model if needed
python mmvoice.py generate <cwd>/audio/segments.json --model speech-2.6-hd
```

Skip existing segments (for rate limit retries):

```bash
# Only generate segments that don't exist yet - skips already-generated files
python mmvoice.py generate <cwd>/audio/segments.json --skip-existing
```

Error handling:
- If a segment fails, the script reports which segment failed and why
- Use --continue-on-error to generate the remaining segments despite failures
- Use --skip-existing to skip already successfully generated segments (recommended for retries after rate limits)
- The script automatically falls back to a simpler merge if the FFmpeg filter_complex merge fails
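The `--skip-existing` behavior described above can be sketched as a filter over segment indices. This is an illustration of the described behavior, not the actual implementation in `scripts/segment_tts.py`; the `segment_NNNN.mp3` naming follows the tmp/ layout shown in this document:

```python
from pathlib import Path

def segments_to_generate(n_segments, tmp_dir, skip_existing):
    """Return the indices that still need audio.

    With skip_existing=True, a segment is skipped when its
    segment_NNNN.mp3 already exists in tmp_dir -- this is what makes
    retries after rate limits cheap.
    """
    pending = []
    for i in range(n_segments):
        out = Path(tmp_dir) / f"segment_{i:04d}.mp3"
        if skip_existing and out.exists():
            continue
        pending.append(i)
    return pending
```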
⚠️ CRITICAL: Never delete temp files until the user confirms!

After generation completes, you MUST follow this exact sequence:

Step 6.1: Report the generation result to the user

```
✅ Audio saved to: <output_path>
Generated: X/Y segments
Intermediate files in: <cwd>/audio/tmp/
```

Step 6.2: Ask the user to listen to the audio and confirm:
- Is the audio quality satisfactory?
- Are all voices appropriate?
- Any adjustments needed?

Step 6.3: Wait for the user's response.

Step 6.4: Only after the user confirms, offer cleanup. Temporary files can then be deleted with:

```bash
rm -rf <cwd>/audio/tmp/
```

NEVER execute rm -rf on temp files without explicit user confirmation!

If the user is NOT satisfied:
- Do NOT delete temp files
- Discuss what needs to be adjusted
- Re-generate the affected segments if needed
- Ask for confirmation again
Use the following when the task involves voice creation, single-voice TTS (sync/async), or audio processing instead of the main segment-based workflow. Each subsection gives CLI commands, script paths, and the reference doc to open for details.
Purpose: Create custom voices from audio (clone) or from a text description (design); list system and custom voices.

CLI (entry point: mmvoice.py):

```bash
python mmvoice.py clone AUDIO_FILE --voice-id VOICE_ID   # Clone from 10s-5min audio
python mmvoice.py design "DESCRIPTION" --voice-id ID     # Design from text
python mmvoice.py list-voices                            # List all voices
```

Scripts: scripts/voice_clone.py (clone), scripts/voice_design.py (design), scripts/voice_management.py (list/manage).

Documentation: reference/voice-guide.md → cloning (quick + high-quality + step-by-step), design workflow, management.
Purpose: Single-voice TTS: sync for short text (≤10k chars), async for long text (up to 1M chars); optional streaming.

CLI:

```bash
python mmvoice.py tts "TEXT" -o OUTPUT.mp3 [-v VOICE_ID] [--model MODEL]
```

Scripts: scripts/sync_tts.py (HTTP/WebSocket sync), scripts/async_tts.py (async task + poll).

Documentation: reference/tts-guide.md → sync TTS, async TTS, streaming, segment-based production.
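Choosing between sync and async comes down to the length limits stated above. A minimal dispatcher sketch (the limits are from this document; the function and its return values are illustrative, not part of the skill's API):

```python
SYNC_LIMIT = 10_000       # sync TTS handles up to 10k chars
ASYNC_LIMIT = 1_000_000   # async TTS handles up to 1M chars

def pick_tts_mode(text):
    """Return 'sync' or 'async' based on the documented length limits."""
    n = len(text)
    if n <= SYNC_LIMIT:
        return "sync"
    if n <= ASYNC_LIMIT:
        return "async"
    raise ValueError(f"text too long for async TTS: {n} chars")
```

Text over the async limit would need to be split by the caller before submission.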
Purpose: Merge files (with optional crossfade), convert format, normalize loudness, trim.

CLI:

```bash
python mmvoice.py merge FILE1 [FILE2 ...] -o OUTPUT [--crossfade MS]
python mmvoice.py convert INPUT -o OUTPUT [--format FORMAT]
```

Script: scripts/audio_processing.py (merge, convert, normalize, trim).

Documentation: reference/audio-guide.md → format conversion, merging (filter_complex + concat demuxer fallback), normalization, trimming, optimization.
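The concat demuxer fallback mentioned above works by writing a file list and handing it to FFmpeg. A sketch of that command construction, assuming same-codec inputs so `-c copy` is safe; the skill's `audio_processing.py` is the authoritative implementation:

```python
from pathlib import Path

def concat_demuxer_cmd(inputs, output, list_file="concat.txt"):
    """Build an FFmpeg concat-demuxer command for lossless merging.

    Writes the file list the demuxer reads, then returns the argv to run
    (e.g. via subprocess.run). `-safe 0` permits arbitrary paths in the
    list; `-c copy` avoids re-encoding.
    """
    Path(list_file).write_text(
        "".join(f"file '{p}'\n" for p in inputs), encoding="utf-8"
    )
    return ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]
```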
CLI: validate and generate as in Steps 4β5 above. Script: scripts/segment_tts.py. Documentation: reference/cli-guide.md, reference/api_documentation.md.
Open these when you need concrete usage, parameters, or troubleshooting. Paths are relative to the skill root.

| Document | Content for the Agent |
|---|---|
| reference/cli-guide.md | All CLI commands (validate, generate, tts, clone, design, list-voices, merge, convert, check-env) with options and examples. Use for correct CLI invocation. |
| reference/getting-started.md | Environment setup (venv, pip install, FFmpeg), MINIMAX_VOICE_API_KEY, basic synthesis test. Use for first-time setup or "env not working". |
| reference/tts-guide.md | Sync TTS (short text), async TTS (long text), streaming TTS, multi-segment production. Use for sync/async/streaming logic and parameters. |
| reference/voice-guide.md | Voice cloning (quick, high-quality with prompt audio, step-by-step), voice design, voice management. Use for custom voice creation flows. |
| reference/audio-guide.md | Format conversion, merging (including crossfade and fallback), normalization, trimming, optimization. Use for merge/convert/normalize behavior and options. |
| reference/script-examples.md | Copy-paste runnable examples for sync TTS, async TTS, segment-based TTS, audio processing, voice clone/design/management. Use for quick Python snippets. |
| reference/troubleshooting.md | Environment (API key, FFmpeg), API errors, segment-based TTS, audio, voice. Use when an error message or unexpected behavior appears. |
| reference/api_documentation.md | Full API reference: config, sync/async TTS, emotion parameter, segment-based TTS, voice clone/design/management, audio processing, common parameters, error handling. Use for exact function signatures and parameter details. |
| reference/voice_catalog.md | System voices list (male/female/beta), selection guide, voice parameters, custom voices, voice IDs. Use to choose or look up voice_id. |
- Python: 3.8 or higher
- API key: MINIMAX_VOICE_API_KEY environment variable must be set
- FFmpeg: required for audio processing (merge, convert, normalize). Install with `brew install ffmpeg` (macOS) or `sudo apt install ffmpeg` (Ubuntu)
- Text length: sync TTS ≤10,000 chars; async TTS ≤1,000,000 chars
- Voice cloning: audio must be 10s–5min in duration, ≤20MB, in mp3/wav/m4a format
- Voice expiration: custom voices (cloned/designed) expire after 7 days if not used with TTS
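The clone-source limits above make a natural pre-flight check before uploading audio. A sketch, assuming 20MB means 20·1024·1024 bytes; the caller would measure duration and size with ffprobe and os.path.getsize (not shown), and the API's own validation is authoritative:

```python
CLONE_FORMATS = {"mp3", "wav", "m4a"}

def clone_audio_ok(duration_s, size_bytes, filename):
    """True if a clone source meets the 10s-5min, <=20MB, format limits."""
    ext = filename.rsplit(".", 1)[-1].lower()
    return (10 <= duration_s <= 300          # 10s to 5 minutes
            and size_bytes <= 20 * 1024 * 1024  # assumed binary 20MB
            and ext in CLONE_FORMATS)
```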
- Pause insertion: use `<#x#>` in the text, where x is the pause duration in seconds (0.01–99.99). Example: "Hello<#1.5#>world" creates a 1.5s pause between the words.
- Supported emotions: happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper
  - speech-2.8: automatic matching; speech-2.6: all 9; older models: first 7
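A script can sanity-check pause tags before sending text to the API. The sketch below only validates tags that match the documented `<#x#>` shape (malformed tags pass through untouched); the API's own parsing may be stricter:

```python
import re

# Matches <#x#> where x looks like a number with up to 2 digits
# on each side of the decimal point, e.g. <#1.5#>, <#99.99#>, <#2#>.
PAUSE_TAG = re.compile(r"<#(\d{1,2}(?:\.\d{1,2})?)#>")

def pause_tags_valid(text):
    """Check that every well-formed <#x#> tag has x in the 0.01-99.99 range."""
    for match in PAUSE_TAG.finditer(text):
        x = float(match.group(1))
        if not (0.01 <= x <= 99.99):
            return False
    return True
```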
- Run `python check_environment.py` to diagnose setup issues
- See reference/troubleshooting.md for common problems and solutions
- Check reference/getting-started.md for detailed setup instructions