{
  "schemaVersion": "1.0",
  "item": {
    "slug": "mm-voice-maker",
    "name": "mmVoiceMaker",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/BLUE-coconut/mm-voice-maker",
    "canonicalUrl": "https://clawhub.ai/BLUE-coconut/mm-voice-maker",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/mm-voice-maker",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=mm-voice-maker",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "check_environment.py",
      "mmvoice.py",
      "reference/api_documentation.md",
      "reference/audio-guide.md",
      "reference/cli-guide.md"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-07T17:22:31.273Z",
      "expiresAt": "2026-05-14T17:22:31.273Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=afrexai-annual-report",
        "contentDisposition": "attachment; filename=\"afrexai-annual-report-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/mm-voice-maker"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/mm-voice-maker",
    "agentPageUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent",
    "manifestUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "MiniMax Voice Maker",
        "body": "Professional text-to-speech skill with emotion detection, voice cloning, and audio processing capabilities powered by MiniMax Voice API and FFmpeg."
      },
      {
        "title": "Capabilities",
        "body": "AreaFeaturesTTSSync (HTTP/WebSocket), async (long text), streamingSegment-basedMulti-voice, multi-emotion synthesis from segments.json, auto mergeVoiceCloning (10s–5min), design (text prompt), managementAudioFormat conversion, merge, normalize, trim, remove silence (FFmpeg)"
      },
      {
        "title": "File structure:",
        "body": "mmVoice_Maker/\n├── SKILL.md                       # This overview\n├── mmvoice.py                     # CLI tool (recommended for Agents)\n├── check_environment.py           # Environment verification\n├── requirements.txt\n├── scripts/                       # Entry: scripts/__init__.py\n│   ├── utils.py                   # Config, data classes\n│   ├── sync_tts.py                # HTTP/WebSocket TTS\n│   ├── async_tts.py               # Long text TTS\n│   ├── segment_tts.py             # Segment-based TTS (multi-voice, multi-emotion)\n│   ├── voice_clone.py             # Voice cloning\n│   ├── voice_design.py            # Voice design\n│   ├── voice_management.py        # List/delete voices\n│   └── audio_processing.py        # FFmpeg audio tools\n└── reference/                     # Load as needed\n    ├── cli-guide.md               # CLI usage guide\n    ├── getting-started.md         # Setup and quick test\n    ├── tts-guide.md               # Sync/async TTS workflows\n    ├── voice-guide.md             # Clone/design/manage\n    ├── audio-guide.md             # Audio processing\n    ├── script-examples.md         # Runnable code snippets\n    ├── troubleshooting.md         # Common issues\n    ├── api_documentation.md       # Complete API reference\n    └── voice_catalog.md           # Voice selection guide"
      },
      {
        "title": "Main Workflow Guideline (Text to Speech)",
        "body": "6-step workflow:\n[step1]. Verify environment\n\n[step2-preparation]⚠️NOTE: Before processing the text, you must read voice-catalog.md for voice selection.\n\n[step2]. Process text into script → <cwd>/audio/segments.json. Note: [Step2.4] is really important, you must check it twice before sending the script to the user.\n\n[step2.5]. ⚠️ Generate preview for user confirmation (highly recommended for multi-voice content)\n\n[step3]. Present plan to user for confirmation\n\n[step4]. Validate segments.json\n\n[step5]. Generate and merge audio → intermediate files in <cwd>/audio/tmp/, final output in <cwd>/audio/output.mp3\n\n[step6]. ⚠️ CRITICAL: User confirms audio quality FIRST → THEN cleanup temp files (only after user is satisfied)\n\n<cwd> is Claude's current working directory (not the skill directory). Audio files are saved relative to where Claude is running commands."
      },
      {
        "title": "Step 1: Verify environment",
        "body": "python check_environment.py\n\nChecks:\n\nPython 3.8+\nRequired packages (requests, websockets)\nFFmpeg installation\nMINIMAX_VOICE_API_KEY environment variable\n\nIf API key is not set, ask user for keys and set it:\n\nexport MINIMAX_VOICE_API_KEY=\"your-api-key-here\""
      },
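      {
        "title": "Illustrative sketch: Step 1 environment checks",
        "body": "check_environment.py ships with the package and is the authoritative check. The sketch below only illustrates the kinds of checks Step 1 describes (Python 3.8+, requests/websockets, FFmpeg on PATH, MINIMAX_VOICE_API_KEY); its structure and messages are hypothetical, not the script's actual code.\n\n```python\n# Illustrative only; the packaged check_environment.py is authoritative.\nimport os\nimport shutil\nimport sys\n\n\ndef check_environment() -> bool:\n    ok = True\n    # Python 3.8+ (per the Requirements section)\n    if sys.version_info < (3, 8):\n        print(\"FAIL: Python 3.8+ required\")\n        ok = False\n    # Required packages: requests, websockets\n    for pkg in (\"requests\", \"websockets\"):\n        try:\n            __import__(pkg)\n        except ImportError:\n            print(\"FAIL: missing package \" + pkg)\n            ok = False\n    # FFmpeg must be on PATH for audio processing\n    if shutil.which(\"ffmpeg\") is None:\n        print(\"FAIL: ffmpeg not found on PATH\")\n        ok = False\n    # API key environment variable\n    if not os.environ.get(\"MINIMAX_VOICE_API_KEY\"):\n        print(\"FAIL: MINIMAX_VOICE_API_KEY is not set\")\n        ok = False\n    return ok\n\n\nif __name__ == \"__main__\":\n    sys.exit(0 if check_environment() else 1)\n```"
      },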
      {
        "title": "Step 2: Decision and Pre-processing",
        "body": "⚠️ MOST IMPORTANT PRINCIPLE: Gender Matching First\n\nBefore selecting voices, you MUST always match gender first. This is non-negotiable.\n\nGolden Rule:\n\nIf a character is male → use male voice\nIf a character is female → use female voice\nIf a character is neutral/other → choose appropriate neutral voice\n\nWhy this matters:\n\nViolating gender matching (e.g., male character with female voice) breaks immersion\nEven if personality traits match, gender comes first\nThis is especially critical for classic literature, historical content, and professional narration\n\nExamples:\n\nCharacterWrong VoiceCorrect Voice唐三藏 (male monk)female-yujie ❌Chinese (Mandarin)_Gentleman ✅林黛玉 (female)male-qn-badao ❌female-shaonv ✅曹操 (male warlord)female-chengshu ❌Chinese (Mandarin)_Unrestrained_Young_Man ✅\n\nDecision guide:\nEvaluate based on:\n\nDoes the user specify a model? → Use that model, or use the default one \"speech-2.8\"\nIs multi-voice needed? → Different voice_id per speaker/character\nFor speech-2.8: emotion is auto-matched (leave emotion empty)\nFor older models: manually specify emotion tags\n\nUse case scenarios:\n\nScenarioDescriptionSegmentsVoice SelectionSingle VoiceUser needs one voice for the entire content. Segment only by length (≤1,000,000 chars per segment).Split by length onlyOne voice_id for all segmentsMulti-VoiceMultiple characters/speakers, each with different voice. Segment by speaker/role changes.Split by logical unit (speaker, dialogue, etc.)Different voice_id per rolePodcast/InterviewHost and guest speakers with distinct voices.Split by speakerVoice per host/guestAudiobook/FictionNarrator and character voices.Split by narration vs. dialogueVoice per narrator/characterDocumentaryMostly narration with occasional quotes.Keep as one segmentSingle narrator voiceReport/AnnouncementFormal content with consistent tone.Keep as one segmentProfessional voice\n\nProcessing Workflow (4 sub-steps):\n\nStep 2.1: Text Segmentation and Role Analysis\nFirst, segment your text into logical units and identify the role/character for each segment.\n\nKey principle (Important!): Split by logical unit, NOT simply by sentence\n\nWhen to split (Important!):\n\nDifferent speakers clearly marked\nNarrator vs. character dialogue (in fiction/audiobooks/interview etc.)\nIn some scenarios (like audiobooks, multi-voice fiction etc.), where speaker's identity is important, split when narration and dialogue mix in the same sentence.\n\nWhen NOT to split (Important!):\n\nThird-person narration like \"John said...\" or \"The reporter noted...\"\nQuoted speech in narration (in documentary/podcast/report etc.) should keep in narrator's voice\nKeep in narrator's voice unless specific characterization is needed\n\nDecision depends on use case:\n\nUse caseExampleSplit strategySingle VoiceLong article, news piece, announcementSplit by length (≤1,000,000 chars), same voice for allPodcast/Interview\"Host: Welcome to the show. Guest: Thank you for having me.\"Split by speakerDocumentary narration\"The scientist explained, 'The results are promising.'\"Keep as one segment (narrator voice)Audiobook/Fiction\"'Who's there?' 
she whispered.\"Split: \"'Who's there?'\" should be in character voice, while \"she whispered.\" should be in narrator's voiceReport\"According to the report, the economy is growing.\"Keep as one segment\n\nExample1: Single Voice (speech-2.8)\nFor single-voice content (e.g., news, announcements, articles), segment only by length while maintaining the same voice:\n\n[\n  {\"text\": \"First part of the article (under 1,000,000 chars)...\", \"role\": \"narrator\", \"voice_id\": \"female-shaonv\", \"emotion\": \"\"},\n  {\"text\": \"Second part of the article (under 1,000,000 chars)...\", \"role\": \"narrator\", \"voice_id\": \"female-shaonv\", \"emotion\": \"\"},\n  {\"text\": \"Third part of the article (under 1,000,000 chars)...\", \"role\": \"narrator\", \"voice_id\": \"female-shaonv\", \"emotion\": \"\"}\n]\n\nExample2: Audiobook with characters (speech-2.8)\nIn audiobooks (multi-voice fiction), split when narration and dialogue mix in the same sentence:\n\n[\n  {\"text\": \"The detective entered the room.\", \"role\": \"narrator\", \"voice_id\": \"\", \"emotion\": \"\"},\n  {\"text\": \"\\\"Who's there?\\\"\", \"role\": \"female_character\", \"voice_id\": \"\", \"emotion\": \"\"},\n  {\"text\": \"she whispered.\", \"role\": \"narrator\", \"voice_id\": \"\", \"emotion\": \"\"},\n  {\"text\": \"\\\"It's me,\\\"\", \"role\": \"male_character\", \"voice_id\": \"\", \"emotion\": \"\"},\n  {\"text\": \"he replied calmly.\", \"role\": \"narrator\", \"voice_id\": \"\", \"emotion\": \"\"}\n]\n\nExample3: Documentary/podcast narration (speech-2.8)\nQuoted speech in narration stays in narrator's voice (no need to split):\n\n[\n  {\n    \"text\": \"The scientist explained, \\\"The results show significant improvement in all test groups.\\\"\",\n    \"role\": \"narrator\",\n    \"voice_id\": \"\",\n    \"emotion\": \"\"\n  },\n  {\n    \"text\": \"According to the latest report, the economy has grown by 3% this quarter.\",\n    \"role\": \"narrator\",\n    \"voice_id\": \"\",\n    \"emotion\": \"\"\n  }\n]\n\n**Note:** In the preliminary `segments.json`:\n- Fill in the `text` field with segment content\n- Fill in the `role` field to identify the character (narrator, male_character, female_character, host, guest, etc.)\n- Leave `voice_id` empty (to be filled in Step 2.2)\n- Leave `emotion` empty for speech-2.8 models\n\n\n**Step 2.2: Voice Selection**\n\nAfter segmenting and labeling roles, analyze all detected characters in your text. Consult [voice_catalog.md](reference/voice_catalog.md) **Section 1 \"How to Choose a Voice\"** to match voices to characters.\n\n**⚠️ CRITICAL: Follow the two-step selection process below**\n\n**Path A — Professional domains (Story/Narration, News/Announcements, Documentary):**\nIf the content belongs to one of these three professional domains, prioritize selecting from the recommended voices in **voice_catalog.md Section 2.1** (filter by scenario + gender). These voices are specifically optimized for their professional use cases.\n\n**Path B — All other scenarios:**\nSelect from **voice_catalog.md Section 2.2**, following this strict priority hierarchy:\n\n1. **First: Match Gender** (non-negotiable) — Male characters MUST use male voices, female characters MUST use female voices\n2. **Second: Match Language** — The voice MUST match the content language (Chinese content → Chinese voice, Korean content → Korean voice, English content → English voice, etc.). Never assign a voice from the wrong language.\n3. 
**Third: Match Age** — Determine the age group (Children / Youth / Adult / Elderly / Professional) and select from the corresponding subsection in Section 2.2\n4. **Fourth: Match Personality & Role** — Choose the best fit based on personality traits, tone, and character role\n\n**Voice Selection Decision Tree:**\n\nIs this a professional domain (Story/News/Documentary)?\n├── YES → Select from voice_catalog Section 2.1 (filter by scenario + gender)\n└── NO → Select from voice_catalog Section 2.2:\nStep 1: Match Gender\n├── Male character → Male voices only\n└── Female character → Female voices only\nStep 2: Match Age Group\n└── Children / Youth / Adult / Elderly / Professional\nStep 3: Match Language\n└── Filter to voices matching the content language\nStep 4: Match Personality & Role\n└── Choose best fit by tone, personality, character role\n\n**Step 2.3: Emotions Segmentation** *(For non-2.8 series models only)*\nFor models other than speech-2.8 series, analyze emotions in your segments:\n- For **long segments**, split further based on **emotional transitions**\n- Add appropriate **emotion tags** to each segment\n- Refer to Section 3 in [text-processing.md](reference/text-processing.md) for emotion tags and examples\n- Skip this step for speech-2.8 models (emotion is auto-matched)\n\n**Emotion Tags:**\n- For speech-2.6 series (speech-2.6-hd and speech-2.6-turbo): happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper\n- For older models: happy, sad, angry, fearful, disgusted, surprised, calm (7 emotions)\n\n\n**Step 2.4: Check and Post-processing**\nFinally, review and optimize your script:\n- Verify segment length limits (async TTS ≤1,000,000 characters)\n- Clean up conversational text (remove speaker names if needed)\n- Ensure consistency in voice and emotion tags\n- **Critical check for multi-voice content**: For audiobooks, multi-voice fiction, or content where dialogue is presented from a first-person perspective, verify that narration and dialogue mixed in the same sentence are properly split.\n\n  **When splitting IS needed (first-person dialogue in fiction/audiobooks):**\n  \n  Example: `\"John asked, 'Where are you going?'\"` should be split into:\n  - Segment 1: `\"John asked, \"` - uses narrator voice (describes who is speaking)\n  - Segment 2: `\"Where are you going?\"` - uses the character's voice (actual dialogue in first-person)\n\n  This ensures proper voice differentiation: descriptive narration uses the narrator's voice, while the character's spoken words use the character's designated voice.\n\n  **When splitting is NOT needed (third-person quotes in podcast/documentary/news):**\n  \n  In podcasts, documentaries, or news reports, quoted speech is typically presented in third-person narrative style - the speaker's words are being reported, not performed. 
Keep these as one segment with the narrator's voice and remove the speaker's name at the beginning:\n  \n  - `\"Welcome to our show.\" → narrator voice, remove the speaker's name (like \"The host said:\") at the beginning\n  - `\"According to experts, 'This technology represents a significant breakthrough.'\" → keep as one segment (narrator voice)\n  - `\"Scientists noted, 'The experimental results exceeded our expectations.'\" → keep as one segment (narrator voice)\n- **If the split is missing**: Go back to Step 2.1 and ensure dialogue portions are separated from narration with appropriate role labels.\n\n**Create segments.json:**\nAfter completing all 4 sub-steps, save the final `segments.json` to `<cwd>/audio/segments.json`.\n\n\n### Step 2.5: Generate Preview for User Confirmation (Highly Recommended)\n\n**For multi-voice content (audiobooks, dramas, etc.), always generate a preview first.**\n\nThis saves time and prevents waste when voice selections need adjustment.\n\n**How to generate a preview:**\n1. Create a smaller segments file with 10-20 representative segments (include all characters)\n2. Generate the preview audio\n3. Ask user to listen and confirm voice choices\n\n**Preview segments.json example:**\n```json\n[\n  {\"text\": \"Narration opening...\", \"role\": \"narrator\", \"voice_id\": \"...\", \"emotion\": \"\"},\n  {\"text\": \"Male character speaks...\", \"role\": \"male_character\", \"voice_id\": \"...\", \"emotion\": \"\"},\n  {\"text\": \"Female character speaks...\", \"role\": \"female_character\", \"voice_id\": \"...\", \"emotion\": \"\"},\n  {\"text\": \"More dialogue...\", \"role\": \"...\", \"voice_id\": \"...\", \"emotion\": \"\"}\n]\n\nPreview command:\n\npython mmvoice.py generate segments_preview.json -o preview.mp3\n\nWhen user confirms preview:\n\nUse the same voice selections for the full segments.json\nNo need to re-select voices"
      },
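      {
        "title": "Illustrative sketch: Step 2.2 voice selection priority",
        "body": "A minimal sketch of the Path B priority order from Step 2.2 (gender first, then language, then age, then personality and role). The Voice data model, its field values, and the pick_voice function are hypothetical illustrations; the real voice metadata lives in reference/voice_catalog.md.\n\n```python\n# Hypothetical data model; real voice metadata lives in reference/voice_catalog.md.\nfrom dataclasses import dataclass\nfrom typing import List, Set\n\n\n@dataclass\nclass Voice:\n    voice_id: str\n    gender: str      # \"male\" / \"female\" / \"neutral\"\n    language: str    # e.g. \"Chinese\", \"English\", \"Korean\"\n    age_group: str   # \"Children\" / \"Youth\" / \"Adult\" / \"Elderly\" / \"Professional\"\n    traits: Set[str]\n\n\ndef pick_voice(catalog: List[Voice], gender: str, language: str,\n               age_group: str, wanted: Set[str]) -> Voice:\n    # 1. Gender is non-negotiable: filter first and never relax it.\n    pool = [v for v in catalog if v.gender == gender]\n    # 2. The voice language must match the content language.\n    pool = [v for v in pool if v.language == language]\n    # 3. Prefer the matching age group, but fall back rather than fail.\n    aged = [v for v in pool if v.age_group == age_group] or pool\n    # 4. Personality and role: the largest trait overlap wins.\n    return max(aged, key=lambda v: len(v.traits & wanted))\n```\n\nIf no voice survives the gender and language filters, max raises ValueError, mirroring the rule that those two matches must never be relaxed."
      },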
      {
        "title": "Step 3: Present plan to user for confirmation",
        "body": "Before proceeding to validation and generation, present the segmentation plan to the user and wait for confirmation:\n\nPresent to the user:\n\nRoles identified: List all characters/speakers in the text\nVoice assignments: Show which voice_id is assigned to each role (include voice characteristics from voice_catalog.md)\nModel being used: Explain why this model was selected\nLanguage: Confirm the primary language of the content\nEmotion approach: Auto-matched (speech-2.8) or manual tags (older models)\n\nExample confirmation message:\n\nI've analyzed the text and created a segmentation plan:\n\n**Roles and Voices:**\n- Narrator: male-qn-jingying (deep, authoritative, suitable for storytelling)\n- Protagonist: female-shaonv (bright, energetic, youthful)\n- Antagonist: male-qn-qingse (cool, menacing)\n\n**Model:** speech-2.8-hd (recommended - automatic emotion matching)\n**Language:** Chinese\n**Segments:** 8 segments total\n\nPlease review and confirm:\n1. ⚠️ **Gender Verification**: Do the voice genders match the character genders?\n   - [Narrator: Male ✓] [Protagonist: Female ✓] [Antagonist: Male ✓]\n2. ⚠️ **Language Verification**: Do the voice languages match the content language?\n   - [All voices: Chinese ✓]\n3. Are the voice assignments appropriate for each character (age, personality)?\n4. Should any segments be combined or split differently?\n5. Any other changes you'd like to make?\n\n**After generation:**\n- I'll generate a preview first for you to review\n- Only after you confirm the audio quality will I clean up temporary files\n- If not satisfied, I'll re-generate and we iterate until you're happy\n\nReply \"confirm\" to proceed, or let me know what to adjust.\n\nWait for user response:\n\nIf user confirms → Proceed to Step 4 (validate)\nIf user suggests changes → Update segments.json and present the plan again for confirmation"
      },
      {
        "title": "Step 4: Validate segments.json (model, emotion, voice_id validation)",
        "body": "Before generating audio, validate the segments file:\n\n# Default: speech-2.8-hd (auto emotion matching)\npython mmvoice.py validate <cwd>/audio/segments.json\n\n# Specify model for context-specific validation\npython mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd\n\n# Validate voice_ids against available voices (slower, requires API call)\npython mmvoice.py validate <cwd>/audio/segments.json --validate-voices\n\n# Combined options (recommended)\npython mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices\n\n# Use `--verbose` to see segment details\npython mmvoice.py validate <cwd>/audio/segments.json --model speech-2.6-hd --validate-voices --verbose\n\nEmotion Validation checks:\n\nModelEmotion Validationspeech-2.8-hd/turboEmotion can be empty (auto emotion matching)speech-2.6-hd/turboAll 9 emotions supportedOlder modelshappy, sad, angry, fearful, disgusted, surprised, calm (7 emotions)\n\nVoice ID validation:\nWith --validate-voices:\n\nCalls API once to get all available voices\nValidates each voice_id against the list\nShows errors for invalid voice_ids (blocks validation)"
      },
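      {
        "title": "Illustrative sketch: Step 4 emotion rules",
        "body": "python mmvoice.py validate is the supported tool; the sketch below only restates the per-model emotion rules from the table above as code. The emotion_errors function name and error format are hypothetical, not mmvoice.py's actual logic.\n\n```python\n# Illustrative emotion check per model family; not mmvoice.py's actual logic.\nEMOTIONS_9 = {\"happy\", \"sad\", \"angry\", \"fearful\", \"disgusted\",\n              \"surprised\", \"calm\", \"fluent\", \"whisper\"}\nEMOTIONS_7 = EMOTIONS_9 - {\"fluent\", \"whisper\"}\n\n\ndef emotion_errors(segments, model=\"speech-2.8-hd\"):\n    errors = []\n    for i, seg in enumerate(segments):\n        emotion = seg.get(\"emotion\", \"\")\n        if model.startswith(\"speech-2.8\"):\n            continue  # emotion is auto-matched; empty is expected\n        allowed = EMOTIONS_9 if model.startswith(\"speech-2.6\") else EMOTIONS_7\n        if emotion and emotion not in allowed:\n            errors.append(f\"segment {i}: emotion '{emotion}' not valid for {model}\")\n    return errors\n```"
      },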
      {
        "title": "Step 5: Generate and merge audio",
        "body": "Generate audio for all segments and merge into final output.\n\nFile placement (default behavior if user doesn't specify):\n\n<cwd>/                      # Claude's current working directory\n└── audio/                  # Created automatically\n    ├── tmp/                # Intermediate segment files\n    │   ├── segment_0000.mp3\n    │   ├── segment_0001.mp3\n    │   └── ...\n    └── <custom_audio_name>.mp3             # Final merged audio, name can be customized\n\nWhere <cwd> is Claude's current working directory (where commands are executed).\n\nIf -o is not specified, output goes to <cwd>/audio/output.mp3\nIntermediate files go to <cwd>/audio/tmp/\nAfter user confirms the final audio, ask whether to delete <cwd>/audio/tmp/\n\nBasic usage:\n\n# Default: speech-2.8-hd, output to <cwd>/audio/output.mp3\npython mmvoice.py generate <cwd>/audio/segments.json\n\n# Specify output path\npython mmvoice.py generate <cwd>/audio/segments.json -o <cwd>/audio/<custom_audio_name>.mp3\n\n# Specify model if needed\npython mmvoice.py generate <cwd>/audio/segments.json --model speech-2.6-hd\n\nSkip existing segments (for rate limit retries):\n\n# Only generate segments that don't exist yet - skips already-generated files\npython mmvoice.py generate <cwd>/audio/segments.json --skip-existing\n\nError handling:\n\nIf a segment fails, the script reports which segment and why\nUse --continue-on-error to generate remaining segments despite failures\nUse --skip-existing to skip already successfully generated segments (recommended for retries after rate limit)\nThe script automatically uses fallback merging if FFmpeg filter_complex fails"
      },
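      {
        "title": "Illustrative sketch: Step 5 concat-demuxer merge fallback",
        "body": "Step 5 notes that merging falls back automatically when FFmpeg's filter_complex path fails. The sketch below shows the standard concat-demuxer fallback pattern, assuming segment paths contain no single quotes; scripts/audio_processing.py is the authoritative implementation and may differ.\n\n```python\n# Illustrative concat-demuxer merge; audio_processing.py is authoritative.\nimport subprocess\nimport tempfile\n\n\ndef merge_with_concat_demuxer(segment_paths, output_path):\n    # Write a concat list file: one \"file '<path>'\" line per segment.\n    with tempfile.NamedTemporaryFile(\"w\", suffix=\".txt\", delete=False) as f:\n        for path in segment_paths:\n            f.write(f\"file '{path}'\\n\")\n        list_path = f.name\n    # -f concat -safe 0 reads the list; -c copy avoids re-encoding.\n    subprocess.run(\n        [\"ffmpeg\", \"-y\", \"-f\", \"concat\", \"-safe\", \"0\",\n         \"-i\", list_path, \"-c\", \"copy\", output_path],\n        check=True,\n    )\n```"
      },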
      {
        "title": "Step 6: Confirm and cleanup",
        "body": "⚠️ CRITICAL: Never delete temp files until user confirms!\n\nAfter generation completes, you MUST follow this exact sequence:\n\nStep 6.1: Report generation result to user\n\n✓ Audio saved to: <output_path>\n  Generated: X/Y segments\n  Intermediate files in: <cwd>/audio/tmp/\n\nStep 6.2: Ask user to confirm audio quality\nAsk the user to listen to the audio and confirm:\n\nIs the audio quality satisfactory?\nAre all voices appropriate?\nAny adjustments needed?\n\nStep 6.3: Wait for user response\n\nStep 6.4: Only after user confirms, offer cleanup\n\nAfter confirming audio quality, temporary files can be deleted with:\nrm -rf <cwd>/audio/tmp/\n\nNEVER execute rm -rf on temp files without explicit user confirmation!\n\nIf user is NOT satisfied:\n\nDo NOT delete temp files\nDiscuss what needs to be adjusted\nRe-generate affected segments if needed\nAsk for confirmation again"
      },
      {
        "title": "Other Usage",
        "body": "Use the following when the task involves voice creation, single-voice TTS (sync/async), or audio processing instead of the main segment-based workflow. Each subsection gives CLI commands, script paths, and the reference doc to open for details."
      },
      {
        "title": "Voice creation (clone / design / list)",
        "body": "Purpose: Create custom voices from audio (clone) or from a text description (design); list system and custom voices.\nCLI (entry point: mmvoice.py):\npython mmvoice.py clone AUDIO_FILE --voice-id VOICE_ID   # Clone from 10s–5min audio\npython mmvoice.py design \"DESCRIPTION\" --voice-id ID      # Design from text\npython mmvoice.py list-voices                             # List all voices\n\n\nScripts: scripts/voice_clone.py (clone), scripts/voice_design.py (design), scripts/voice_management.py (list/manage).\nDocumentation: reference/voice-guide.md — cloning (quick + high-quality + step-by-step), design workflow, management."
      },
      {
        "title": "Text-to-speech (sync / async)",
        "body": "Purpose: Single-voice TTS: sync for short text (≤10k chars), async for long text (up to 1M chars); optional streaming.\nCLI:\npython mmvoice.py tts \"TEXT\" -o OUTPUT.mp3 [-v VOICE_ID] [--model MODEL]\n\n\nScripts: scripts/sync_tts.py (HTTP/WebSocket sync), scripts/async_tts.py (async task + poll).\nDocumentation: reference/tts-guide.md — sync TTS, async TTS, streaming, segment-based production."
      },
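      {
        "title": "Illustrative sketch: choosing sync vs async TTS",
        "body": "A minimal dispatch sketch based on the length limits this skill states (sync ≤10,000 chars, async ≤1,000,000 chars). The tts_mode function is hypothetical; the mmvoice.py tts command makes this choice for you.\n\n```python\n# Illustrative dispatch between sync and async TTS by text length.\nSYNC_LIMIT = 10_000       # sync TTS handles up to 10,000 chars\nASYNC_LIMIT = 1_000_000   # async TTS handles up to 1,000,000 chars\n\n\ndef tts_mode(text: str) -> str:\n    if len(text) <= SYNC_LIMIT:\n        return \"sync\"   # scripts/sync_tts.py (HTTP/WebSocket)\n    if len(text) <= ASYNC_LIMIT:\n        return \"async\"  # scripts/async_tts.py (submit task, then poll)\n    raise ValueError(\"text exceeds the 1,000,000-char async limit\")\n```"
      },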
      {
        "title": "Audio processing (merge / convert / normalize)",
        "body": "Purpose: Merge files (with optional crossfade), convert format, normalize loudness, trim.\nCLI:\npython mmvoice.py merge FILE1 [FILE2 ...] -o OUTPUT [--crossfade MS]\npython mmvoice.py convert INPUT -o OUTPUT [--format FORMAT]\n\n\nScript: scripts/audio_processing.py (merge, convert, normalize, trim).\nDocumentation: reference/audio-guide.md — format conversion, merging (filter_complex + concat demuxer fallback), normalization, trimming, optimization."
      },
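      {
        "title": "Illustrative sketch: loudness normalization",
        "body": "The merge/convert CLI above is the supported path; this sketch only shows one common way a normalize step can sit on FFmpeg's loudnorm filter. The normalize function and the target values (-16 LUFS, -1.5 dBTP, LRA 11) are illustrative assumptions, not the skill's actual settings.\n\n```python\n# Illustrative single-pass loudnorm; the skill's settings may differ.\nimport subprocess\n\n\ndef normalize(input_path, output_path, target_lufs=-16.0):\n    subprocess.run(\n        [\"ffmpeg\", \"-y\", \"-i\", input_path,\n         \"-af\", f\"loudnorm=I={target_lufs}:TP=-1.5:LRA=11\",\n         output_path],\n        check=True,\n    )\n```"
      },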
      {
        "title": "Segment-based TTS (main workflow)",
        "body": "CLI: validate and generate as in Steps 4–5 above.\nScript: scripts/segment_tts.py.\nDocumentation: reference/cli-guide.md, reference/api_documentation.md."
      },
      {
        "title": "Reference documents (on-demand)",
        "body": "Open these when you need concrete usage, parameters, or troubleshooting. Paths are relative to the skill root.\n\nDocumentContent for the Agentreference/cli-guide.mdAll CLI commands (validate, generate, tts, clone, design, list-voices, merge, convert, check-env) with options and examples. Use for correct CLI invocation.reference/getting-started.mdEnvironment setup (venv, pip install, FFmpeg), MINIMAX_VOICE_API_KEY, basic synthesis test. Use for first-time setup or “env not working”.reference/tts-guide.mdSync TTS (short text), async TTS (long text), streaming TTS, multi-segment production. Use for sync/async/streaming logic and parameters.reference/voice-guide.mdVoice cloning (quick, high-quality with prompt audio, step-by-step), voice design, voice management. Use for custom voice creation flows.reference/audio-guide.mdFormat conversion, merging (including crossfade and fallback), normalization, trimming, optimization. Use for merge/convert/normalize behavior and options.reference/script-examples.mdCopy-paste runnable examples for sync TTS, async TTS, segment-based TTS, audio processing, voice clone/design/management. Use for quick Python snippets.reference/troubleshooting.mdEnvironment (API key, FFmpeg), API errors, segment-based TTS, audio, voice. Use when an error message or unexpected behavior appears.reference/api_documentation.mdFull API reference: config, sync/async TTS, emotion parameter, segment-based TTS, voice clone/design/management, audio processing, common parameters, error handling. Use for exact function signatures and parameter details.reference/voice_catalog.mdSystem voices list (male/female/beta), selection guide, voice parameters, custom voices, voice IDs. Use to choose or look up voice_id."
      },
      {
        "title": "Requirements",
        "body": "Python: 3.8 or higher\nAPI Key: MINIMAX_VOICE_API_KEY environment variable must be set\nFFmpeg: Required for audio processing (merge, convert, normalize)\n\nInstall: brew install ffmpeg (macOS) or sudo apt install ffmpeg (Ubuntu)"
      },
      {
        "title": "Limits and constraints",
        "body": "Text length: Sync TTS ≤10,000 chars; async TTS ≤1,000,000 chars\nVoice cloning: Audio must be 10s–5min duration, ≤20MB, formats: mp3/wav/m4a\nVoice expiration: Custom voices (cloned/designed) expire after 7 days if not used with TTS"
      },
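      {
        "title": "Illustrative sketch: pre-flight check for cloning input",
        "body": "A minimal sketch that enforces the cloning constraints above (mp3/wav/m4a, ≤20MB, 10s–5min) before uploading, using ffprobe for the duration. The check_clone_input function is hypothetical and not part of the skill's scripts.\n\n```python\n# Illustrative pre-flight check for the cloning constraints above.\nimport os\nimport subprocess\n\n\ndef check_clone_input(path):\n    ext = os.path.splitext(path)[1].lower().lstrip(\".\")\n    assert ext in {\"mp3\", \"wav\", \"m4a\"}, \"format must be mp3/wav/m4a\"\n    assert os.path.getsize(path) <= 20 * 1024 * 1024, \"file exceeds 20MB\"\n    # ffprobe prints the container duration in seconds.\n    out = subprocess.run(\n        [\"ffprobe\", \"-v\", \"error\", \"-show_entries\", \"format=duration\",\n         \"-of\", \"default=noprint_wrappers=1:nokey=1\", path],\n        capture_output=True, text=True, check=True,\n    ).stdout.strip()\n    assert 10.0 <= float(out) <= 300.0, \"duration must be 10s-5min\"\n```"
      },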
      {
        "title": "Special features",
        "body": "Pause insertion: Use <#x#> in text where x = pause duration in seconds (0.01–99.99)\n\nExample: \"Hello<#1.5#>world\" creates 1.5s pause between words\n\n\nSupported emotions: happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper\n\nspeech-2.8: automatic matching; speech-2.6: all 9; older models: first 7"
      },
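      {
        "title": "Illustrative sketch: building text with pause markers",
        "body": "A tiny helper showing the <#x#> pause syntax described above. Only the marker syntax and its 0.01–99.99s range come from the skill; the with_pauses helper itself is hypothetical.\n\n```python\n# Joins text parts with MiniMax pause markers, e.g. \"Hello<#1.5#>world\".\ndef with_pauses(parts, pause_seconds=1.5):\n    if not 0.01 <= pause_seconds <= 99.99:\n        raise ValueError(\"pause must be between 0.01 and 99.99 seconds\")\n    return f\"<#{pause_seconds}#>\".join(parts)\n\n\nprint(with_pauses([\"Hello\", \"world\"]))  # Hello<#1.5#>world\n```"
      },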
      {
        "title": "Troubleshooting",
        "body": "Run python check_environment.py to diagnose setup issues\nSee troubleshooting.md for common problems and solutions\nCheck getting-started.md for detailed setup instructions"
      }
    ]
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/BLUE-coconut/mm-voice-maker",
    "publisherUrl": "https://clawhub.ai/BLUE-coconut/mm-voice-maker",
    "owner": "BLUE-coconut",
    "version": "1.0.1",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/mm-voice-maker",
    "downloadUrl": "https://openagent3.xyz/downloads/mm-voice-maker",
    "agentUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent",
    "manifestUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/mm-voice-maker/agent.md"
  }
}