# Send Audio Speaker Tools to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "audio-speaker-tools",
    "name": "Audio Speaker Tools",
    "source": "tencent",
    "type": "skill",
    "category": "AI 智能",
    "sourceUrl": "https://clawhub.ai/cmfinlan/audio-speaker-tools",
    "canonicalUrl": "https://clawhub.ai/cmfinlan/audio-speaker-tools",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/audio-speaker-tools",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=audio-speaker-tools",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "SKILL.md",
      "references/elevenlabs-cloning.md",
      "references/scoring-guide.md",
      "scripts/compare_voices.py",
      "scripts/diarize_and_slice_mps.py",
      "scripts/setup_venv.sh"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-23T16:43:11.935Z",
      "expiresAt": "2026-04-30T16:43:11.935Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=4claw-imageboard",
        "contentDisposition": "attachment; filename=\"4claw-imageboard-1.0.1.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/audio-speaker-tools"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/audio-speaker-tools",
    "downloadUrl": "https://openagent3.xyz/downloads/audio-speaker-tools",
    "agentUrl": "https://openagent3.xyz/skills/audio-speaker-tools/agent",
    "manifestUrl": "https://openagent3.xyz/skills/audio-speaker-tools/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/audio-speaker-tools/agent.md"
  }
}
```
## Documentation

### Audio Speaker Tools

Tools for speaker separation, voice comparison, and audio processing using Demucs, pyannote, and Resemblyzer.

### Overview

This skill provides three main workflows:

- Speaker separation: extract per-speaker audio from multi-speaker recordings
- Voice comparison: measure speaker similarity between two audio files
- Audio processing: segment extraction and voice isolation

### Setup Virtual Environment

Run once to create the venv and install dependencies:

    bash scripts/setup_venv.sh

Default venv location: ./.venv

Requirements:

- Python 3.9+
- ffmpeg (`brew install ffmpeg`)
- HuggingFace token (set as the `HF_TOKEN` environment variable)

### 1. Speaker Separation: diarize_and_slice_mps.py

Separate speakers from multi-speaker audio:

    # Basic usage
    HF_TOKEN=<your-hf-token> \
      /path/to/venv/bin/python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir /path/to/output \
      --prefix MyShow

    # With speaker constraints
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input audio.mp3 \
      --outdir ./out \
      --min-speakers 2 \
      --max-speakers 5 \
      --pad-ms 100

Process:

- Converts the input to 16kHz mono WAV
- Runs Demucs vocal/background separation (optional, for cleaner input)
- Runs pyannote speaker diarization (MPS-accelerated)
- Extracts concatenated per-speaker WAV files

Output:

- `<prefix>_speaker1.wav`, `<prefix>_speaker2.wav`, etc. (one per detected speaker)
- `diarization.rttm` (time-stamped speaker segments)
- `segments.jsonl` (JSON segments metadata)
- `meta.json` (pipeline info and speaker index)

Important:

- Always pass the HF token via the `HF_TOKEN` env var, never as a CLI arg
- MPS first, CPU fallback: the script prefers the Metal GPU and falls back to CPU if it is unavailable
- Default output: `./separated/`
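
The `diarization.rttm` file listed above can be inspected without any audio libraries. The sketch below assumes the file follows the standard RTTM layout (a `SPEAKER` record with onset in the fourth field, duration in the fifth, and the speaker label in the eighth), which is how pyannote normally serializes its output. The function name `speaker_durations` and the sample lines are hypothetical, not taken from the package.

```python
from collections import defaultdict


def speaker_durations(rttm_text):
    """Sum speaking time per speaker label from RTTM-formatted text.

    Assumes standard RTTM fields: type, file, channel, onset, duration,
    <NA>, <NA>, speaker, <NA>, <NA>.
    """
    totals = defaultdict(float)
    for line in rttm_text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a SPEAKER record.
        if len(fields) < 8 or fields[0] != "SPEAKER":
            continue
        duration = float(fields[4])
        speaker = fields[7]
        totals[speaker] += duration
    return dict(totals)


# Hypothetical sample; a real run would read the text from out/diarization.rttm.
sample = """\
SPEAKER audio 1 0.00 4.50 <NA> <NA> SPEAKER_00 <NA> <NA>
SPEAKER audio 1 4.50 2.25 <NA> <NA> SPEAKER_01 <NA> <NA>
SPEAKER audio 1 6.75 1.50 <NA> <NA> SPEAKER_00 <NA> <NA>
"""
print(speaker_durations(sample))
```

A speaker with only a second or two of total speech is often a sign that `--max-speakers` was set too high for the recording.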

### 2. Voice Comparison: compare_voices.py

Measure similarity between two voice samples using Resemblyzer:

    # Basic comparison
    python scripts/compare_voices.py \
      --audio1 sample1.wav \
      --audio2 sample2.wav

    # JSON output
    python scripts/compare_voices.py \
      --audio1 reference.wav \
      --audio2 clone.wav \
      --threshold 0.85 \
      --json

    # Exit code = 0 if pass, 1 if fail

Scores:

- < 0.75 = Different speakers
- 0.75-0.84 = Likely same speaker
- 0.85+ = Excellent match (ideal for voice cloning validation)
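
The score bands above can be expressed as a small lookup, which is handy when post-processing many comparisons. This is a minimal sketch: `interpret_similarity` is a hypothetical helper, not part of the package, and it assumes that values between 0.84 and 0.85 belong to the middle band.

```python
def interpret_similarity(score):
    """Map a similarity score to the bands listed in this section."""
    if score >= 0.85:
        return "excellent match"
    if score >= 0.75:
        return "likely same speaker"
    return "different speakers"


for value in (0.91, 0.80, 0.62):
    print(f"{value:.2f} -> {interpret_similarity(value)}")
```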

Use cases:

- Voice clone quality assessment (compare clone vs. original)
- Speaker verification (authenticate speaker identity)
- Validate speaker separation (confirm separated speakers are distinct)

See: references/scoring-guide.md for detailed interpretation

### 3. Audio Trimming

Use ffmpeg directly for segment extraction:

    # Extract 10-second segment starting at 5 seconds
    ffmpeg -i input.mp3 -ss 5 -t 10 -c copy output.mp3

    # Extract vocals only with Demucs (before diarization)
    demucs --two-stems vocals --out ./separated input.mp3

### Workflow 1: Extract Clean Voice Sample for Cloning

Goal: Get a clean, single-speaker sample for ElevenLabs voice cloning

    # 1. Separate speakers
    HF_TOKEN=<your-hf-token> python scripts/diarize_and_slice_mps.py \
      --input podcast.mp3 --outdir ./out --prefix Podcast

    # 2. Review speaker files (out/Podcast_speaker1.wav, etc.)

    # 3. Select best sample (5-30s, clean speech)
    ffmpeg -i out/Podcast_speaker2.wav -ss 10 -t 20 -c copy sample.wav

    # 4. Upload to ElevenLabs as instant voice clone

See: references/elevenlabs-cloning.md for best practices

### Workflow 2: Validate Voice Clone Quality

Goal: Measure how well a cloned voice matches the original

    # 1. Generate test audio with ElevenLabs clone
    # (done via ElevenLabs web UI or API)

    # 2. Compare clone vs. reference
    python scripts/compare_voices.py \
      --audio1 original_sample.wav \
      --audio2 elevenlabs_clone.wav \
      --threshold 0.85 \
      --json

    # 3. Interpret score:
    #    0.85+ = excellent, publish-ready
    #    0.80-0.84 = acceptable, may need tweaking
    #    < 0.80 = poor, try different sample or settings

See: references/scoring-guide.md for troubleshooting low scores

### Workflow 3: Multi-Speaker Conversation Analysis

Goal: Separate and identify speakers in a conversation

    # 1. Run diarization
    HF_TOKEN=$TOKEN python scripts/diarize_and_slice_mps.py \
      --input meeting.mp3 --outdir ./out --prefix Meeting

    # 2. Check detected speakers (meta.json)
    cat out/meta.json

    # 3. Compare speaker pairs to confirm separation
    python scripts/compare_voices.py \
      --audio1 out/Meeting_speaker1.wav \
      --audio2 out/Meeting_speaker2.wav

    # Expected: < 0.75 if separation worked correctly
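
When diarization finds more than two speakers, step 3 has to be repeated for every pair. The sketch below builds one `compare_voices.py` command per pair using only the `--audio1` and `--audio2` flags documented above. The helper name `pairwise_compare_commands` and the file names in the demo are hypothetical.

```python
import itertools
import sys


def pairwise_compare_commands(speaker_files, script="scripts/compare_voices.py"):
    """Build one compare_voices.py command per unordered pair of files."""
    commands = []
    for first, second in itertools.combinations(speaker_files, 2):
        commands.append([
            sys.executable, script,
            "--audio1", str(first),
            "--audio2", str(second),
        ])
    return commands


# Hypothetical file names; in practice, list the *_speakerN.wav files in ./out.
files = [
    "out/Meeting_speaker1.wav",
    "out/Meeting_speaker2.wav",
    "out/Meeting_speaker3.wav",
]
for cmd in pairwise_compare_commands(files):
    print(" ".join(cmd[1:]))
    # To run a comparison: subprocess.run(cmd)
```

Three speakers yield three pairs and four yield six, so the number of comparisons grows quickly with the speaker count.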

### Device Acceleration

- pyannote diarization: MPS (Metal) by default, CPU fallback
- Resemblyzer: CPU only (no GPU acceleration)
- Demucs: MPS by default when available

To force CPU for diarization: --device cpu
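
The "MPS first, CPU fallback" behaviour can be pictured with the following sketch. It is an assumption about how such a check is typically written with PyTorch, not the script's actual code, and `pick_device` is a hypothetical name. PyTorch is imported lazily, so the function also returns a sensible answer where PyTorch is not installed.

```python
def pick_device(requested="mps"):
    """Return "mps" when requested and usable, otherwise "cpu"."""
    if requested != "mps":
        return "cpu"
    try:
        import torch  # lazy import: the check still works without PyTorch
    except ImportError:
        return "cpu"
    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())       # "mps" only on a machine with a usable Metal backend
print(pick_device("cpu"))  # always "cpu"
```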

### Audio Formats

- Input: Any format supported by ffmpeg (wav, mp3, flac, m4a, etc.)
- Processing: Internally converted to 16kHz mono WAV for diarization
- Output: WAV format (44.1kHz stereo preserved from source)
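
To reproduce the 16kHz mono conversion by hand, for example to check what the diarizer actually hears, an ffmpeg call along these lines should be equivalent. This is a sketch that assumes the standard ffmpeg options `-ac` (channel count) and `-ar` (sample rate). The script's own invocation may differ, and `to_16k_mono_command` is a hypothetical helper.

```python
import shlex


def to_16k_mono_command(src, dst):
    """Build an ffmpeg command that resamples src to 16 kHz mono WAV."""
    return [
        "ffmpeg",
        "-y",            # overwrite dst if it already exists
        "-i", src,
        "-ac", "1",      # one audio channel (mono)
        "-ar", "16000",  # 16 kHz sample rate
        dst,
    ]


cmd = to_16k_mono_command("audio.mp3", "audio_16k.wav")
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```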

### HuggingFace Token

- Required for: pyannote speaker diarization
- Access: You must accept the gated repo pyannote/speaker-diarization-3.1 on HF
- Storage: Any secure secrets manager
- Usage: Always pass via the `HF_TOKEN` env var, never as a CLI arg
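
When the diarization script is launched from another Python program, the same rule applies: the token belongs in the child process environment, not in its argument list. The sketch below is one way to arrange that. `diarize_invocation` is a hypothetical helper and the token value is a placeholder.

```python
import os
import sys


def diarize_invocation(input_path, outdir, environ=None):
    """Return (command, env) for the diarization script, token kept in env only."""
    # With environ=None the current environment (PATH etc.) is inherited.
    environ = dict(os.environ if environ is None else environ)
    if not environ.get("HF_TOKEN"):
        raise RuntimeError("HF_TOKEN is not set; export it before running.")
    command = [
        sys.executable, "scripts/diarize_and_slice_mps.py",
        "--input", input_path,
        "--outdir", outdir,
    ]
    # The token travels in the environment, never in argv, so it does not
    # appear in process listings or shell history.
    return command, environ


command, env = diarize_invocation(
    "audio.mp3", "./out", environ={"HF_TOKEN": "hf_example_placeholder"}
)
print(command[1:])
# To run it: subprocess.run(command, env=env, check=True)
```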

### Sample Quality Tips

- Shorter is better: 5-30s clean samples often score higher than 60+ second samples
- Clean audio: Remove background noise with `demucs --two-stems vocals`
- Single speaker: Ensure an isolated voice, not a mixed conversation
- Good recording: Studio mic > phone mic for voice comparison accuracy
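
A quick way to check a candidate sample against the 5-30s guideline is to read its header with Python's standard `wave` module. This is a sketch with hypothetical helper names. It handles plain PCM WAV files only, and the demo clip is generated silence rather than real speech.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return duration (seconds), channel count, and sample rate of a PCM WAV."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return {
            "seconds": wav.getnframes() / rate,
            "channels": wav.getnchannels(),
            "sample_rate": rate,
        }


def in_recommended_range(info, low=5.0, high=30.0):
    """True when the clip length falls inside the 5-30 s window."""
    return low <= info["seconds"] <= high


# Demo on a generated 6-second silent clip (16 kHz, mono, 16-bit samples).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000 * 6)

info = describe_wav(demo_path)
print(info, in_recommended_range(info))
```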

### References

- elevenlabs-cloning.md: Best practices for ElevenLabs instant voice cloning (model settings, sample selection, proven configurations)
- scoring-guide.md: How to interpret Resemblyzer similarity scores (thresholds, use cases, troubleshooting)

### "Missing HF token" error

- Export the token before running: `export HF_TOKEN=<your-token>`
- Or pass it inline: `HF_TOKEN=<your-token> python script.py ...`

### Low voice comparison scores for same speaker

- Try shorter, cleaner samples (5-30s)
- Use Demucs to isolate vocals: `demucs --two-stems vocals input.mp3`
- Ensure consistent recording quality (same mic, same environment)
- See the troubleshooting section of references/scoring-guide.md

### Diarization not detecting all speakers

- Adjust the `--min-speakers` and `--max-speakers` flags
- Check audio quality (clear speech, minimal overlap)
- Try longer audio (30+ seconds) for better speaker modeling

### MPS/Metal acceleration not working

- Ensure PyTorch has MPS support: `python -c "import torch; print(torch.backends.mps.is_available())"`
- Fall back to CPU: `--device cpu`
- Re-run setup_venv.sh to reinstall PyTorch
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: cmfinlan
- Version: 1.0.0
## Source health
- Status: healthy
- Source download looks usable.
- Yavira can redirect you to the upstream package for this source.
- Health scope: source
- Reason: direct_download_ok
- Checked at: 2026-04-23T16:43:11.935Z
- Expires at: 2026-04-30T16:43:11.935Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/audio-speaker-tools)
- [Send to Agent page](https://openagent3.xyz/skills/audio-speaker-tools/agent)
- [JSON manifest](https://openagent3.xyz/skills/audio-speaker-tools/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/audio-speaker-tools/agent.md)
- [Download page](https://openagent3.xyz/downloads/audio-speaker-tools)