Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Turn scripts into publishable voiceovers with Voice.ai TTS, including segments, chapters, captions, and video muxing.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
This skill follows the Agent Skills specification. Turn any script into a publish-ready voiceover, complete with numbered segments, a stitched master, YouTube chapters, SRT captions, and a beautiful review page. Optionally, replace the audio track on an existing video. Built for creators who want studio-quality voiceovers without the studio. Powered by Voice.ai.
| Scenario | Why it fits |
| --- | --- |
| YouTube long-form | Full narration with chapter markers and captions |
| YouTube Shorts | Quick hooks with the shortform template |
| Podcasts | Consistent host voice, intro/outro templates |
| Course content | Professional narration for educational videos |
| Quick iteration | Smart caching: edit one section, and only that segment re-renders |
| Video audio replacement | Drop an AI voiceover onto screen recordings or B-roll |
Have a script and a video? Turn them into a finished video with AI voiceover in one shot:

```
node voiceai-vo.cjs build \
  --input my-script.md \
  --voice oliver \
  --title "My Video" \
  --video ./my-recording.mp4 \
  --mux
```

This renders the voiceover, stitches the master audio, and drops it onto your video, all in one command.

Output:

- out/my-video/muxed.mp4 - your video with the new voiceover
- out/my-video/master.wav - the standalone audio
- out/my-video/review.html - listen to and review each segment
- out/my-video/chapters.txt - YouTube-ready chapter timestamps
- out/my-video/captions.srt - SRT captions

Use --sync pad if the audio is shorter than the video, or --sync trim to cut it to match.
- Node.js 20+ - runtime (no npm install needed; the CLI is a single bundled file)
- VOICE_AI_API_KEY - set as an environment variable or in a .env file in the skill root. Get a key at voice.ai/dashboard.
- ffmpeg (optional) - needed for master stitching, MP3 encoding, loudness normalization, and video muxing. The pipeline still produces individual segments, the review page, chapters, and captions without it.
The skill reads VOICE_AI_API_KEY from (in order):

1. Environment variable VOICE_AI_API_KEY
2. Environment variable VOICEAI_API_KEY (alternate)
3. .env file in the skill root

```
echo 'VOICE_AI_API_KEY=your-key-here' > .env
```

Use --mock on any command to run the full pipeline without an API key (produces placeholder audio).
```
node voiceai-vo.cjs build \
  --input <script.md or script.txt> \
  --voice <voice-alias-or-uuid> \
  --title "My Project" \
  [--template youtube|podcast|shortform] \
  [--language en] \
  [--video input.mp4 --mux --sync shortest] \
  [--force] [--mock]
```

What it does:

1. Reads the script and splits it into segments (by ## headings for .md, or by sentence boundaries for .txt)
2. Optionally prepends/appends template intro/outro segments
3. Renders each segment via Voice.ai TTS as a numbered WAV file
4. Stitches a master audio file (if ffmpeg is available)
5. Generates chapters, captions, a review page, and metadata files
6. Optionally muxes the voiceover into an existing video

Full options:

| Option | Description |
| --- | --- |
| -i, --input <path> | Script file (.txt or .md); required |
| -v, --voice <id> | Voice alias or UUID; required |
| -t, --title <title> | Project title (defaults to filename) |
| --template <name> | youtube, podcast, or shortform |
| --mode <mode> | headings or auto (default: headings for .md) |
| --max-chars <n> | Max characters per auto-chunk (default: 1500) |
| --language <code> | Language code (default: en) |
| --video <path> | Input video for muxing |
| --mux | Enable video muxing (requires --video) |
| --sync <policy> | shortest, pad, or trim (default: shortest) |
| --force | Re-render all segments (ignore cache) |
| --mock | Mock mode: no API calls, placeholder audio |
| -o, --out <dir> | Custom output directory |
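The heading-based split in step 1 can be sketched roughly like this. It is a simplified illustration, not the skill's actual implementation; segment naming and edge-case handling in the real CLI may differ:

```javascript
// Sketch: split a markdown script into segments at "## " headings.
// Text before the first heading becomes an "intro" segment.
function splitByHeadings(markdown) {
  const segments = [];
  let current = { title: 'intro', text: [] };
  for (const line of markdown.split('\n')) {
    if (line.startsWith('## ')) {
      if (current.text.join('\n').trim()) segments.push(current);
      current = { title: line.slice(3).trim(), text: [] };
    } else {
      current.text.push(line);
    }
  }
  if (current.text.join('\n').trim()) segments.push(current);
  // Number the segments, mirroring the 001-intro.wav style output names.
  return segments.map((s, i) => ({
    index: i + 1,
    title: s.title,
    text: s.text.join('\n').trim(),
  }));
}
```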
```
node voiceai-vo.cjs replace-audio \
  --video ./input.mp4 \
  --audio ./out/my-project/master.wav \
  [--out ./out/my-project/muxed.mp4] \
  [--sync shortest|pad|trim]
```

Requires ffmpeg. If it is not installed, the command generates helper shell/PowerShell scripts instead.

| Sync policy | Behavior |
| --- | --- |
| shortest (default) | Output ends when the shorter track ends |
| pad | Pad audio with silence to match the video duration |
| trim | Trim audio to match the video duration |

The video stream is copied without re-encoding (-c:v copy). Audio is encoded as AAC. A mux report is saved alongside the output.

Privacy: video processing is entirely local. Only script text is sent to Voice.ai for TTS.
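The sync policies map naturally onto standard ffmpeg flags. A rough sketch of how the argument list could be assembled; the exact flags are an assumption for illustration and not the CLI's actual code:

```javascript
// Sketch: build an ffmpeg argument list for replacing a video's audio track.
// Mirrors the documented behavior: copy the video stream, encode audio as AAC.
// The flag choices per sync policy are assumptions, not the skill's real mapping.
function muxArgs(video, audio, out, sync = 'shortest') {
  const args = [
    '-i', video, '-i', audio,
    '-map', '0:v:0', '-map', '1:a:0', // video from input 0, audio from input 1
    '-c:v', 'copy',                   // copy video without re-encoding
    '-c:a', 'aac',                    // encode audio as AAC
  ];
  if (sync === 'pad') args.push('-af', 'apad'); // pad audio with trailing silence
  // -shortest stops at the shorter track; with apad it effectively matches the
  // video duration, and with a longer audio track it trims the audio.
  args.push('-shortest', out);
  return args;
}
```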
```
node voiceai-vo.cjs voices [--limit 20] [--query "deep"] [--mock]
```
Use short aliases or full UUIDs with --voice:

| Alias | Voice | Gender | Style |
| --- | --- | --- | --- |
| ellie | Ellie | F | Youthful, vibrant vlogger |
| oliver | Oliver | M | Friendly British |
| lilith | Lilith | F | Soft, feminine |
| smooth | Smooth Calm Voice | M | Deep, smooth narrator |
| corpse | Corpse Husband | M | Deep, distinctive |
| skadi | Skadi | F | Anime character |
| zhongli | Zhongli | M | Deep, authoritative |
| flora | Flora | F | Cheerful, high pitch |
| chief | Master Chief | M | Heroic, commanding |

The voices command also returns any additional voices available on the API. The voice list is cached for 10 minutes.
After a build, the output directory contains:

```
out/<title-slug>/
  segments/         # Numbered WAV files (001-intro.wav, 002-section.wav, …)
  master.wav        # Stitched audio (requires ffmpeg)
  master.mp3        # MP3 encode (requires ffmpeg)
  manifest.json     # Build metadata: voice, template, segment list, hashes
  timeline.json     # Segment durations and start times
  review.html       # Interactive review page with audio players
  chapters.txt      # YouTube-friendly chapter timestamps
  captions.srt      # SRT captions using segment boundaries
  description.txt   # YouTube description with chapters + Voice.ai credit
```
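If you want to post-process a build, timeline.json is the natural starting point. A small sketch that turns timeline entries into YouTube-style chapter lines; the { title, start } schema with start in seconds is an assumption for illustration, so check the generated file for the real shape:

```javascript
// Sketch: format timeline entries as YouTube chapter lines ("MM:SS Title").
// Assumes each entry has { title, start } with start in seconds; the actual
// timeline.json schema may differ.
function toChapters(timeline) {
  const fmt = (s) => {
    const m = Math.floor(s / 60);
    const sec = Math.floor(s % 60);
    return `${String(m).padStart(2, '0')}:${String(sec).padStart(2, '0')}`;
  };
  return timeline.map((seg) => `${fmt(seg.start)} ${seg.title}`).join('\n');
}
```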
A standalone HTML page with:

- Master audio player (if stitched)
- Individual segment players with titles and durations
- Collapsible script text for each segment
- Regeneration command hints
Templates auto-inject intro/outro segments around the script content:

| Template | Prepends | Appends |
| --- | --- | --- |
| youtube | templates/youtube_intro.txt | templates/youtube_outro.txt |
| podcast | templates/podcast_intro.txt | (none) |
| shortform | templates/shortform_hook.txt | (none) |

Edit the files in templates/ to customize the intro/outro text.
Segments are cached by a hash of: text content + voice ID + language.

- Unchanged segments are skipped on rebuild, so iteration is fast
- Modified segments are re-rendered automatically
- Use --force to re-render everything
- The cache manifest is stored in segments/.cache.json
Voice.ai supports 11 languages. Use --language <code> to switch: en, es, fr, de, it, pt, pl, ru, nl, sv, ca The pipeline auto-selects the multilingual TTS model for non-English languages.
| Issue | Solution |
| --- | --- |
| ffmpeg missing | The pipeline still works: you get segments, the review page, chapters, and captions. Install ffmpeg for master stitching and video muxing. |
| Rate limits (429) | Segments render sequentially, which stays under most limits. Wait and retry. |
| Insufficient credits (402) | Top up at voice.ai/dashboard. Cached segments won't re-use credits on retry. |
| Long scripts | Caching makes rebuilds fast. Text over 490 chars per segment is automatically split across API calls. |
| Windows paths | Wrap paths with spaces in quotes: --input "C:\My Scripts\script.md" |

See references/TROUBLESHOOTING.md for more.
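The automatic splitting of long segment text mentioned above can be sketched as sentence-boundary chunking under a character budget. This is an illustration only, not the CLI's actual chunker:

```javascript
// Sketch: split text at sentence boundaries so each chunk stays under a
// character limit (490 here, matching the troubleshooting note). A single
// sentence longer than the limit is kept whole in this simplified version.
function chunkText(text, maxChars = 490) {
  const sentences = text.match(/[^.!?]+[.!?]*\s*/g) || [text];
  const chunks = [];
  let current = '';
  for (const s of sentences) {
    if (current && (current + s).length > maxChars) {
      chunks.push(current.trim()); // budget exceeded: flush the current chunk
      current = '';
    }
    current += s;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```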
- Agent Skills Specification
- Voice.ai
- references/VOICEAI_API.md - API endpoints, audio formats, models
- references/TROUBLESHOOTING.md - Common issues and fixes