Requirements
- Target platform
- OpenClaw
- Install method
- Manual import
- Extraction
- Extract archive
- Prerequisites
- OpenClaw
- Primary doc
- SKILL.md
Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio trans...
Automatic Speech Recognition (ASR) using Zhipu AI (BigModel) GLM-ASR model. Use when you need to transcribe audio files to text. Supports Chinese audio trans...
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Transcribe Chinese audio files to text using Zhipu AI's GLM-ASR model.
1. Get your API Key: Get a key from Zhipu AI Console 2. Set it in your environment: export ZHIPU_API_KEY="your-key-here"
WAV - Recommended, best quality MP3 - Widely supported OGG - Auto-converted to MP3 M4A - Auto-converted to MP3 AAC - Auto-converted to MP3 FLAC - Auto-converted to MP3 WMA - Auto-converted to MP3 Note: The script automatically converts unsupported formats to MP3 using ffmpeg. Only WAV and MP3 are accepted by the API, but you can use any format that ffmpeg supports.
Maximum file size: 25 MB Maximum duration: 30 seconds Recommended sample rate: 16000 Hz or higher Audio channels: Mono or stereo
Transcribe an audio file with default settings: bash scripts/speech_to_text.sh recording.wav
Provide previous transcription or context for better accuracy: bash scripts/speech_to_text.sh recording.wav "这是之前的转录内容,有助于提高准确性"
Use custom vocabulary to improve recognition of specific terms: bash scripts/speech_to_text.sh recording.mp3 "" "人名,地名,专业术语,公司名称"
Combine context and hotwords: bash scripts/speech_to_text.sh recording.wav "会议记录片段" "张三,李四,项目名称" Parameters: audio_file (required): Path to audio file (.wav or .mp3) prompt (optional): Previous transcription or context text (max 8000 chars) hotwords (optional): Comma-separated list of specific terms (max 100 words)
Why use context prompts: Improves accuracy in long conversations Helps with domain-specific terminology Maintains consistency across multiple segments When to use: Multi-part conversations or meetings Technical or specialized content Continuing from previous transcriptions Example: bash scripts/speech_to_text.sh part2.wav "第一部分的转录内容:讨论了项目进展和下一步计划"
What are hotwords: Custom vocabulary list that boosts recognition accuracy for specific terms. Best use cases: Proper names (people, places) Domain-specific terminology Company names and products Technical jargon Industry-specific terms Examples: # Medical transcription bash scripts/speech_to_text.sh medical.wav "" "患者,症状,诊断,治疗方案" # Business meeting bash scripts/speech_to_text.sh meeting.wav "" "张经理,李总,项目代号,预算" # Tech discussion bash scripts/speech_to_text.sh tech.wav "" "API,数据库,算法,框架"
# Part 1 bash scripts/speech_to_text.sh meeting_part1.wav # Part 2 with context bash scripts/speech_to_text.sh meeting_part2.wav "第一部分讨论了项目进度" "张总,李经理,项目名称" # Part 3 with context bash scripts/speech_to_text.sh meeting_part3.wav "前两部分讨论了项目进度和预算" "张总,李经理,项目名称"
bash scripts/speech_to_text.sh lecture.wav "" "教授,课程名称,专业术语1,专业术语2"
for file in recording_*.wav; do bash scripts/speech_to_text.sh "$file" done
Best practices for accurate transcription: Clear audio source Minimize background noise Use good quality microphone Speak clearly and at moderate pace Optimal audio settings Sample rate: 16000 Hz or higher Bit depth: 16-bit or higher Single channel (mono) is sufficient File preparation Remove silence from beginning/end Normalize audio levels Ensure consistent volume
The script outputs JSON with: id: Task ID created: Request timestamp (Unix timestamp) request_id: Unique request identifier model: Model name used text: Transcribed text Example output: { "id": "task-12345", "created": 1234567890, "request_id": "req-abc123", "model": "glm-asr-2512", "text": "你好,这是转录的文本内容" }
File Size Issues: Split audio files larger than 25 MB Reduce sample rate or bit depth Use compression (MP3) for smaller files Duration Issues: Split recordings longer than 30 seconds Process segments separately Use context prompts to maintain continuity Poor Accuracy: Improve audio quality Use hotwords for specific terms Provide context prompts Ensure clear speech and minimal noise Format Issues: Ensure file is .wav or .mp3 Check file is not corrupted Verify audio can be played by standard players
Maximum audio duration: 30 seconds per request File size limit: 25 MB Maximum hotwords: 100 terms Context prompt limit: 8000 characters Best performance with Chinese language audio
Typical transcription time: 1-3 seconds Real-time or faster for most audio Processing time scales with audio quality and length
Agent frameworks, memory systems, reasoning layers, and model-native orchestration.
Largest current source with strong distribution and engagement signals.