{
  "schemaVersion": "1.0",
  "item": {
    "slug": "speech-to-text-transcription",
    "name": "Speech to Text Transcription",
    "source": "tencent",
    "type": "skill",
    "category": "AI",
    "sourceUrl": "https://clawhub.ai/ivangdavila/speech-to-text-transcription",
    "canonicalUrl": "https://clawhub.ai/ivangdavila/speech-to-text-transcription",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/speech-to-text-transcription",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=speech-to-text-transcription",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "memory-template.md",
      "setup.md"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/speech-to-text-transcription"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/speech-to-text-transcription",
    "agentPageUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent",
    "manifestUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "Setup",
        "body": "On first use, read setup.md and start helping with transcription needs."
      },
      {
        "title": "When to Use",
        "body": "User has audio or video files that need transcription. Agent handles local files, URLs, voice memos, podcasts, interviews, meetings, and lectures."
      },
      {
        "title": "Architecture",
        "body": "Memory lives in ~/speech-to-text-transcription/. See memory-template.md for structure.\n\n~/speech-to-text-transcription/\n├── memory.md        # Provider preferences, defaults\n├── transcripts/     # Saved transcriptions\n└── temp/            # Processing workspace"
      },
      {
        "title": "Quick Reference",
        "body": "Topic\tFile\nSetup process\tsetup.md\nMemory template\tmemory-template.md"
      },
      {
        "title": "1. Detect File Type First",
        "body": "Before transcription, identify the input:\n\nLocal file path → verify it exists, check format\nURL → download to temp, then process\nMeeting recording → likely needs speaker diarization\nVoice memo → usually single speaker, shorter"
      },
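      {
        "title": "Probe the Input (Sketch)",
        "body": "A minimal bash sketch of rule 1, assuming curl and ffprobe (installed with ffmpeg) are on PATH. The script argument and the temp path are placeholders.\n\nINPUT=\"$1\"\nTMP=\"$HOME/speech-to-text-transcription/temp\"\n\n# URL: download into the temp workspace first\ncase \"$INPUT\" in\n  http://*|https://*)\n    mkdir -p \"$TMP\"\n    curl -fL -o \"$TMP/input\" \"$INPUT\" || exit 1\n    INPUT=\"$TMP/input\" ;;\nesac\n\n# Local file: verify it exists, then print container format and duration (seconds)\n[ -f \"$INPUT\" ] || { echo \"not found: $INPUT\" >&2; exit 1; }\nffprobe -v error -show_entries format=format_name,duration \\\n  -of default=noprint_wrappers=1 \"$INPUT\"\n\nWhether a file is a meeting or a voice memo still has to come from the user or the file's context; ffprobe only reports technical properties."
      },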
      {
        "title": "2. Choose Provider Based on Context",
        "body": "Scenario\tBest Provider\tWhy\nQuick local transcription\tWhisper (local)\tNo API key, free, private\nHigh accuracy needed\tOpenAI Whisper API\tLarge hosted model, no local GPU needed\nSpeaker identification\tAssemblyAI\tNative diarization\nReal-time/streaming\tDeepgram\tLow latency\nLong content (>2 hours)\tSplit + batch\tAvoid timeouts"
      },
      {
        "title": "3. Handle Long Audio",
        "body": "Files over 25 MB (the OpenAI Whisper API upload limit) or longer than 2 hours:\n\nSplit into chunks with ffmpeg\nTranscribe each chunk\nMerge the chunk transcripts, shifting each chunk's timestamps by its start offset\nNever upload a large file in a single request"
      },
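      {
        "title": "Long Audio Check (Sketch)",
        "body": "A bash sketch of the size and duration test in rule 3, assuming ffprobe is available. FILE is a placeholder, and 25000000 bytes is used as a conservative reading of 25 MB.\n\nFILE=long.mp3\nsize=$(wc -c < \"$FILE\" | tr -d ' ')\ndur=$(ffprobe -v error -show_entries format=duration \\\n  -of default=noprint_wrappers=1:nokey=1 \"$FILE\")\n\n# awk handles the floating-point duration; d + 0 treats N/A as 0\nif [ \"$size\" -gt 25000000 ] || awk -v d=\"$dur\" 'BEGIN { exit !(d + 0 > 7200) }'; then\n  # Too large for one request: split into 10-minute chunks\n  ffmpeg -i \"$FILE\" -f segment -segment_time 600 -c copy chunk_%03d.mp3\nfi"
      },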
      {
        "title": "4. Preserve Context",
        "body": "After transcription:\n\nAsk if user wants the transcript saved\nSuggest filename based on content\nOffer to extract action items or summary"
      },
      {
        "title": "5. Output Formats",
        "body": "Default to plain text. Offer alternatives:\n\n.txt — clean text, no timestamps\n.srt / .vtt — subtitles with timing\n.json — structured, with segment-level (optionally word-level) timing\n.md — formatted with speaker labels"
      },
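      {
        "title": "Output Formats with Local Whisper (Sketch)",
        "body": "With local Whisper, all of the formats above except .md can come from a single run. A sketch; the output directory follows the Architecture layout and the model choice is arbitrary:\n\nwhisper audio.mp3 --model small --output_format all \\\n  --output_dir ~/speech-to-text-transcription/transcripts\n# writes audio.txt, audio.srt, audio.vtt, audio.tsv and audio.json\n\nThe .json file carries segment-level timing by default; adding --word_timestamps True includes per-word timing as well. The .md variant with speaker labels needs a diarizing provider such as AssemblyAI."
      },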
      {
        "title": "Common Traps",
        "body": "Assuming one provider works for all → Whisper has no speaker diarization; AssemblyAI needs an API key\nUploading huge files directly → Timeouts and memory errors. Split first.\nIgnoring audio quality → Noisy audio needs preprocessing (ffmpeg noise reduction)\nNot checking language → Whisper detects the language from the first 30 seconds and can fail on mixed-language content; pass --language when the language is known\nLosing speaker context → Multi-speaker content without diarization becomes unusable"
      },
      {
        "title": "Requirements",
        "body": "Required: ffmpeg (for audio processing)\n\nOptional API keys (only if using cloud providers):\n\nOPENAI_API_KEY — for OpenAI Whisper API\nASSEMBLYAI_API_KEY — for AssemblyAI (speaker diarization)\nDEEPGRAM_API_KEY — for Deepgram (real-time)\n\nLocal Whisper works without any API keys."
      },
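      {
        "title": "Requirements Check (Sketch)",
        "body": "A bash preflight sketch for the requirements above. It only reports what is missing and installs nothing; ${!k} is bash indirect expansion.\n\ncommand -v ffmpeg >/dev/null || echo \"missing: ffmpeg (required)\"\ncommand -v whisper >/dev/null || echo \"missing: whisper (optional, pip install openai-whisper)\"\nfor k in OPENAI_API_KEY ASSEMBLYAI_API_KEY DEEPGRAM_API_KEY; do\n  if [ -n \"${!k}\" ]; then echo \"$k is set\"; else echo \"$k is not set (needed only for that provider)\"; fi\ndone"
      },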
      {
        "title": "Local Whisper (No API Key)",
        "body": "# Install\npip install openai-whisper\n\n# Basic transcription\nwhisper audio.mp3 --model base --output_format txt\n\n# With timestamps\nwhisper audio.mp3 --model medium --output_format srt\n\nModels: tiny (fast) → base → small → medium → large (accurate)"
      },
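      {
        "title": "Transcribe Chunks Locally (Sketch)",
        "body": "For long audio split with the segment command under Split Long Audio, a bash sketch that runs local Whisper on each chunk in order. The zero-padded chunk_%03d names sort correctly as plain strings, so the glob preserves order.\n\nfor f in chunk_*.mp3; do\n  whisper \"$f\" --model base --output_format srt --output_dir .\ndone\n# produces chunk_000.srt, chunk_001.srt and so on, each with timestamps starting at 00:00:00"
      },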
      {
        "title": "OpenAI Whisper API",
        "body": "curl -X POST https://api.openai.com/v1/audio/transcriptions \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -H \"Content-Type: multipart/form-data\" \\\n  -F file=\"@audio.mp3\" \\\n  -F model=\"whisper-1\""
      },
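      {
        "title": "OpenAI Whisper API Subtitles (Sketch)",
        "body": "The transcription endpoint also accepts a response_format field (json, text, srt, verbose_json, vtt), so subtitles can come back directly. A sketch reusing the file and key from the request above:\n\ncurl -sS https://api.openai.com/v1/audio/transcriptions \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -F file=\"@audio.mp3\" \\\n  -F model=\"whisper-1\" \\\n  -F response_format=\"srt\" \\\n  -o audio.srt\n\nThe file must still be under the 25 MB upload limit; split larger files first."
      },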
      {
        "title": "AssemblyAI (Speaker Diarization)",
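        "body": "# Upload (the response contains upload_url)\ncurl -X POST https://api.assemblyai.com/v2/upload \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\" \\\n  --data-binary @audio.mp3\n\n# Transcribe with speakers (use the upload_url from the previous response)\ncurl -X POST https://api.assemblyai.com/v2/transcript \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"audio_url\": \"UPLOAD_URL\", \"speaker_labels\": true}'\n\n# Poll with the id from the previous response until status is completed\ncurl https://api.assemblyai.com/v2/transcript/TRANSCRIPT_ID \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\""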
      },
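      {
        "title": "AssemblyAI End to End (Sketch)",
        "body": "A bash sketch of the full upload, submit and poll flow, assuming jq is installed (it is not among the listed requirements). Field names (upload_url, id, status, utterances) follow AssemblyAI's v2 REST API; check them against the current docs.\n\nAPI=https://api.assemblyai.com/v2\nAUTH=\"Authorization: $ASSEMBLYAI_API_KEY\"\n\n# 1. Upload the local file; the response carries upload_url\nupload_url=$(curl -sS -X POST \"$API/upload\" -H \"$AUTH\" \\\n  --data-binary @audio.mp3 | jq -r .upload_url)\n\n# 2. Submit a transcript job with speaker labels\nbody=$(jq -n --arg u \"$upload_url\" '{audio_url: $u, speaker_labels: true}')\nid=$(curl -sS -X POST \"$API/transcript\" -H \"$AUTH\" \\\n  -H \"Content-Type: application/json\" -d \"$body\" | jq -r .id)\n\n# 3. Poll until the job is completed or has failed\nwhile :; do\n  result=$(curl -sS \"$API/transcript/$id\" -H \"$AUTH\")\n  status=$(printf '%s' \"$result\" | jq -r .status)\n  if [ \"$status\" = completed ] || [ \"$status\" = error ]; then break; fi\n  sleep 5\ndone\n\n# Print one line per speaker turn\nprintf '%s' \"$result\" | jq -r '.utterances[]? | \"Speaker \\(.speaker): \\(.text)\"'"
      },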
      {
        "title": "Extract Audio from Video",
        "body": "ffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav"
      },
      {
        "title": "Reduce Noise",
        "body": "ffmpeg -i noisy.wav -af \"afftdn=nf=-25\" clean.wav"
      },
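      {
        "title": "Extract and Clean in One Pass (Sketch)",
        "body": "The two preprocessing commands above can be chained into a single ffmpeg run. A sketch reusing the same filter setting, which is not tuned for any particular recording:\n\nffmpeg -i video.mp4 -vn -af \"afftdn=nf=-25\" \\\n  -acodec pcm_s16le -ar 16000 -ac 1 clean.wav"
      },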
      {
        "title": "Split Long Audio",
        "body": "# Split into 10-minute chunks\nffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3"
      },
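      {
        "title": "Merge Chunk Transcripts (Sketch)",
        "body": "A bash and awk sketch of the merge step in rule 3, assuming each chunk_NNN.mp3 has a matching chunk_NNN.srt whose timestamps start at zero. Because -c copy cuts on packet boundaries, chunks are not exactly 600 s long, so the offset is accumulated from each chunk's measured duration; cue numbers are rewritten to stay sequential.\n\noffset=0; n=0\n: > merged.srt\nfor mp3 in chunk_*.mp3; do\n  srt=\"${mp3%.mp3}.srt\"\n  # Shift every timestamp in this chunk by the running offset and renumber cues\n  awk -v off=\"$offset\" -v n=\"$n\" '\n    function ms(t, p) { split(t, p, /[:,]/); return ((p[1] * 60 + p[2]) * 60 + p[3]) * 1000 + p[4] }\n    function fmt(x) { return sprintf(\"%02d:%02d:%02d,%03d\", int(x / 3600000), int(x / 60000) % 60, int(x / 1000) % 60, x % 1000) }\n    BEGIN { o = int(off * 1000 + 0.5) }\n    /-->/ { print fmt(ms($1) + o) \" --> \" fmt(ms($3) + o); prev = $0; next }\n    /^[0-9]+$/ && (NR == 1 || prev == \"\") { print ++n; prev = $0; next }\n    { print; prev = $0 }\n  ' \"$srt\" >> merged.srt\n  n=$((n + $(grep -c -- '-->' \"$srt\")))\n  # Advance the offset by this chunk's real duration in seconds\n  d=$(ffprobe -v error -show_entries format=duration \\\n    -of default=noprint_wrappers=1:nokey=1 \"$mp3\")\n  offset=$(awk -v a=\"$offset\" -v b=\"$d\" 'BEGIN { printf \"%.3f\", a + b }')\ndone"
      },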
      {
        "title": "Security & Privacy",
        "body": "Data that stays local:\n\nTranscripts in ~/speech-to-text-transcription/transcripts/\nLocal Whisper processes entirely on-device\n\nData that leaves your machine (if using APIs):\n\nAudio file sent to chosen provider (OpenAI, AssemblyAI, Deepgram)\nTranscript returned and stored locally\n\nThis skill does NOT:\n\nStore API keys in plain text (use environment variables)\nAuto-upload without confirmation\n\nRetention of uploaded audio on provider servers is governed by each provider's data policy, not by this skill."
      },
      {
        "title": "External Endpoints",
        "body": "Endpoint\tData Sent\tPurpose\napi.openai.com/v1/audio\tAudio file\tWhisper API transcription\napi.assemblyai.com/v2\tAudio file\tAssemblyAI transcription\napi.deepgram.com/v1\tAudio stream\tDeepgram transcription\n\nThese endpoints are called only when the user explicitly chooses a cloud provider. Local Whisper sends nothing."
      },
      {
        "title": "Trust",
        "body": "By using cloud transcription providers, audio data is sent to OpenAI, AssemblyAI, or Deepgram. Only install if you trust these services with your audio. For sensitive content, use local Whisper."
      },
      {
        "title": "Related Skills",
        "body": "Install with clawhub install <slug> if the user confirms:\n\naudio — General audio processing\nffmpeg — Video and audio conversion\npodcast — Podcast creation and editing"
      },
      {
        "title": "Feedback",
        "body": "If useful: clawhub star speech-to-text-transcription\nStay updated: clawhub sync"
      }
    ],
    "body": "Setup\n\nOn first use, read setup.md and start helping with transcription needs.\n\nWhen to Use\n\nUser has audio or video files that need transcription. Agent handles local files, URLs, voice memos, podcasts, interviews, meetings, and lectures.\n\nArchitecture\n\nMemory lives in ~/speech-to-text-transcription/. See memory-template.md for structure.\n\n~/speech-to-text-transcription/\n├── memory.md        # Provider preferences, defaults\n├── transcripts/     # Saved transcriptions\n└── temp/            # Processing workspace\n\nQuick Reference\nTopic\tFile\nSetup process\tsetup.md\nMemory template\tmemory-template.md\n\nCore Rules\n\n1. Detect File Type First\n\nBefore transcription, identify the input:\n\nLocal file path → verify it exists, check format\nURL → download to temp, then process\nMeeting recording → likely needs speaker diarization\nVoice memo → usually single speaker, shorter\n\n2. Choose Provider Based on Context\nScenario\tBest Provider\tWhy\nQuick local transcription\tWhisper (local)\tNo API key, free, private\nHigh accuracy needed\tOpenAI Whisper API\tLarge hosted model, no local GPU needed\nSpeaker identification\tAssemblyAI\tNative diarization\nReal-time/streaming\tDeepgram\tLow latency\nLong content (>2 hours)\tSplit + batch\tAvoid timeouts\n\n3. Handle Long Audio\n\nFiles over 25 MB (the OpenAI Whisper API upload limit) or longer than 2 hours:\n\nSplit into chunks with ffmpeg\nTranscribe each chunk\nMerge the chunk transcripts, shifting each chunk's timestamps by its start offset\nNever upload a large file in a single request\n\n4. Preserve Context\n\nAfter transcription:\n\nAsk if user wants the transcript saved\nSuggest filename based on content\nOffer to extract action items or summary\n\n5. Output Formats\n\nDefault to plain text. Offer alternatives:\n\n.txt — clean text, no timestamps\n.srt / .vtt — subtitles with timing\n.json — structured, with segment-level (optionally word-level) timing\n.md — formatted with speaker labels\n\nCommon Traps\n\nAssuming one provider works for all → Whisper has no speaker diarization; AssemblyAI needs an API key\nUploading huge files directly → Timeouts and memory errors. Split first.\nIgnoring audio quality → Noisy audio needs preprocessing (ffmpeg noise reduction)\nNot checking language → Whisper detects the language from the first 30 seconds and can fail on mixed-language content; pass --language when the language is known\nLosing speaker context → Multi-speaker content without diarization becomes unusable\n\nRequirements\n\nRequired: ffmpeg (for audio processing)\n\nOptional API keys (only if using cloud providers):\n\nOPENAI_API_KEY — for OpenAI Whisper API\nASSEMBLYAI_API_KEY — for AssemblyAI (speaker diarization)\nDEEPGRAM_API_KEY — for Deepgram (real-time)\n\nLocal Whisper works without any API keys.\n\nProvider Quick Reference\n\nLocal Whisper (No API Key)\n# Install\npip install openai-whisper\n\n# Basic transcription\nwhisper audio.mp3 --model base --output_format txt\n\n# With timestamps\nwhisper audio.mp3 --model medium --output_format srt\n\nModels: tiny (fast) → base → small → medium → large (accurate)\n\nOpenAI Whisper API\ncurl -X POST https://api.openai.com/v1/audio/transcriptions \\\n  -H \"Authorization: Bearer $OPENAI_API_KEY\" \\\n  -H \"Content-Type: multipart/form-data\" \\\n  -F file=\"@audio.mp3\" \\\n  -F model=\"whisper-1\"\n\nAssemblyAI (Speaker Diarization)\n# Upload (the response contains upload_url)\ncurl -X POST https://api.assemblyai.com/v2/upload \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\" \\\n  --data-binary @audio.mp3\n\n# Transcribe with speakers (use the upload_url from the previous response)\ncurl -X POST https://api.assemblyai.com/v2/transcript \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"audio_url\": \"UPLOAD_URL\", \"speaker_labels\": true}'\n\n# Poll with the id from the previous response until status is completed\ncurl https://api.assemblyai.com/v2/transcript/TRANSCRIPT_ID \\\n  -H \"Authorization: $ASSEMBLYAI_API_KEY\"\n\nAudio Preprocessing\n\nExtract Audio from Video\nffmpeg -i video.mp4 -vn -acodec pcm_s16le -ar 16000 -ac 1 audio.wav\n\nReduce Noise\nffmpeg -i noisy.wav -af \"afftdn=nf=-25\" clean.wav\n\nSplit Long Audio\n# Split into 10-minute chunks\nffmpeg -i long.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3\n\nSecurity & Privacy\n\nData that stays local:\n\nTranscripts in ~/speech-to-text-transcription/transcripts/\nLocal Whisper processes entirely on-device\n\nData that leaves your machine (if using APIs):\n\nAudio file sent to chosen provider (OpenAI, AssemblyAI, Deepgram)\nTranscript returned and stored locally\n\nThis skill does NOT:\n\nStore API keys in plain text (use environment variables)\nAuto-upload without confirmation\n\nRetention of uploaded audio on provider servers is governed by each provider's data policy, not by this skill.\n\nExternal Endpoints\nEndpoint\tData Sent\tPurpose\napi.openai.com/v1/audio\tAudio file\tWhisper API transcription\napi.assemblyai.com/v2\tAudio file\tAssemblyAI transcription\napi.deepgram.com/v1\tAudio stream\tDeepgram transcription\n\nThese endpoints are called only when the user explicitly chooses a cloud provider. Local Whisper sends nothing.\n\nTrust\n\nBy using cloud transcription providers, audio data is sent to OpenAI, AssemblyAI, or Deepgram. Only install if you trust these services with your audio. For sensitive content, use local Whisper.\n\nRelated Skills\n\nInstall with clawhub install <slug> if the user confirms:\n\naudio — General audio processing\nffmpeg — Video and audio conversion\npodcast — Podcast creation and editing\n\nFeedback\n\nIf useful: clawhub star speech-to-text-transcription\nStay updated: clawhub sync"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/ivangdavila/speech-to-text-transcription",
    "publisherUrl": "https://clawhub.ai/ivangdavila/speech-to-text-transcription",
    "owner": "ivangdavila",
    "version": "1.0.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/speech-to-text-transcription",
    "downloadUrl": "https://openagent3.xyz/downloads/speech-to-text-transcription",
    "agentUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent",
    "manifestUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/speech-to-text-transcription/agent.md"
  }
}