Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Extract PDF content to Markdown using MinerU API. Supports formulas, tables, OCR. Provides both local file and online URL parsing methods.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Extract PDF documents to structured Markdown using the MinerU API. Supports formula recognition, table extraction, and OCR. Note: This is a community skill, not an official MinerU product. You need to obtain your own API key from MinerU.
```
mineru-pdf-extractor/
├── SKILL.md                                   # English documentation
├── SKILL_zh.md                                # Chinese documentation
├── docs/                                      # Documentation
│   ├── Local_File_Parsing_Guide.md            # Local PDF parsing detailed guide (English)
│   ├── Online_URL_Parsing_Guide.md            # Online PDF parsing detailed guide (English)
│   ├── MinerU_本地文档解析完整流程.md           # Local parsing complete guide (Chinese)
│   └── MinerU_在线文档解析完整流程.md           # Online parsing complete guide (Chinese)
└── scripts/                                   # Executable scripts
    ├── local_file_step1_apply_upload_url.sh   # Local parsing Step 1
    ├── local_file_step2_upload_file.sh        # Local parsing Step 2
    ├── local_file_step3_poll_result.sh        # Local parsing Step 3
    ├── local_file_step4_download.sh           # Local parsing Step 4
    ├── online_file_step1_submit_task.sh       # Online parsing Step 1
    └── online_file_step2_poll_result.sh       # Online parsing Step 2
```
Scripts automatically read the MinerU token from environment variables (choose one):

```bash
# Option 1: Set MINERU_TOKEN
export MINERU_TOKEN="your_api_token_here"

# Option 2: Set MINERU_API_KEY
export MINERU_API_KEY="your_api_token_here"
```
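The documented precedence (MINERU_TOKEN first, then MINERU_API_KEY) can be sketched as a small helper. `resolve_token` is a hypothetical name for illustration, not a function from the skill's scripts:

```bash
# Hypothetical sketch of the documented token precedence:
# prefer MINERU_TOKEN, fall back to MINERU_API_KEY, fail if neither is set.
resolve_token() {
  TOKEN="${MINERU_TOKEN:-${MINERU_API_KEY:-}}"
  if [ -z "$TOKEN" ]; then
    echo "Error: set MINERU_TOKEN or MINERU_API_KEY" >&2
    return 1
  fi
  printf '%s\n' "$TOKEN"
}
```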
- `curl` - for HTTP requests (usually pre-installed)
- `unzip` - for extracting results (usually pre-installed)
- `jq` - for enhanced JSON parsing and security (recommended but not required). If it is not installed, the scripts fall back to plain-text parsing. Install with `apt-get install jq` (Debian/Ubuntu) or `brew install jq` (macOS).
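A quick preflight for the tools listed above might look like the following sketch; `require` is a name I chose for illustration, not something shipped with the scripts:

```bash
# require <cmd>...: fail if any listed command is missing from PATH
require() {
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required tool: $cmd" >&2
      return 1
    fi
  done
}

# Usage for this skill's prerequisites:
#   require curl unzip
#   command -v jq >/dev/null 2>&1 || echo "jq not found; falling back to plain-text parsing"
```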
```bash
# Set API base URL (default is pre-configured)
export MINERU_BASE_URL="https://mineru.net/api/v4"
```

Tip: visit https://mineru.net/apiManage/docs to register and obtain an API key.
For locally stored PDF files. Requires 4 steps.
```bash
cd scripts/

# Step 1: Apply for an upload URL
./local_file_step1_apply_upload_url.sh /path/to/your.pdf
# Output: BATCH_ID=xxx UPLOAD_URL=xxx

# Step 2: Upload the file
./local_file_step2_upload_file.sh "$UPLOAD_URL" /path/to/your.pdf

# Step 3: Poll for results
./local_file_step3_poll_result.sh "$BATCH_ID"
# Output: FULL_ZIP_URL=xxx

# Step 4: Download the results
./local_file_step4_download.sh "$FULL_ZIP_URL" result.zip extracted/
```
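Step 1 prints its results as `KEY=value` lines, so a small helper can capture them into shell variables before steps 2-4. `extract_var` is my own illustrative name, not part of the skill's scripts:

```bash
# extract_var <name>: read KEY=value lines on stdin, print the value for <name>.
# -f2- keeps everything after the first '=', so values containing '=' survive.
extract_var() {
  grep "^$1=" | head -n1 | cut -d= -f2-
}

# Usage (assuming the step-1 script from this repo):
#   out=$(./local_file_step1_apply_upload_url.sh /path/to/your.pdf)
#   BATCH_ID=$(printf '%s\n' "$out" | extract_var BATCH_ID)
#   UPLOAD_URL=$(printf '%s\n' "$out" | extract_var UPLOAD_URL)
```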
**local_file_step1_apply_upload_url.sh** - Apply for an upload URL and batch_id.

```bash
./local_file_step1_apply_upload_url.sh <pdf_file_path> [language] [layout_model]
```

Parameters:
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect); default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate); default `doclayout_yolo`

Output:
```
BATCH_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
UPLOAD_URL=https://mineru.oss-cn-shanghai.aliyuncs.com/...
```

**local_file_step2_upload_file.sh** - Upload the PDF file to the presigned URL.

```bash
./local_file_step2_upload_file.sh <upload_url> <pdf_file_path>
```

**local_file_step3_poll_result.sh** - Poll extraction results until completion or failure.

```bash
./local_file_step3_poll_result.sh <batch_id> [max_retries] [retry_interval_seconds]
```

Output:
```
FULL_ZIP_URL=https://cdn-mineru.openxlab.org.cn/pdf/.../xxx.zip
```

**local_file_step4_download.sh** - Download the result ZIP and extract it.

```bash
./local_file_step4_download.sh <zip_url> [output_zip_filename] [extract_directory_name]
```

Output structure:
```
extracted/
├── full.md              # Markdown document (main result)
├── images/              # Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```
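The polling in step 3 (with `max_retries` and `retry_interval_seconds`) follows a generic retry pattern. A sketch of that pattern, with `poll_until` and the `check` hook as illustrative names rather than the script's actual internals:

```bash
# poll_until <max_retries> <interval_seconds> <check_cmd>
# Re-runs check_cmd until it succeeds (exit 0) or retries are exhausted.
poll_until() {
  max=$1; interval=$2; check=$3
  i=0
  while [ "$i" -lt "$max" ]; do
    if "$check"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```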
Complete guide: see docs/Local_File_Parsing_Guide.md
For PDF files already available online (e.g., arXiv, websites). Only two steps - simpler and faster.
```bash
cd scripts/

# Step 1: Submit a parsing task (provide the URL directly)
./online_file_step1_submit_task.sh "https://arxiv.org/pdf/2410.17247.pdf"
# Output: TASK_ID=xxx

# Step 2: Poll for results; download and extract automatically on completion
./online_file_step2_poll_result.sh "$TASK_ID" extracted/
```
**online_file_step1_submit_task.sh** - Submit a parsing task for an online PDF.

```bash
./online_file_step1_submit_task.sh <pdf_url> [language] [layout_model]
```

Parameters:
- `pdf_url`: complete URL of the online PDF (required)
- `language`: `ch` (Chinese), `en` (English), `auto` (auto-detect); default `ch`
- `layout_model`: `doclayout_yolo` (fast), `layoutlmv3` (accurate); default `doclayout_yolo`

Output:
```
TASK_ID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
```

**online_file_step2_poll_result.sh** - Poll extraction results; automatically download and extract on completion.

```bash
./online_file_step2_poll_result.sh <task_id> [output_directory] [max_retries] [retry_interval_seconds]
```

Output structure:
```
extracted/
├── full.md              # Markdown document (main result)
├── images/              # Extracted images
├── content_list.json    # Structured content
└── layout.json          # Layout analysis data
```
Complete guide: see docs/Online_URL_Parsing_Guide.md
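Both routes produce the same output layout, so a shared post-extraction sanity check can be handy. `verify_extraction` is a hypothetical helper based only on the structure documented above:

```bash
# verify_extraction <dir>: confirm the main Markdown result exists and is non-empty,
# then report rough counts of headings and extracted images.
verify_extraction() {
  dir=$1
  if [ ! -s "$dir/full.md" ]; then
    echo "extraction incomplete: $dir/full.md missing or empty" >&2
    return 1
  fi
  echo "headings: $(grep -c '^#' "$dir/full.md" || true)"
  echo "images:   $(ls "$dir/images" 2>/dev/null | wc -l)"
}
```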
| Feature | Local PDF Parsing | Online PDF Parsing |
|---------|-------------------|--------------------|
| Steps | 4 steps | 2 steps |
| Upload required | Yes | No |
| Average time | 30-60 seconds | 10-20 seconds |
| Use case | Local files | Files already online (arXiv, websites, etc.) |
| File size limit | 200MB | Limited by source server |
```bash
for pdf in /path/to/pdfs/*.pdf; do
  echo "Processing: $pdf"

  # Step 1
  result=$(./local_file_step1_apply_upload_url.sh "$pdf" 2>&1)
  batch_id=$(echo "$result" | grep BATCH_ID | cut -d= -f2-)
  upload_url=$(echo "$result" | grep UPLOAD_URL | cut -d= -f2-)

  # Step 2
  ./local_file_step2_upload_file.sh "$upload_url" "$pdf"

  # Step 3
  zip_url=$(./local_file_step3_poll_result.sh "$batch_id" | grep FULL_ZIP_URL | cut -d= -f2-)

  # Step 4
  filename=$(basename "$pdf" .pdf)
  ./local_file_step4_download.sh "$zip_url" "${filename}.zip" "${filename}_extracted"
done
```
```bash
for url in \
  "https://arxiv.org/pdf/2410.17247.pdf" \
  "https://arxiv.org/pdf/2409.12345.pdf"; do
  echo "Processing: $url"

  # Step 1
  result=$(./online_file_step1_submit_task.sh "$url" 2>&1)
  task_id=$(echo "$result" | grep TASK_ID | cut -d= -f2-)

  # Step 2
  filename=$(basename "$url" .pdf)
  ./online_file_step2_poll_result.sh "$task_id" "${filename}_extracted"
done
```
- Token configuration: scripts prioritize `MINERU_TOKEN` and fall back to `MINERU_API_KEY` if it is not set.
- Token security: do not hard-code tokens in scripts; use environment variables.
- URL accessibility: for online parsing, ensure the provided URL is publicly accessible.
- File limits: a single file should not exceed 200MB or 600 pages.
- Network stability: ensure a stable connection when uploading large files.
- Security: this skill includes input validation and sanitization to prevent JSON injection and directory traversal attacks.
- Optional jq: installing `jq` provides enhanced JSON parsing and additional security checks.
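The directory-traversal guard mentioned under Security might look roughly like this sketch; it illustrates the idea only, and `safe_filename` is not taken from the skill's scripts:

```bash
# safe_filename <name>: reject empty names, path separators, parent-directory
# components, and leading dashes (which could be misread as options).
safe_filename() {
  case "$1" in
    ""|*/*|*..*|-*)
      echo "rejected unsafe filename: $1" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$1"
}
```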
| Document | Description |
|----------|-------------|
| docs/Local_File_Parsing_Guide.md | Detailed curl commands and parameters for local PDF parsing |
| docs/Online_URL_Parsing_Guide.md | Detailed curl commands and parameters for online PDF parsing |

External resources:
- MinerU official site: https://mineru.net/
- API documentation: https://mineru.net/apiManage/docs
- GitHub repository: https://github.com/opendatalab/MinerU

Skill version: 1.0.0
Release date: 2026-02-18
Community skill - not affiliated with official MinerU.