{
  "schemaVersion": "1.0",
  "item": {
    "slug": "mineru-pdf",
    "name": "Mineru Pdf",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/Etoile04/mineru-pdf",
    "canonicalUrl": "https://clawhub.ai/Etoile04/mineru-pdf",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadMode": "redirect",
    "downloadUrl": "/downloads/mineru-pdf",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=mineru-pdf",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "installMethod": "Manual import",
    "extraction": "Extract archive",
    "prerequisites": [
      "OpenClaw"
    ],
    "packageFormat": "ZIP package",
    "includedAssets": [
      "SKILL.md",
      "parse.py",
      "test.sh"
    ],
    "primaryDoc": "SKILL.md",
    "quickSetup": [
      "Download the package from Yavira.",
      "Extract the archive and review SKILL.md first.",
      "Import or place the package into your OpenClaw setup."
    ],
    "agentAssist": {
      "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
      "steps": [
        "Download the package from Yavira.",
        "Extract it into a folder your agent can access.",
        "Paste one of the prompts below and point your agent at the extracted folder."
      ],
      "prompts": [
        {
          "label": "New install",
          "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
        },
        {
          "label": "Upgrade existing",
          "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
        }
      ]
    },
    "sourceHealth": {
      "source": "tencent",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-04-30T16:55:25.780Z",
      "expiresAt": "2026-05-07T16:55:25.780Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=network",
        "contentDisposition": "attachment; filename=\"network-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null
      },
      "scope": "source",
      "summary": "Source download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this source.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/mineru-pdf"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    },
    "downloadPageUrl": "https://openagent3.xyz/downloads/mineru-pdf",
    "agentPageUrl": "https://openagent3.xyz/skills/mineru-pdf/agent",
    "manifestUrl": "https://openagent3.xyz/skills/mineru-pdf/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/mineru-pdf/agent.md"
  },
  "agentAssist": {
    "summary": "Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.",
    "steps": [
      "Download the package from Yavira.",
      "Extract it into a folder your agent can access.",
      "Paste one of the prompts below and point your agent at the extracted folder."
    ],
    "prompts": [
      {
        "label": "New install",
        "body": "I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete."
      },
      {
        "label": "Upgrade existing",
        "body": "I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run."
      }
    ]
  },
  "documentation": {
    "source": "clawhub",
    "primaryDoc": "SKILL.md",
    "sections": [
      {
        "title": "MinerU PDF Parser",
        "body": "Parse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon."
      },
      {
        "title": "Option 1: Install MinerU MCP (for Claude Code)",
        "body": "claude mcp add --transport stdio --scope user mineru -- \\\n  uvx --from mcp-mineru python -m mcp_mineru.server\n\nThis installs and configures MinerU for all Claude projects. Models are downloaded on first use."
      },
      {
        "title": "Option 2: Use Direct Tool (preserves files)",
        "body": "The skill includes a direct parsing tool that saves output to a persistent directory:\n\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]\n\nAdvantages:\n\n✅ Files are saved permanently (not auto-deleted)\n✅ Full control over output location\n✅ No MCP overhead\n✅ Works with any Python environment that has MinerU"
      },
      {
        "title": "Method 1: Using the Direct Tool (Recommended)",
        "body": "# Parse entire PDF\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\"\n\n# Parse specific pages\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --start-page 0 --end-page 2\n\n# Use Apple Silicon optimization\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --backend vlm-mlx-engine\n\n# Text only (faster)\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --no-table --no-formula"
      },
      {
        "title": "Parse a PDF document",
        "body": "uvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'formula_enable': True,\n            'table_enable': True,\n            'start_page': 0,\n            'end_page': -1  # -1 for all pages\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\""
      },
      {
        "title": "Check system capabilities",
        "body": "uvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def list_backends():\n    result = await call_tool(\n        name='list_backends',\n        arguments={}\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(list_backends())\n\""
      },
      {
        "title": "parse_pdf",
        "body": "Required:\n\nfile_path - Absolute path to the PDF file\n\nOptional:\n\nbackend - Processing backend (default: pipeline)\n\npipeline - Fast, general-purpose (recommended)\nvlm-mlx-engine - Fastest on Apple Silicon (M1/M2/M3/M4)\nvlm-transformers - Slowest but most accurate\n\n\nformula_enable - Enable formula recognition (default: true)\ntable_enable - Enable table recognition (default: true)\nstart_page - Starting page (0-indexed, default: 0)\nend_page - Ending page (default: -1 for all pages)"
      },
      {
        "title": "list_backends",
        "body": "No parameters required. Returns system information and backend recommendations."
      },
      {
        "title": "Extract tables from a specific page range",
        "body": "uvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'table_enable': True,\n            'start_page': 5,\n            'end_page': 10\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\""
      },
      {
        "title": "Parse with formula recognition only (faster)",
        "body": "uvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'vlm-mlx-engine',\n            'formula_enable': True,\n            'table_enable': False  # Disable for speed\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\""
      },
      {
        "title": "Parse single page (fastest for testing)",
        "body": "uvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'formula_enable': False,\n            'table_enable': False,\n            'start_page': 0,\n            'end_page': 0\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\""
      },
      {
        "title": "Performance",
        "body": "On Apple Silicon M4 (16GB RAM):\n\npipeline: ~32s/page, CPU-only, good quality\nvlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent quality\nvlm-transformers: ~148s/page, highest quality, slowest\n\nNote: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs."
      },
      {
        "title": "Output Format",
        "body": "Returns structured Markdown with:\n\nDocument metadata (file, backend, pages, settings)\nExtracted text with preserved structure\nTables formatted as Markdown tables\nFormulas converted to LaTeX"
      },
      {
        "title": "Supported Formats",
        "body": "PDF documents (.pdf)\nJPEG images (.jpg, .jpeg)\nPNG images (.png)\nOther image formats (WebP, GIF, etc.)"
      },
      {
        "title": "Module not found error",
        "body": "If you get \"No module named 'mcp_mineru'\", make sure you installed it:\n\nclaude mcp add --transport stdio --scope user mineru -- \\\n  uvx --from mcp-mineru python -m mcp_mineru.server"
      },
      {
        "title": "Slow processing on first run",
        "body": "This is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster."
      },
      {
        "title": "Timeout errors",
        "body": "Increase timeout for large documents or use smaller page ranges for testing."
      },
      {
        "title": "Notes",
        "body": "Output is returned as Markdown text\nTables are preserved in Markdown format\nMathematical formulas are converted to LaTeX\nWorks with scanned documents (OCR built-in)\nOptimized for Apple Silicon (M1/M2/M3/M4) with MLX backend"
      },
      {
        "title": "Why Files Get Deleted (MCP Method)",
        "body": "The MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating."
      },
      {
        "title": "How to Preserve Files",
        "body": "Method A: Use the Direct Tool (Recommended)\n\nThe skill provides parse.py which saves files to a persistent directory:\n\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  /path/to/input.pdf \\\n  /path/to/output_dir\n\nAdvantages:\n\n✅ Files are never auto-deleted\n✅ Full control over output location\n✅ Can be used in batch processing\n✅ No MCP connection needed\n\nGenerated Structure:\n\n/path/to/output_dir/\n├── input.pdf_name/\n│   └── auto/          # or vlm/ depending on backend\n│       ├── input.pdf_name.md\n│       └── images/\n│           └── *.jpg\n└── input.pdf_name_parsed.md  # Copy at root for easy access\n\nMethod B: Redirect MCP Output\n\nIf using the MCP method, capture the output and save it:\n\n# Capture to file\nclaude -p \"Parse this PDF: /path/to/file.pdf\" > /tmp/output.md\n\n# Or use within a script that saves the result"
      },
      {
        "title": "Comparison",
        "body": "FeatureDirect ToolMCP MethodFiles persisted✅ Yes❌ No (auto-deleted)Custom output dir✅ Yes❌ No (temp only)Claude Code integration⚠️ Manual✅ NativeSpeed✅ Fast⚠️ MCP overheadOffline use✅ Yes⚠️ Needs Claude Code"
      },
      {
        "title": "Recommendation",
        "body": "Use Direct Tool when you need to keep the files for later use\nUse MCP Method when working within Claude Code and only need the text content"
      }
    ],
    "body": "MinerU PDF Parser\n\nParse PDF documents using MinerU MCP to extract structured content including text, tables, and formulas with MLX acceleration on Apple Silicon.\n\nInstallation\nOption 1: Install MinerU MCP (for Claude Code)\nclaude mcp add --transport stdio --scope user mineru -- \\\n  uvx --from mcp-mineru python -m mcp_mineru.server\n\n\nThis installs and configures MinerU for all Claude projects. Models are downloaded on first use.\n\nOption 2: Use Direct Tool (preserves files)\n\nThe skill includes a direct parsing tool that saves output to a persistent directory:\n\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py <pdf_path> <output_dir> [options]\n\n\nAdvantages:\n\n✅ Files are saved permanently (not auto-deleted)\n✅ Full control over output location\n✅ No MCP overhead\n✅ Works with any Python environment that has MinerU\nQuick Start\nMethod 1: Using the Direct Tool (Recommended)\n# Parse entire PDF\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\"\n\n# Parse specific pages\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --start-page 0 --end-page 2\n\n# Use Apple Silicon optimization\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --backend vlm-mlx-engine\n\n# Text only (faster)\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  \"/path/to/document.pdf\" \\\n  \"/path/to/output\" \\\n  --no-table --no-formula\n\nMethod 2: Using MinerU MCP (Temporary Files)\nParse a PDF document\nuvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'formula_enable': True,\n            'table_enable': True,\n            'start_page': 0,\n            'end_page': -1  # -1 for all pages\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\"\n\nCheck system capabilities\nuvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def list_backends():\n    result = await call_tool(\n        name='list_backends',\n        arguments={}\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(list_backends())\n\"\n\nParameters\nparse_pdf\n\nRequired:\n\nfile_path - Absolute path to the PDF file\n\nOptional:\n\nbackend - Processing backend (default: pipeline)\npipeline - Fast, general-purpose (recommended)\nvlm-mlx-engine - Fastest on Apple Silicon (M1/M2/M3/M4)\nvlm-transformers - Slowest but most accurate\nformula_enable - Enable formula recognition (default: true)\ntable_enable - Enable table recognition (default: true)\nstart_page - Starting page (0-indexed, default: 0)\nend_page - Ending page (default: -1 for all pages)\nlist_backends\n\nNo parameters required. Returns system information and backend recommendations.\n\nUsage Examples\nExtract tables from a specific page range\nuvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'table_enable': True,\n            'start_page': 5,\n            'end_page': 10\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\"\n\nParse with formula recognition only (faster)\nuvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'vlm-mlx-engine',\n            'formula_enable': True,\n            'table_enable': False  # Disable for speed\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\"\n\nParse single page (fastest for testing)\nuvx --from mcp-mineru python -c \"\nimport asyncio\nfrom mcp_mineru.server import call_tool\n\nasync def parse_pdf():\n    result = await call_tool(\n        name='parse_pdf',\n        arguments={\n            'file_path': '/path/to/document.pdf',\n            'backend': 'pipeline',\n            'formula_enable': False,\n            'table_enable': False,\n            'start_page': 0,\n            'end_page': 0\n        }\n    )\n    if hasattr(result, 'content'):\n        for item in result.content:\n            if hasattr(item, 'text'):\n                print(item.text)\n                break\n\nasyncio.run(parse_pdf())\n\"\n\nPerformance\n\nOn Apple Silicon M4 (16GB RAM):\n\npipeline: ~32s/page, CPU-only, good quality\nvlm-mlx-engine: ~38s/page, Apple Silicon optimized, excellent quality\nvlm-transformers: ~148s/page, highest quality, slowest\n\nNote: First run downloads models (can take 5-10 minutes). Models are cached in ~/.cache/uv/ for faster subsequent runs.\n\nOutput Format\n\nReturns structured Markdown with:\n\nDocument metadata (file, backend, pages, settings)\nExtracted text with preserved structure\nTables formatted as Markdown tables\nFormulas converted to LaTeX\nSupported Formats\nPDF documents (.pdf)\nJPEG images (.jpg, .jpeg)\nPNG images (.png)\nOther image formats (WebP, GIF, etc.)\nTroubleshooting\nModule not found error\n\nIf you get \"No module named 'mcp_mineru'\", make sure you installed it:\n\nclaude mcp add --transport stdio --scope user mineru -- \\\n  uvx --from mcp-mineru python -m mcp_mineru.server\n\nSlow processing on first run\n\nThis is normal. MinerU downloads ML models on first use. Subsequent runs will be much faster.\n\nTimeout errors\n\nIncrease timeout for large documents or use smaller page ranges for testing.\n\nNotes\nOutput is returned as Markdown text\nTables are preserved in Markdown format\nMathematical formulas are converted to LaTeX\nWorks with scanned documents (OCR built-in)\nOptimized for Apple Silicon (M1/M2/M3/M4) with MLX backend\nFile Persistence\nWhy Files Get Deleted (MCP Method)\n\nThe MinerU MCP server uses Python's tempfile.TemporaryDirectory(), which automatically deletes files when the context exits. This is by design to prevent temporary files from accumulating.\n\nHow to Preserve Files\n\nMethod A: Use the Direct Tool (Recommended)\n\nThe skill provides parse.py which saves files to a persistent directory:\n\npython /Users/lwj04/clawd/skills/mineru-pdf/parse.py \\\n  /path/to/input.pdf \\\n  /path/to/output_dir\n\n\nAdvantages:\n\n✅ Files are never auto-deleted\n✅ Full control over output location\n✅ Can be used in batch processing\n✅ No MCP connection needed\n\nGenerated Structure:\n\n/path/to/output_dir/\n├── input.pdf_name/\n│   └── auto/          # or vlm/ depending on backend\n│       ├── input.pdf_name.md\n│       └── images/\n│           └── *.jpg\n└── input.pdf_name_parsed.md  # Copy at root for easy access\n\n\nMethod B: Redirect MCP Output\n\nIf using the MCP method, capture the output and save it:\n\n# Capture to file\nclaude -p \"Parse this PDF: /path/to/file.pdf\" > /tmp/output.md\n\n# Or use within a script that saves the result\n\nComparison\nFeature\tDirect Tool\tMCP Method\nFiles persisted\t✅ Yes\t❌ No (auto-deleted)\nCustom output dir\t✅ Yes\t❌ No (temp only)\nClaude Code integration\t⚠️ Manual\t✅ Native\nSpeed\t✅ Fast\t⚠️ MCP overhead\nOffline use\t✅ Yes\t⚠️ Needs Claude Code\nRecommendation\nUse Direct Tool when you need to keep the files for later use\nUse MCP Method when working within Claude Code and only need the text content"
  },
  "trust": {
    "sourceLabel": "tencent",
    "provenanceUrl": "https://clawhub.ai/Etoile04/mineru-pdf",
    "publisherUrl": "https://clawhub.ai/Etoile04/mineru-pdf",
    "owner": "Etoile04",
    "version": "1.0.0",
    "license": null,
    "verificationStatus": "Indexed source record"
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/mineru-pdf",
    "downloadUrl": "https://openagent3.xyz/downloads/mineru-pdf",
    "agentUrl": "https://openagent3.xyz/skills/mineru-pdf/agent",
    "manifestUrl": "https://openagent3.xyz/skills/mineru-pdf/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/mineru-pdf/agent.md"
  }
}