Requirements
- Target platform: OpenClaw
- Install method: Manual import
- Extraction: Extract archive
- Prerequisites: OpenClaw
- Primary doc: SKILL.md
Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Summarize what changed and any follow-up checks I should run.
Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.
| Topic | File |
| --- | --- |
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
pip install PyMuPDF
Import as fitz (historical name):
import fitz  # PyMuPDF
import fitz

doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
| PDF Type | Method |
| --- | --- |
| Text-based | page.get_text() (fast, accurate) |
| Scanned | OCR with pytesseract (slower) |
| Mixed | Check each page, use OCR when needed |
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
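The per-page check above can be rolled up into a label for the whole document, matching the three PDF types listed earlier. A minimal sketch, assuming the same 50-character threshold; `classify_pages` and `min_chars` are hypothetical names, not part of PyMuPDF, and the function takes the strings already returned by `page.get_text()`:

```python
def classify_pages(page_texts, min_chars=50):
    """Label a document from the text extracted on each page.

    page_texts: list of strings, one per page (e.g. from page.get_text()).
    min_chars: assumed threshold; pages with less text are treated as scanned.
    """
    if not page_texts:
        return "empty"
    # True for every page that has too little text to be trusted
    scanned = [len(text.strip()) < min_chars for text in page_texts]
    if all(scanned):
        return "scanned"
    if any(scanned):
        return "mixed"
    return "text-based"
```

With an open document this could be called as `classify_pages([page.get_text() for page in doc])`. The threshold is a heuristic and may need tuning for documents with many near-empty pages.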
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; unlock them explicitly
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
| Trap | What Happens | Fix |
| --- | --- | --- |
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use a with block or doc.close() |
| Assume page order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | PyMuPDF handles UTF-8 |
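Two of the fixes in the table, closing the document and requesting sorted output, fit in one small helper. This is a sketch under assumptions: `read_pages` and its `opener` parameter are hypothetical (the parameter exists only so the function can be exercised without a real PDF), and by default it falls back to `fitz.open`.

```python
def read_pages(path, opener=None):
    """Return the text of every page, sorted by position on the page."""
    if opener is None:
        import fitz  # PyMuPDF; imported lazily so the helper can be stubbed
        opener = fitz.open
    # The with-block closes the document even if extraction raises
    with opener(path) as doc:
        # sort=True orders text by vertical, then horizontal position
        # rather than the order in which it is stored in the file
        return [page.get_text(sort=True) for page in doc]
```

Typical use would be `pages = read_pages("document.pdf")`. Sorting by position helps with simple layouts; multi-column pages may still need block-level handling.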
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system
text = page.get_text()
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
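Span sizes from the dict output can be used, for instance, to guess which text is a heading. A rough sketch, assuming headings are simply the spans set in the largest font on the page; `largest_spans` is a hypothetical helper that operates on the dictionary returned by `page.get_text("dict")`:

```python
def largest_spans(page_dict):
    """Return the text of spans set in the largest font size on the page."""
    spans = [
        span
        for block in page_dict["blocks"]
        if block["type"] == 0        # skip image blocks, which have no "lines"
        for line in block["lines"]
        for span in line["spans"]
        if span["text"].strip()      # ignore whitespace-only spans
    ]
    if not spans:
        return []
    max_size = max(span["size"] for span in spans)
    return [span["text"] for span in spans if span["size"] == max_size]
```

Usage would look like `largest_spans(page.get_text("dict"))`. Real documents often need a looser rule, such as comparing against the most common body size, so treat this as a starting point.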
import json

data = page.get_text("json")
parsed = json.loads(data)
import fitz

def extract_pdf(path):
    """Extract text from PDF, with OCR fallback for scanned pages."""
    doc = fitz.open(path)
    results = []
    for i, page in enumerate(doc):
        text = page.get_text()
        method = "text"
        # If very little text, might be scanned
        if len(text.strip()) < 50:
            # OCR would go here (see ocr.md)
            method = "needs_ocr"
        results.append({
            "page": i + 1,
            "text": text,
            "method": method
        })
    doc.close()
    return {
        "pages": len(results),
        "content": results,
        "word_count": sum(len(r["text"].split()) for r in results)
    }

# Usage
result = extract_pdf("document.pdf")
print(f"Extracted {result['word_count']} words from {result['pages']} pages")
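The `needs_ocr` placeholder in the example above could be filled in roughly as follows. This is a sketch, not the skill's own OCR setup (that is described in ocr.md): it assumes `pytesseract`, Pillow, and a local Tesseract install are available, and the names `text_or_ocr`, `tesseract_ocr`, `ocr`, and `min_chars` are hypothetical.

```python
import io


def tesseract_ocr(page, dpi=300):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily: only needed when a page actually requires OCR
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # rasterize the page
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)


def text_or_ocr(page, ocr=tesseract_ocr, min_chars=50):
    """Return (text, method), falling back to OCR for near-empty pages."""
    text = page.get_text()
    if len(text.strip()) >= min_chars:
        return text, "text"
    # Too little embedded text: assume a scanned page and run OCR instead
    return ocr(page), "ocr"
```

Passing the OCR function as a parameter keeps the text path free of OCR dependencies, so documents that are entirely text-based never import `pytesseract`.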
Useful? clawhub star extract-pdf-text
Stay updated: clawhub sync