Requirements
- Target platform
- OpenClaw
- Install method
- Manual import
- Extraction
- Extract archive
- Prerequisites
- OpenClaw
- Primary doc
- SKILL.md
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
Vernox Utility Skill - Perfect for document digitization.
PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).
Extract text from PDFs without external tools Support for both text-based and scanned PDFs Preserve document structure and formatting Fast extraction (milliseconds for text-based)
Use Tesseract.js for scanned documents Support multiple languages (English, Spanish, French, German) Configurable OCR quality/speed Fallback to text extraction when possible
Process multiple PDFs at once Batch extraction for document workflows Progress tracking for large files Error handling and retry logic
Plain text output JSON output with metadata Markdown conversion HTML output (preserving links)
Page-by-page extraction Character/word counting Language detection Metadata extraction (author, title, creation date)
clawhub install pdf-text-extractor
const result = await extractText({ pdfPath: './document.pdf', options: { outputFormat: 'text', ocr: true, language: 'eng' } }); console.log(result.text); console.log(`Pages: ${result.pages}`); console.log(`Words: ${result.wordCount}`);
const results = await extractBatch({ pdfFiles: [ './document1.pdf', './document2.pdf', './document3.pdf' ], options: { outputFormat: 'json', ocr: true } }); console.log(`Extracted ${results.length} PDFs`);
const result = await extractText({ pdfPath: './scanned-document.pdf', options: { ocr: true, language: 'eng', ocrQuality: 'high' } }); // OCR will be used (scanned document detected)
Extract text content from a single PDF file. Parameters: pdfPath (string, required): Path to PDF file options (object, optional): Extraction options outputFormat (string): 'text' | 'json' | 'markdown' | 'html' ocr (boolean): Enable OCR for scanned docs language (string): OCR language code ('eng', 'spa', 'fra', 'deu') preserveFormatting (boolean): Keep headings/structure minConfidence (number): Minimum OCR confidence score (0-100) Returns: text (string): Extracted text content pages (number): Number of pages processed wordCount (number): Total word count charCount (number): Total character count language (string): Detected language metadata (object): PDF metadata (title, author, creation date) method (string): 'text' or 'ocr' (extraction method)
Extract text from multiple PDF files at once. Parameters: pdfFiles (array, required): Array of PDF file paths options (object, optional): Same as extractText Returns: results (array): Array of extraction results totalPages (number): Total pages across all PDFs successCount (number): Successfully extracted failureCount (number): Failed extractions errors (array): Error details for failures
Count words in extracted text. Parameters: text (string, required): Text to count options (object, optional): minWordLength (number): Minimum characters per word (default: 3) excludeNumbers (boolean): Don't count numbers as words countByPage (boolean): Return word count per page Returns: wordCount (number): Total word count charCount (number): Total character count pageCounts (array): Word count per page averageWordsPerPage (number): Average words per page
Detect the language of extracted text. Parameters: text (string, required): Text to analyze minConfidence (number): Minimum confidence for detection Returns: language (string): Detected language code languageName (string): Full language name confidence (number): Confidence score (0-100)
Convert paper documents to digital text Process invoices and receipts Digitize contracts and agreements Archive physical documents
Extract text for analysis tools Prepare content for LLM processing Clean up scanned documents Parse PDF-based reports
Extract data from PDF reports Parse tables from PDFs Pull structured data Automate document workflows
Prepare content for translation Clean up OCR output Extract specific sections Search within PDF content
Speed: ~100ms for 10-page PDF Accuracy: 100% (exact text) Memory: ~10MB for typical document
Speed: ~1-3s per page (high quality) Accuracy: 85-95% (depends on scan quality) Memory: ~50-100MB peak during OCR
Uses native PDF.js library Extracts text layer directly (no OCR needed) Preserves document structure Handles password-protected PDFs
Tesseract.js under the hood Supports 100+ languages Adjustable quality/speed tradeoff Confidence scoring for accuracy
ZERO external dependencies Uses Node.js built-in modules only PDF.js included in skill Tesseract.js bundled
Clear error message Suggest fix (check file format) Skip to next file in batch
Report confidence score Suggest rescan at higher quality Fallback to basic extraction
Stream processing for large files Progress reporting Graceful degradation
{ "ocr": { "enabled": true, "defaultLanguage": "eng", "quality": "medium", "languages": ["eng", "spa", "fra", "deu"] }, "output": { "defaultFormat": "text", "preserveFormatting": true, "includeMetadata": true }, "batch": { "maxConcurrent": 3, "timeoutSeconds": 30 } }
const invoice = await extractText('./invoice.pdf'); console.log(invoice.text); // "INVOICE #12345 Date: 2026-02-04..."
const contract = await extractText('./scanned-contract.pdf', { ocr: true, language: 'eng', ocrQuality: 'high' }); console.log(contract.text); // "AGREEMENT This contract between..."
const docs = await extractBatch([ './doc1.pdf', './doc2.pdf', './doc3.pdf', './doc4.pdf' ]); console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
Check if PDF is truly scanned (not text-based) Try different quality settings (low/medium/high) Ensure language matches document Check image quality of scan
PDF may be image-only OCR failed with low confidence Try different language setting
Large PDF takes longer Reduce quality for speed Process in smaller batches
Use text-based PDFs when possible (faster, 100% accurate) High-quality scans for OCR (300 DPI+) Clean background before scanning Use correct language setting
Batch processing for multiple files Disable OCR for text-based PDFs Lower OCR quality for speed when acceptable
PDF/A support Advanced OCR pre-processing Table extraction from OCR Handwriting OCR PDF form field extraction Batch language detection Confidence scoring visualization
MIT Extract text from PDFs. Fast, accurate, zero dependencies. ๐ฎ
Code helpers, APIs, CLIs, browser automation, testing, and developer operations.
Largest current source with strong distribution and engagement signals.