โ† All skills
Tencent SkillHub ยท Developer Tools

PDF Text Extractor

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

skill openclawclawhub Free
0 Downloads
0 Stars
0 Installs
0 Score
High Signal

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

โฌ‡ 0 downloads โ˜… 0 stars Unverified but indexed

Install for OpenClaw

Quick setup
  1. Download the package from Yavira.
  2. Extract the archive and review SKILL.md first.
  3. Import or place the package into your OpenClaw setup.

Requirements

Target platform
OpenClaw
Install method
Manual import
Extraction
Extract archive
Prerequisites
OpenClaw
Primary doc
SKILL.md

Package facts

Download mode
Yavira redirect
Package format
ZIP package
Source platform
Tencent SkillHub
What's included
README.md, SKILL.md, config.json, index.js, package-lock.json, package.json

Validation

  • Use the Yavira download entry.
  • Review SKILL.md after the package is downloaded.
  • Confirm the extracted package contains the expected setup assets.

Install with your agent

Agent handoff

Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.

  1. Download the package from Yavira.
  2. Extract it into a folder your agent can access.
  3. Paste one of the prompts below and point your agent at the extracted folder.
New install

I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.

Upgrade existing

I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.

Trust & source

Release facts

Source
Tencent SkillHub
Verification
Indexed source record
Version
1.0.0

Documentation

ClawHub primary doc Primary doc: SKILL.md 38 sections Open source page

PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

Overview

PDF-Text-Extractor is a zero-dependency tool for extracting text content from PDF files. Supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

โœ… Text Extraction

Extract text from PDFs without external tools Support for both text-based and scanned PDFs Preserve document structure and formatting Fast extraction (milliseconds for text-based)

โœ… OCR Support

Use Tesseract.js for scanned documents Support multiple languages (English, Spanish, French, German) Configurable OCR quality/speed Fallback to text extraction when possible

โœ… Batch Processing

Process multiple PDFs at once Batch extraction for document workflows Progress tracking for large files Error handling and retry logic

โœ… Output Options

Plain text output JSON output with metadata Markdown conversion HTML output (preserving links)

โœ… Utility Features

Page-by-page extraction Character/word counting Language detection Metadata extraction (author, title, creation date)

Installation

clawhub install pdf-text-extractor

Extract Text from PDF

const result = await extractText({ pdfPath: './document.pdf', options: { outputFormat: 'text', ocr: true, language: 'eng' } }); console.log(result.text); console.log(`Pages: ${result.pages}`); console.log(`Words: ${result.wordCount}`);

Batch Extract Multiple PDFs

const results = await extractBatch({ pdfFiles: [ './document1.pdf', './document2.pdf', './document3.pdf' ], options: { outputFormat: 'json', ocr: true } }); console.log(`Extracted ${results.length} PDFs`);

Extract with OCR

const result = await extractText({ pdfPath: './scanned-document.pdf', options: { ocr: true, language: 'eng', ocrQuality: 'high' } }); // OCR will be used (scanned document detected)

extractText

Extract text content from a single PDF file. Parameters: pdfPath (string, required): Path to PDF file options (object, optional): Extraction options outputFormat (string): 'text' | 'json' | 'markdown' | 'html' ocr (boolean): Enable OCR for scanned docs language (string): OCR language code ('eng', 'spa', 'fra', 'deu') preserveFormatting (boolean): Keep headings/structure minConfidence (number): Minimum OCR confidence score (0-100) Returns: text (string): Extracted text content pages (number): Number of pages processed wordCount (number): Total word count charCount (number): Total character count language (string): Detected language metadata (object): PDF metadata (title, author, creation date) method (string): 'text' or 'ocr' (extraction method)

extractBatch

Extract text from multiple PDF files at once. Parameters: pdfFiles (array, required): Array of PDF file paths options (object, optional): Same as extractText Returns: results (array): Array of extraction results totalPages (number): Total pages across all PDFs successCount (number): Successfully extracted failureCount (number): Failed extractions errors (array): Error details for failures

countWords

Count words in extracted text. Parameters: text (string, required): Text to count options (object, optional): minWordLength (number): Minimum characters per word (default: 3) excludeNumbers (boolean): Don't count numbers as words countByPage (boolean): Return word count per page Returns: wordCount (number): Total word count charCount (number): Total character count pageCounts (array): Word count per page averageWordsPerPage (number): Average words per page

detectLanguage

Detect the language of extracted text. Parameters: text (string, required): Text to analyze minConfidence (number): Minimum confidence for detection Returns: language (string): Detected language code languageName (string): Full language name confidence (number): Confidence score (0-100)

Document Digitization

Convert paper documents to digital text Process invoices and receipts Digitize contracts and agreements Archive physical documents

Content Analysis

Extract text for analysis tools Prepare content for LLM processing Clean up scanned documents Parse PDF-based reports

Data Extraction

Extract data from PDF reports Parse tables from PDFs Pull structured data Automate document workflows

Text Processing

Prepare content for translation Clean up OCR output Extract specific sections Search within PDF content

Text-Based PDFs

Speed: ~100ms for 10-page PDF Accuracy: 100% (exact text) Memory: ~10MB for typical document

OCR Processing

Speed: ~1-3s per page (high quality) Accuracy: 85-95% (depends on scan quality) Memory: ~50-100MB peak during OCR

PDF Parsing

Uses native PDF.js library Extracts text layer directly (no OCR needed) Preserves document structure Handles password-protected PDFs

OCR Engine

Tesseract.js under the hood Supports 100+ languages Adjustable quality/speed tradeoff Confidence scoring for accuracy

Dependencies

ZERO external dependencies Uses Node.js built-in modules only PDF.js included in skill Tesseract.js bundled

Invalid PDF

Clear error message Suggest fix (check file format) Skip to next file in batch

OCR Failure

Report confidence score Suggest rescan at higher quality Fallback to basic extraction

Memory Issues

Stream processing for large files Progress reporting Graceful degradation

Edit config.json:

{ "ocr": { "enabled": true, "defaultLanguage": "eng", "quality": "medium", "languages": ["eng", "spa", "fra", "deu"] }, "output": { "defaultFormat": "text", "preserveFormatting": true, "includeMetadata": true }, "batch": { "maxConcurrent": 3, "timeoutSeconds": 30 } }

Extract from Invoice

const invoice = await extractText('./invoice.pdf'); console.log(invoice.text); // "INVOICE #12345 Date: 2026-02-04..."

Extract from Scanned Contract

const contract = await extractText('./scanned-contract.pdf', { ocr: true, language: 'eng', ocrQuality: 'high' }); console.log(contract.text); // "AGREEMENT This contract between..."

Batch Process Documents

const docs = await extractBatch([ './doc1.pdf', './doc2.pdf', './doc3.pdf', './doc4.pdf' ]); console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);

OCR Not Working

Check if PDF is truly scanned (not text-based) Try different quality settings (low/medium/high) Ensure language matches document Check image quality of scan

Extraction Returns Empty

PDF may be image-only OCR failed with low confidence Try different language setting

Slow Processing

Large PDF takes longer Reduce quality for speed Process in smaller batches

Best Results

Use text-based PDFs when possible (faster, 100% accurate) High-quality scans for OCR (300 DPI+) Clean background before scanning Use correct language setting

Performance Optimization

Batch processing for multiple files Disable OCR for text-based PDFs Lower OCR quality for speed when acceptable

Roadmap

PDF/A support Advanced OCR pre-processing Table extraction from OCR Handwriting OCR PDF form field extraction Batch language detection Confidence scoring visualization

License

MIT Extract text from PDFs. Fast, accurate, zero dependencies. ๐Ÿ”ฎ

Category context

Code helpers, APIs, CLIs, browser automation, testing, and developer operations.

Source: Tencent SkillHub

Largest current source with strong distribution and engagement signals.

Package contents

Included in package
3 Config2 Docs1 Scripts
  • SKILL.md Primary doc
  • README.md Docs
  • index.js Scripts
  • config.json Config
  • package-lock.json Config
  • package.json Config