# Send PDF Text Extractor to your agent
Hand the extracted package to your coding agent with a concrete install brief instead of figuring it out manually.
## Fast path
- Download the package from Yavira.
- Extract it into a folder your agent can access.
- Paste one of the prompts below and point your agent at the extracted folder.
## Suggested prompts
### New install

```text
I downloaded a skill package from Yavira. Read SKILL.md from the extracted folder and install it by following the included instructions. Then review README.md for any prerequisites, environment setup, or post-install checks. Tell me what you changed and call out any manual steps you could not complete.
```
### Upgrade existing

```text
I downloaded an updated skill package from Yavira. Read SKILL.md from the extracted folder, compare it with my current installation, and upgrade it while preserving any custom configuration unless the package docs explicitly say otherwise. Then review README.md for any prerequisites, environment setup, or post-install checks. Summarize what changed and any follow-up checks I should run.
```
## Machine-readable fields
```json
{
  "schemaVersion": "1.0",
  "item": {
    "slug": "pdf-text-extractor",
    "name": "PDF Text Extractor",
    "source": "tencent",
    "type": "skill",
    "category": "开发工具",
    "sourceUrl": "https://clawhub.ai/Michael-laffin/pdf-text-extractor",
    "canonicalUrl": "https://clawhub.ai/Michael-laffin/pdf-text-extractor",
    "targetPlatform": "OpenClaw"
  },
  "install": {
    "downloadUrl": "/downloads/pdf-text-extractor",
    "sourceDownloadUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=pdf-text-extractor",
    "sourcePlatform": "tencent",
    "targetPlatform": "OpenClaw",
    "packageFormat": "ZIP package",
    "primaryDoc": "SKILL.md",
    "includedAssets": [
      "README.md",
      "SKILL.md",
      "config.json",
      "index.js",
      "package-lock.json",
      "package.json"
    ],
    "downloadMode": "redirect",
    "sourceHealth": {
      "source": "tencent",
      "slug": "pdf-text-extractor",
      "status": "healthy",
      "reason": "direct_download_ok",
      "recommendedAction": "download",
      "checkedAt": "2026-05-06T20:40:45.021Z",
      "expiresAt": "2026-05-13T20:40:45.021Z",
      "httpStatus": 200,
      "finalUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=pdf-text-extractor",
      "contentType": "application/zip",
      "probeMethod": "head",
      "details": {
        "probeUrl": "https://wry-manatee-359.convex.site/api/v1/download?slug=pdf-text-extractor",
        "contentDisposition": "attachment; filename=\"pdf-text-extractor-1.0.0.zip\"",
        "redirectLocation": null,
        "bodySnippet": null,
        "slug": "pdf-text-extractor"
      },
      "scope": "item",
      "summary": "Item download looks usable.",
      "detail": "Yavira can redirect you to the upstream package for this item.",
      "primaryActionLabel": "Download for OpenClaw",
      "primaryActionHref": "/downloads/pdf-text-extractor"
    },
    "validation": {
      "installChecklist": [
        "Use the Yavira download entry.",
        "Review SKILL.md after the package is downloaded.",
        "Confirm the extracted package contains the expected setup assets."
      ],
      "postInstallChecks": [
        "Confirm the extracted package includes the expected docs or setup files.",
        "Validate the skill or prompts are available in your target agent workspace.",
        "Capture any manual follow-up steps the agent could not complete."
      ]
    }
  },
  "links": {
    "detailUrl": "https://openagent3.xyz/skills/pdf-text-extractor",
    "downloadUrl": "https://openagent3.xyz/downloads/pdf-text-extractor",
    "agentUrl": "https://openagent3.xyz/skills/pdf-text-extractor/agent",
    "manifestUrl": "https://openagent3.xyz/skills/pdf-text-extractor/agent.json",
    "briefUrl": "https://openagent3.xyz/skills/pdf-text-extractor/agent.md"
  }
}
```
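The `sourceHealth` record above carries `status` and `expiresAt` fields, so an agent can check whether the cached probe is still worth trusting before it downloads. A minimal sketch, assuming the manifest has already been parsed into an object; `isHealthUsable` is a hypothetical helper, not part of Yavira or the package:

```javascript
// Hypothetical helper: decide whether a cached sourceHealth record can still be trusted.
// The field names (status, expiresAt) come from the manifest; the function is illustrative.
function isHealthUsable(sourceHealth, now = new Date()) {
  const expires = new Date(sourceHealth.expiresAt);
  if (Number.isNaN(expires.getTime())) return false; // missing or malformed timestamp
  return sourceHealth.status === 'healthy' && now < expires;
}

const sampleHealth = {
  status: 'healthy',
  checkedAt: '2026-05-06T20:40:45.021Z',
  expiresAt: '2026-05-13T20:40:45.021Z',
};

console.log(isHealthUsable(sampleHealth, new Date('2026-05-10T00:00:00Z'))); // true
console.log(isHealthUsable(sampleHealth, new Date('2026-05-14T00:00:00Z'))); // false
```

Passing `now` explicitly keeps the check deterministic, which helps when an agent replays the same brief later.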
## Documentation

### PDF-Text-Extractor - Extract Text from PDFs

Vernox Utility Skill - Perfect for document digitization.

### Overview

PDF-Text-Extractor extracts text content from PDF files without requiring any external tools to be installed. It supports both embedded text extraction (for text-based PDFs) and OCR (for scanned documents).

### ✅ Text Extraction

- Extract text from PDFs without external tools
- Support for both text-based and scanned PDFs
- Preserve document structure and formatting
- Fast extraction (milliseconds for text-based PDFs)

### ✅ OCR Support

- Use Tesseract.js for scanned documents
- Support multiple languages (English, Spanish, French, German)
- Configurable OCR quality/speed
- Fallback to text extraction when possible

### ✅ Batch Processing

- Process multiple PDFs at once
- Batch extraction for document workflows
- Progress tracking for large files
- Error handling and retry logic

### ✅ Output Options

- Plain text output
- JSON output with metadata
- Markdown conversion
- HTML output (preserving links)

### ✅ Utility Features

- Page-by-page extraction
- Character/word counting
- Language detection
- Metadata extraction (author, title, creation date)

### Installation

```bash
clawhub install pdf-text-extractor
```

### Extract Text from PDF

```javascript
const result = await extractText({
  pdfPath: './document.pdf',
  options: {
    outputFormat: 'text',
    ocr: true,
    language: 'eng'
  }
});

console.log(result.text);
console.log(`Pages: ${result.pages}`);
console.log(`Words: ${result.wordCount}`);
```

### Batch Extract Multiple PDFs

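```javascript
const batch = await extractBatch({
  pdfFiles: [
    './document1.pdf',
    './document2.pdf',
    './document3.pdf'
  ],
  options: {
    outputFormat: 'json',
    ocr: true
  }
});

// extractBatch returns an object; the per-file results are in batch.results
console.log(`Extracted ${batch.successCount} of ${batch.results.length} PDFs`);
```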

### Extract with OCR

```javascript
const result = await extractText({
  pdfPath: './scanned-document.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});

// OCR will be used (scanned document detected)
```

### extractText

Extract text content from a single PDF file.

Parameters:

- pdfPath (string, required): Path to the PDF file
- options (object, optional): Extraction options
  - outputFormat (string): 'text' | 'json' | 'markdown' | 'html'
  - ocr (boolean): Enable OCR for scanned documents
  - language (string): OCR language code ('eng', 'spa', 'fra', 'deu')
  - preserveFormatting (boolean): Keep headings and structure
  - minConfidence (number): Minimum OCR confidence score (0-100)

Returns:

- text (string): Extracted text content
- pages (number): Number of pages processed
- wordCount (number): Total word count
- charCount (number): Total character count
- language (string): Detected language
- metadata (object): PDF metadata (title, author, creation date)
- method (string): Extraction method, 'text' or 'ocr'
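To show how the returned fields fit together, here is a small sketch that formats a one-line summary from a result object. `summarizeResult` is a hypothetical helper and the sample object is made up for the demo; neither comes from the skill itself:

```javascript
// Illustrative only: formats a result object shaped like the Returns list above.
// The sample object is invented for the demo; it is not real extractor output.
function summarizeResult(result) {
  const via = result.method === 'ocr' ? 'OCR' : 'embedded text layer';
  const title = (result.metadata && result.metadata.title) || 'untitled';
  return `${title}: ${result.pages} pages, ${result.wordCount} words via ${via}`;
}

const sampleResult = {
  text: 'INVOICE #12345',
  pages: 2,
  wordCount: 350,
  charCount: 2100,
  language: 'eng',
  metadata: { title: 'Invoice', author: 'Accounts' },
  method: 'text',
};

console.log(summarizeResult(sampleResult));
// Invoice: 2 pages, 350 words via embedded text layer
```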

### extractBatch

Extract text from multiple PDF files at once.

Parameters:

- pdfFiles (array, required): Array of PDF file paths
- options (object, optional): Same as extractText

Returns:

- results (array): Array of extraction results
- totalPages (number): Total pages across all PDFs
- successCount (number): Number of successful extractions
- failureCount (number): Number of failed extractions
- errors (array): Error details for failures
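A batch result bundles counts and errors together, so a caller will usually want a short report. The sketch below assumes each entry in `errors` looks like `{ file, message }`; that shape is a guess for illustration, since the documentation does not spell it out, and `reportBatch` is a hypothetical helper:

```javascript
// Illustrative only: summarizes a batch result shaped like the Returns list above.
// ASSUMPTION: each entry in `errors` is { file, message }; the docs do not specify this.
function reportBatch(batch) {
  const total = batch.successCount + batch.failureCount;
  const lines = [`${batch.successCount}/${total} PDFs extracted, ${batch.totalPages} pages`];
  for (const err of batch.errors) {
    lines.push(`failed: ${err.file} (${err.message})`);
  }
  return lines.join('\n');
}

const sampleBatch = {
  results: [{ pages: 3 }, { pages: 5 }],
  totalPages: 8,
  successCount: 2,
  failureCount: 1,
  errors: [{ file: './broken.pdf', message: 'Invalid PDF' }],
};

console.log(reportBatch(sampleBatch));
```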

### countWords

Count words in extracted text.

Parameters:

- text (string, required): Text to count
- options (object, optional):
  - minWordLength (number): Minimum characters per word (default: 3)
  - excludeNumbers (boolean): Don't count numbers as words
  - countByPage (boolean): Return word count per page

Returns:

- wordCount (number): Total word count
- charCount (number): Total character count
- pageCounts (array): Word count per page
- averageWordsPerPage (number): Average words per page
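The `minWordLength` and `excludeNumbers` options can be read as filters over whitespace-separated tokens. The following is a rough sketch of that reading, not the skill's actual implementation; `countWordsSketch` and its number pattern are assumptions made for illustration:

```javascript
// Sketch of how the documented options might behave; NOT the skill's implementation.
function countWordsSketch(text, { minWordLength = 3, excludeNumbers = false } = {}) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const words = tokens.filter((token) => {
    if (token.length < minWordLength) return false;
    // Assumed definition of "number": digits with optional . or , separators
    if (excludeNumbers && /^\d+([.,]\d+)*$/.test(token)) return false;
    return true;
  });
  return { wordCount: words.length, charCount: text.length };
}

const sample = 'Invoice 12345 is due on 2026-02-04';
console.log(countWordsSketch(sample).wordCount); // 4
console.log(countWordsSketch(sample, { minWordLength: 1, excludeNumbers: true }).wordCount); // 5
```

With the default minimum length of 3, short words such as "is" and "on" are dropped, which is worth keeping in mind when comparing counts against other tools.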

### detectLanguage

Detect the language of extracted text.

Parameters:

- text (string, required): Text to analyze
- minConfidence (number): Minimum confidence for detection

Returns:

- language (string): Detected language code
- languageName (string): Full language name
- confidence (number): Confidence score (0-100)
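One plausible use of the detection result is choosing the OCR language for a later pass. The sketch below is an assumption about how a caller might combine the fields; `pickOcrLanguage`, the threshold of 60, and the `'eng'` fallback are all illustrative choices rather than documented behavior:

```javascript
// Hypothetical helper: choose an OCR language from a detection result
// shaped like the Returns list above. Threshold and fallback are arbitrary defaults.
function pickOcrLanguage(detection, supported, { minConfidence = 60, fallback = 'eng' } = {}) {
  const confident = detection.confidence >= minConfidence;
  return confident && supported.includes(detection.language) ? detection.language : fallback;
}

const supported = ['eng', 'spa', 'fra', 'deu'];

console.log(pickOcrLanguage({ language: 'fra', languageName: 'French', confidence: 92 }, supported)); // fra
console.log(pickOcrLanguage({ language: 'fra', languageName: 'French', confidence: 40 }, supported)); // eng
console.log(pickOcrLanguage({ language: 'ita', languageName: 'Italian', confidence: 95 }, supported)); // eng
```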

### Document Digitization

- Convert paper documents to digital text
- Process invoices and receipts
- Digitize contracts and agreements
- Archive physical documents

### Content Analysis

- Extract text for analysis tools
- Prepare content for LLM processing
- Clean up scanned documents
- Parse PDF-based reports

### Data Extraction

- Extract data from PDF reports
- Parse tables from PDFs
- Pull structured data
- Automate document workflows

### Text Processing

- Prepare content for translation
- Clean up OCR output
- Extract specific sections
- Search within PDF content

### Text-Based PDFs

- Speed: ~100 ms for a 10-page PDF
- Accuracy: 100% (the embedded text is read as-is)
- Memory: ~10 MB for a typical document

### OCR Processing

- Speed: ~1-3 s per page (high quality)
- Accuracy: 85-95% (depends on scan quality)
- Memory: ~50-100 MB peak during OCR
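The figures above translate into a quick back-of-the-envelope estimate for a job. The sketch assumes the quoted numbers scale linearly with page count, which real workloads may not do; `estimateMs` is a hypothetical helper:

```javascript
// Rough estimate derived from the figures quoted above; real timings will vary.
const TEXT_MS_PER_PAGE = 10;          // ~100 ms for a 10-page PDF
const OCR_MS_PER_PAGE = [1000, 3000]; // ~1-3 s per page at high quality

function estimateMs(pages, useOcr) {
  if (!useOcr) {
    return { min: pages * TEXT_MS_PER_PAGE, max: pages * TEXT_MS_PER_PAGE };
  }
  return { min: pages * OCR_MS_PER_PAGE[0], max: pages * OCR_MS_PER_PAGE[1] };
}

console.log(estimateMs(50, false)); // { min: 500, max: 500 }
console.log(estimateMs(50, true));  // { min: 50000, max: 150000 }
```

On these numbers a 50-page scan takes roughly 100 to 300 times longer than the same document with a text layer, which is why disabling OCR for text-based PDFs matters.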

### PDF Parsing

- Uses the PDF.js library
- Extracts the text layer directly (no OCR needed)
- Preserves document structure
- Handles password-protected PDFs

### OCR Engine

- Tesseract.js under the hood
- Tesseract supports 100+ languages; the shipped config.json lists eng, spa, fra, and deu
- Adjustable quality/speed tradeoff
- Confidence scoring for accuracy

### Dependencies

- No external dependencies to install
- PDF.js and Tesseract.js are bundled with the skill
- Otherwise uses Node.js built-in modules only

### Invalid PDF

- Clear error message
- Suggest fix (check file format)
- Skip to next file in batch

### OCR Failure

- Report confidence score
- Suggest rescan at higher quality
- Fallback to basic extraction

### Memory Issues

- Stream processing for large files
- Progress reporting
- Graceful degradation

### Edit config.json:

```json
{
  "ocr": {
    "enabled": true,
    "defaultLanguage": "eng",
    "quality": "medium",
    "languages": ["eng", "spa", "fra", "deu"]
  },
  "output": {
    "defaultFormat": "text",
    "preserveFormatting": true,
    "includeMetadata": true
  },
  "batch": {
    "maxConcurrent": 3,
    "timeoutSeconds": 30
  }
}
```
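Before handing an edited config to the skill, it can help to sanity-check it. The rules in this sketch are inferred from the documented fields and values (quality levels, output formats, language list); they are not taken from the skill's own validation, if it has any, and `validateConfig` is a hypothetical helper:

```javascript
// Sketch of a sanity check for config.json. The rules are inferred from the
// documented fields, NOT copied from the skill's own code.
const QUALITIES = ['low', 'medium', 'high'];
const FORMATS = ['text', 'json', 'markdown', 'html'];

function validateConfig(config) {
  const problems = [];
  const { ocr = {}, output = {}, batch = {} } = config;
  if (!QUALITIES.includes(ocr.quality)) {
    problems.push(`ocr.quality must be one of ${QUALITIES.join(', ')}`);
  }
  if (!Array.isArray(ocr.languages) || !ocr.languages.includes(ocr.defaultLanguage)) {
    problems.push('ocr.defaultLanguage must appear in ocr.languages');
  }
  if (!FORMATS.includes(output.defaultFormat)) {
    problems.push(`output.defaultFormat must be one of ${FORMATS.join(', ')}`);
  }
  if (!Number.isInteger(batch.maxConcurrent) || batch.maxConcurrent < 1) {
    problems.push('batch.maxConcurrent must be a positive integer');
  }
  return problems; // an empty array means no problems were found
}

const shipped = {
  ocr: { enabled: true, defaultLanguage: 'eng', quality: 'medium', languages: ['eng', 'spa', 'fra', 'deu'] },
  output: { defaultFormat: 'text', preserveFormatting: true, includeMetadata: true },
  batch: { maxConcurrent: 3, timeoutSeconds: 30 },
};

console.log(validateConfig(shipped)); // []
console.log(validateConfig({ ...shipped, batch: { maxConcurrent: 0, timeoutSeconds: 30 } }));
```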

### Extract from Invoice

```javascript
const invoice = await extractText({ pdfPath: './invoice.pdf' });
console.log(invoice.text);
// "INVOICE #12345 Date: 2026-02-04..."
```

### Extract from Scanned Contract

```javascript
const contract = await extractText({
  pdfPath: './scanned-contract.pdf',
  options: {
    ocr: true,
    language: 'eng',
    ocrQuality: 'high'
  }
});
console.log(contract.text);
// "AGREEMENT This contract between..."
```

### Batch Process Documents

```javascript
const docs = await extractBatch({
  pdfFiles: [
    './doc1.pdf',
    './doc2.pdf',
    './doc3.pdf',
    './doc4.pdf'
  ]
});
console.log(`Processed ${docs.successCount}/${docs.results.length} documents`);
```

### OCR Not Working

- Check if the PDF is truly scanned (not text-based)
- Try different quality settings (low/medium/high)
- Ensure the language matches the document
- Check the image quality of the scan

### Extraction Returns Empty

- PDF may be image-only
- OCR failed with low confidence
- Try a different language setting
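The empty-output checklist above can be turned into a small triage step using the documented `text` and `method` fields. `diagnoseEmpty` is a hypothetical helper and its messages are illustrative; the skill itself may report these cases differently:

```javascript
// Hypothetical triage helper built on the documented result fields (text, method).
// The messages are illustrative, not output produced by the skill.
function diagnoseEmpty(result) {
  if (result.text && result.text.trim().length > 0) return null; // nothing to diagnose
  if (result.method === 'ocr') {
    return 'OCR ran but produced no text: try another language setting or a cleaner scan.';
  }
  return 'No embedded text found: the PDF may be image-only, so retry with ocr: true.';
}

console.log(diagnoseEmpty({ text: '', method: 'text' }));
console.log(diagnoseEmpty({ text: '   ', method: 'ocr' }));
console.log(diagnoseEmpty({ text: 'AGREEMENT', method: 'ocr' })); // null
```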

### Slow Processing

- Large PDFs take longer
- Reduce quality for speed
- Process in smaller batches
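Processing in smaller batches comes down to splitting the file list before each call. A generic sketch follows; `chunk` is not part of the skill, and the batch size of 3 simply mirrors the `maxConcurrent` value in the sample config:

```javascript
// Generic helper, not part of the skill: split a list of paths into fixed-size batches.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

const files = ['a.pdf', 'b.pdf', 'c.pdf', 'd.pdf', 'e.pdf', 'f.pdf', 'g.pdf'];
console.log(chunk(files, 3));
// [ [ 'a.pdf', 'b.pdf', 'c.pdf' ], [ 'd.pdf', 'e.pdf', 'f.pdf' ], [ 'g.pdf' ] ]
```

Each inner array can then be passed as `pdfFiles` to a separate `extractBatch` call, so a failure or slowdown affects only one small group.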

### Best Results

- Use text-based PDFs when possible (faster, 100% accurate)
- Use high-quality scans for OCR (300 DPI+)
- Clean the background before scanning
- Use the correct language setting

### Performance Optimization

- Use batch processing for multiple files
- Disable OCR for text-based PDFs
- Lower OCR quality for speed when acceptable

### Roadmap

- PDF/A support
- Advanced OCR pre-processing
- Table extraction from OCR
- Handwriting OCR
- PDF form field extraction
- Batch language detection
- Confidence scoring visualization

### License

MIT

Extract text from PDFs. Fast, accurate, zero dependencies. 🔮
## Trust
- Source: tencent
- Verification: Indexed source record
- Publisher: Michael-laffin
- Version: 1.0.0
## Source health
- Status: healthy
- Item download looks usable.
- Yavira can redirect you to the upstream package for this item.
- Health scope: item
- Reason: direct_download_ok
- Checked at: 2026-05-06T20:40:45.021Z
- Expires at: 2026-05-13T20:40:45.021Z
- Recommended action: Download for OpenClaw
## Links
- [Detail page](https://openagent3.xyz/skills/pdf-text-extractor)
- [Send to Agent page](https://openagent3.xyz/skills/pdf-text-extractor/agent)
- [JSON manifest](https://openagent3.xyz/skills/pdf-text-extractor/agent.json)
- [Markdown brief](https://openagent3.xyz/skills/pdf-text-extractor/agent.md)
- [Download page](https://openagent3.xyz/downloads/pdf-text-extractor)