Official Verified file management Safety 4/5

pdf-text-extractor

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

Why use this skill?

Easily extract text from any PDF with OpenClaw's pdf-text-extractor. Supports OCR, batch processing, and multiple output formats for efficient digital workflows.

What This Skill Does

pdf-text-extractor is a Vernox Utility Skill for digitizing and analyzing documents within the OpenClaw environment. It handles both text-based digital PDFs and image-heavy scanned PDFs through a single extraction pipeline. For scans, it relies on Tesseract.js for OCR: the skill automatically detects non-selectable text and converts it into machine-readable text. Beyond raw extraction, it provides structured metadata, multi-language support, and four output formats: plain text, JSON, Markdown, and HTML.
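
To make the four output formats concrete, here is a minimal sketch of how per-page text could be rendered as plain text, JSON, Markdown, or HTML. The names `ExtractedPage`, `OutputFormat`, and `render` are hypothetical and chosen for illustration; the skill's actual output layout is not documented here and may differ.

```typescript
// Illustrative only: these types and functions are assumptions,
// not the skill's real API.
type OutputFormat = "text" | "json" | "markdown" | "html";

interface ExtractedPage {
  pageNumber: number;
  text: string;
}

// Escape the characters that would otherwise be read as HTML markup.
function escapeHtml(s: string): string {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Render already-extracted pages in one of the four formats.
function render(pages: ExtractedPage[], format: OutputFormat): string {
  switch (format) {
    case "text":
      return pages.map((p) => p.text).join("\n\n");
    case "json":
      return JSON.stringify({ pageCount: pages.length, pages }, null, 2);
    case "markdown":
      return pages.map((p) => `## Page ${p.pageNumber}\n\n${p.text}`).join("\n\n");
    case "html":
      return pages
        .map((p) => `<section data-page="${p.pageNumber}"><pre>${escapeHtml(p.text)}</pre></section>`)
        .join("\n");
  }
}
```

The HTML branch escapes the text before wrapping it, since OCR output can contain `<` or `&` characters.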

Installation

To integrate this utility into your OpenClaw agent, execute the following command in your terminal:

clawhub install openclaw/skills/skills/michael-laffin/pdf-text-extractor

Make sure your environment allows local file access: the skill reads source documents directly from the directories you specify.

Use Cases

  • Automated Invoicing: Efficiently extract line items, dates, and total amounts from scanned vendor invoices to feed into your accounting software.
  • Document Archiving: Digitize legacy physical documents by converting them into searchable, structured Markdown or HTML files for your internal knowledge base.
  • Data Analysis: Quickly aggregate content from hundreds of PDFs for corpus analysis, metadata logging, or trend tracking across large document sets.
  • Accessibility: Convert image-only PDFs into readable, text-based formats for screen readers and search indexing.
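
As a rough illustration of the Automated Invoicing case, the sketch below pulls a total amount and any ISO-formatted dates out of text that has already been extracted. The function `parseInvoiceText`, the `InvoiceFields` shape, and the regular expressions are assumptions made for illustration and are not part of the skill. Real invoices vary widely in layout, and this sketch assumes US-style numbers such as `1,234.50`.

```typescript
// Hypothetical post-processing step, not part of the skill itself.
interface InvoiceFields {
  total: number | null; // null when no "Total" line with a number is found
  dates: string[];      // every YYYY-MM-DD string found in the text
}

function parseInvoiceText(text: string): InvoiceFields {
  // Match "Total" as a whole word (so "Subtotal" is skipped), then the
  // first number that follows on the same line.
  const totalPattern = /\btotal\b[^\d\n]*(\d[\d,]*(?:\.\d{1,2})?)/gi;

  // Keep the last match, on the assumption that the grand total comes last.
  let last: string | null = null;
  let match: RegExpExecArray | null;
  while ((match = totalPattern.exec(text)) !== null) {
    last = match[1];
  }

  const total = last === null ? null : Number.parseFloat(last.replace(/,/g, ""));
  const dates = text.match(/\b\d{4}-\d{2}-\d{2}\b/g) || [];
  return { total, dates: [...dates] };
}

const sample = [
  "Invoice date: 2026-01-15",
  "Subtotal: $1,200.00",
  "Tax: $34.50",
  "Total due: $1,234.50",
].join("\n");

console.log(JSON.stringify(parseInvoiceText(sample)));
```

OCR output for scanned invoices is often noisy, so any field parsed this way should be validated before it reaches accounting software.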

Example Prompts

  1. "Please scan the invoice located at ./invoices/january.pdf, extract the total amount as JSON, and save the metadata for my records."
  2. "Go through the folder ./reports, process all PDFs using high-quality OCR, and generate a combined Markdown summary for each file."
  3. "Extract all text from the manual at ./docs/manual.pdf and format it as HTML, preserving the original headings and link structures."

Tips & Limitations

  • OCR Quality: For best results with scanned documents, set ocrQuality: 'high'. Note that this increases processing time.
  • Language Settings: Always specify the language code (e.g., 'eng', 'fra') when working with non-English documents to improve character recognition accuracy.
  • File Size: The skill handles batch processing efficiently, but very large documents (several hundred pages) may require significant system memory; consider splitting such files if you encounter performance bottlenecks.
  • Fallback Logic: The tool automatically attempts text extraction first. If you know your document contains flattened images or scans, set ocr: true explicitly to skip that unnecessary step.
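
The tips above can be sketched as a small decision helper. Only `ocr: true`, `ocrQuality: 'high'`, and language codes such as `'eng'` and `'fra'` come from this page. The `ExtractOptions` interface, the `'standard'` quality name, the `language` key, and the functions `chooseStrategy` and `pageRanges` are hypothetical and shown only to illustrate the logic.

```typescript
// Assumed option shape; the skill's real option names may differ.
interface ExtractOptions {
  ocr?: boolean;                    // force OCR and skip the text-layer attempt
  ocrQuality?: "standard" | "high"; // 'standard' is an assumed name
  language?: string;                // e.g. 'eng', 'fra'
}

type Strategy = "text-layer" | "ocr";

// Decide how to process one document. The text-layer probe is passed as a
// callback so that it is never run when OCR is forced.
function chooseStrategy(
  options: ExtractOptions,
  countTextLayerChars: () => number,
  minChars: number = 20, // assumed threshold for "has selectable text"
): Strategy {
  if (options.ocr === true) return "ocr";
  return countTextLayerChars() >= minChars ? "text-layer" : "ocr";
}

// Split a large document into inclusive, 1-based page ranges so that each
// chunk can be processed separately to limit memory use.
function pageRanges(totalPages: number, chunkSize: number): Array<[number, number]> {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError("chunkSize must be a positive integer");
  }
  const ranges: Array<[number, number]> = [];
  for (let start = 1; start <= totalPages; start += chunkSize) {
    ranges.push([start, Math.min(start + chunkSize - 1, totalPages)]);
  }
  return ranges;
}
```

For a 250-page document, `pageRanges(250, 100)` yields the ranges 1 to 100, 101 to 200, and 201 to 250.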

Metadata

Stars: 1401
Views: 0
Updated: 2026-02-24

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-michael-laffin-pdf-text-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}
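
If clawhub.json already lists other plugins, pasting the fragment over the file would drop them. The sketch below merges the entry into an existing configuration object instead. The plugin key and the `enabled` and `auto_update` fields are taken from the fragment above. The `enablePlugin` helper and the `ClawhubConfig` type are hypothetical, and reading and writing the file is left out.

```typescript
// Field names mirror the JSON fragment above; the types are assumptions.
interface PluginEntry {
  enabled: boolean;
  auto_update: boolean;
}

interface ClawhubConfig {
  plugins?: Record<string, PluginEntry>;
  [key: string]: unknown; // any other top-level settings are preserved as-is
}

const PLUGIN_ID = "official-michael-laffin-pdf-text-extractor";

// Return a new config with the plugin enabled, leaving the input untouched
// and keeping any plugins that are already configured.
function enablePlugin(config: ClawhubConfig, id: string = PLUGIN_ID): ClawhubConfig {
  return {
    ...config,
    plugins: {
      ...(config.plugins || {}),
      [id]: { enabled: true, auto_update: true },
    },
  };
}

const existingConfig: ClawhubConfig = {
  plugins: { "some-other-plugin": { enabled: false, auto_update: false } },
};

console.log(JSON.stringify(enablePlugin(existingConfig), null, 2));
```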

Tags (AI)

#pdf #ocr #digitization #extraction #document-processing

Safety Score: 4/5

Flags: file-read