Official | Verified | file management | Safety 4/5

pdf-text-extractor

Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.

Why use this skill?

Extract text from PDFs with OpenClaw's pdf-text-extractor. It supports OCR, batch processing, and multiple output formats for digital document workflows.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/michael-laffin/pdf-text-extractor

What This Skill Does

The pdf-text-extractor is a Vernox Utility Skill for document digitization and analysis within the OpenClaw environment. It turns static PDF files into usable data through a single extraction pipeline that handles both text-based digital documents and image-heavy scanned PDFs. Using Tesseract.js for OCR, the skill detects non-selectable text and converts it into machine-readable form. Beyond raw extraction, it provides structured metadata, multi-language support, and four output formats: plain text, JSON, Markdown, and HTML.
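To make the four output formats concrete, the sketch below renders one extraction result as plain text, JSON, Markdown, or HTML. It is an illustration only: the `renderResult` function and the `{ title, pages }` result shape are assumptions made for this example, not the skill's actual API or output schema.

```javascript
// Hypothetical result shape: { title: string, pages: string[] }.
// Not the skill's real schema; used here only to illustrate the formats.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

function renderResult(result, format) {
  switch (format) {
    case "text":
      // Pages separated by a blank line.
      return result.pages.join("\n\n");
    case "json":
      return JSON.stringify(result, null, 2);
    case "markdown": {
      const sections = result.pages.map(
        (page, i) => `## Page ${i + 1}\n\n${page}`
      );
      return [`# ${result.title}`, ...sections].join("\n\n");
    }
    case "html": {
      const paragraphs = result.pages.map((page) => `<p>${escapeHtml(page)}</p>`);
      return [`<h1>${escapeHtml(result.title)}</h1>`, ...paragraphs].join("\n");
    }
    default:
      throw new Error(`Unsupported format: ${format}`);
  }
}

const sample = { title: "Manual", pages: ["Intro <draft>", "Setup"] };
console.log(renderResult(sample, "markdown"));
```

Keeping the extracted content in one neutral structure and rendering it at the end means each format only needs its own escaping rules, as the HTML branch shows.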

Installation

To integrate this utility into your OpenClaw agent, execute the following command in your terminal:

clawhub install openclaw/skills/skills/michael-laffin/pdf-text-extractor

Make sure your environment allows local file access: the skill reads source documents directly from the directories you specify.

Use Cases

Automated Invoicing: Efficiently extract line items, dates, and total amounts from scanned vendor invoices to feed into your accounting software.
Document Archiving: Digitize legacy physical documents by converting them into searchable, structured Markdown or HTML files for your internal knowledge base.
Data Analysis: Quickly aggregate content from hundreds of PDFs for corpus analysis, metadata logging, or trend tracking across large document sets.
Accessibility: Convert image-only PDFs into readable, text-based formats for screen readers and search indexing.
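For the invoicing use case, once the text has been extracted, pulling out a total is ordinary string processing. The following is a minimal sketch, assuming the extracted text contains a line such as `Total: $1,234.50`. The `extractInvoiceTotal` helper and that line format are hypothetical; real invoices vary widely and usually need more careful parsing.

```javascript
// Hypothetical helper: finds a line like "Total: $1,234.50" in extracted text.
// Returns { total: number } or null when no such line exists.
function extractInvoiceTotal(text) {
  // Anchored to line start so "Subtotal: ..." lines are not matched.
  const pattern = /^\s*total\s*:?\s*[$€£]?\s*(\d[\d,]*(?:\.\d{1,2})?)\s*$/im;
  const match = text.match(pattern);
  if (!match) return null;
  // Drop thousands separators before converting to a number.
  return { total: Number(match[1].replace(/,/g, "")) };
}

const invoiceText = "Invoice #17\nSubtotal: $1,200.00\nTotal: $1,234.50\n";
console.log(JSON.stringify(extractInvoiceTotal(invoiceText)));
```

OCR output from scans often contains misread characters, so any value extracted this way is worth validating before it reaches accounting software.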

Example Prompts

"Please scan the invoice located at ./invoices/january.pdf, extract the total amount as JSON, and save the metadata for my records."
"Go through the folder ./reports, process all PDFs using high-quality OCR, and generate a combined Markdown summary for each file."
"Extract all text from the manual at ./docs/manual.pdf and format it as HTML, preserving the original headings and link structures."

Tips & Limitations

OCR Quality: For scanned documents, set the ocrQuality: 'high' option for the best results, and expect longer processing times.
Language Settings: Specify the language code (e.g., 'eng', 'fra') whenever you work with non-English documents to improve character recognition accuracy.
File Size: The skill supports batch processing, but documents of several hundred pages can require significant system memory; consider splitting such files if you hit performance bottlenecks.
Fallback Logic: The tool attempts direct text extraction first. If you already know a document consists of flattened images or scans, set ocr: true explicitly to skip that unnecessary first step.
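The fallback behaviour described in the last tip can be sketched as a small decision function. This is an illustration, not the skill's implementation: `planExtraction`, `minChars`, the `language` key, and the `'standard'` default quality are hypothetical, and only the `ocr` and `ocrQuality` options are named in the tips.

```javascript
// Hypothetical defaults. Only `ocr` and `ocrQuality` come from the skill's tips;
// the `language` key and the "standard" quality value are assumptions.
const DEFAULT_OPTIONS = { ocr: false, ocrQuality: "standard", language: "eng" };

// embeddedText: whatever a direct text-layer read returned for the document.
// minChars: hypothetical threshold below which the text layer counts as empty.
function planExtraction(embeddedText, userOptions = {}, minChars = 20) {
  const options = { ...DEFAULT_OPTIONS, ...userOptions };

  // An explicit ocr: true skips the text-layer attempt entirely.
  if (options.ocr) {
    return { steps: ["ocr"], options };
  }

  // Otherwise try the text layer first and fall back to OCR when it is (nearly) empty.
  const hasTextLayer = embeddedText.trim().length >= minChars;
  return { steps: hasTextLayer ? ["text"] : ["text", "ocr"], options };
}

console.log(planExtraction("", { ocr: true, ocrQuality: "high", language: "fra" }));
```

In this sketch, a scanned document without `ocr: true` costs one wasted text-layer attempt before OCR runs, which is the step the explicit flag avoids.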

Metadata

Author: @michael-laffin

Stars: 1401

Updated: 2026-02-24

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-michael-laffin-pdf-text-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}
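If your clawhub.json already contains other entries, the fragment has to be merged into them instead of pasted over them. Below is a minimal sketch of that merge on an already parsed config object. The `enablePlugin` helper and the `some-other-plugin` entry are hypothetical, and the only keys assumed are the ones shown in the fragment above.

```javascript
// Hypothetical helper: returns a new config with the plugin entry added or updated,
// leaving every other key and plugin untouched.
function enablePlugin(config, pluginId, settings = { enabled: true, auto_update: true }) {
  const plugins = config.plugins ?? {};
  return {
    ...config,
    plugins: {
      ...plugins,
      [pluginId]: { ...(plugins[pluginId] ?? {}), ...settings },
    },
  };
}

// "some-other-plugin" is a made-up entry standing in for existing configuration.
const existing = { plugins: { "some-other-plugin": { enabled: false } } };
const updated = enablePlugin(existing, "official-michael-laffin-pdf-text-extractor");
console.log(JSON.stringify(updated, null, 2));
```

The function copies rather than mutates, so the original object can still be compared against the result before anything is written back to disk.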

Tags (AI)

#pdf #ocr #digitization #extraction #document-processing

Safety Score: 4/5

Flags: file-read

Related Skills

price-tracker

Monitor product prices across Amazon, eBay, Walmart, and Best Buy to identify arbitrage opportunities and profit margins. Use when finding products to flip, monitoring competitor pricing, tracking price history, identifying arbitrage opportunities, or setting automated price alerts.

michael-laffin 1401

seo-article-gen

SEO-optimized article generator with automatic affiliate link integration. Generate high-ranking content with keyword research, structured data, and monetization built-in.

michael-laffin 1401

affiliate-master

Full-stack affiliate marketing automation for OpenClaw agents. Generate, track, and optimize affiliate links with FTC-compliant disclosures and multi-network support.

michael-laffin 1401

product-description-generator

Generate SEO-optimized product descriptions for e-commerce platforms (Amazon, Shopify, eBay, Etsy). Create compelling, conversion-focused copy with keywords, features, benefits, and calls-to-action. Use when creating product listings, optimizing existing descriptions, or generating bulk product copy.

michael-laffin 1401

review-summarizer

Scrape, analyze, and summarize product reviews from multiple platforms (Amazon, Google, Yelp, TripAdvisor). Extract key insights, sentiment analysis, pros/cons, and recommendations. Use when researching products for arbitrage, creating affiliate content, or making purchasing decisions.

michael-laffin 1401