pdf-text-extractor
Extract text from PDFs with OCR support. Perfect for digitizing documents, processing invoices, or analyzing content. Zero dependencies required.
Why use this skill?
Easily extract text from any PDF with OpenClaw's pdf-text-extractor. Supports OCR, batch processing, and multiple output formats for efficient digital workflows.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/michael-laffin/pdf-text-extractor
What This Skill Does
The pdf-text-extractor is a versatile Vernox Utility Skill designed for seamless document digitization and analysis within the OpenClaw environment. It serves as a powerful bridge between static PDF files and actionable data. Whether you are dealing with text-based digital documents or image-heavy scanned PDFs, this tool provides a unified pipeline for content extraction. By leveraging Tesseract.js for robust OCR capabilities, the skill automatically detects non-selectable text and converts it into machine-readable formats. It goes beyond simple extraction by providing structured metadata, multi-language support, and flexible output formats including plain text, JSON, Markdown, and HTML, making it an essential tool for complex document processing workflows.
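To make the four output formats concrete, here is a minimal sketch of how one extraction result could be rendered as plain text, JSON, Markdown, or HTML. The `ExtractionResult` shape, the `render` function, and the `escapeHtml` helper are hypothetical names chosen for illustration; the skill's actual output schema and API are not documented on this page and may differ.

```typescript
// Hypothetical result shape (an assumption, not the skill's documented schema).
interface ExtractedPage {
  pageNumber: number;
  text: string;
}

interface ExtractionResult {
  title: string;
  pages: ExtractedPage[];
}

type OutputFormat = "text" | "json" | "markdown" | "html";

// Escape the characters that would otherwise be read as HTML markup.
function escapeHtml(raw: string): string {
  return raw
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// Render the same extraction result in any of the four formats.
function render(result: ExtractionResult, format: OutputFormat): string {
  switch (format) {
    case "text":
      // Pages separated by a blank line, no markup.
      return result.pages.map((p) => p.text).join("\n\n");
    case "json":
      // Structured output, pretty-printed with two-space indentation.
      return JSON.stringify(result, null, 2);
    case "markdown":
      return [
        `# ${result.title}`,
        ...result.pages.map((p) => `## Page ${p.pageNumber}\n\n${p.text}`),
      ].join("\n\n");
    case "html":
      return [
        `<h1>${escapeHtml(result.title)}</h1>`,
        ...result.pages.map(
          (p) => `<h2>Page ${p.pageNumber}</h2>\n<p>${escapeHtml(p.text)}</p>`
        ),
      ].join("\n");
  }
}

// Illustrative sample data, not output from the skill.
const sample: ExtractionResult = {
  title: "Sample",
  pages: [
    { pageNumber: 1, text: "Total < 100 & paid" },
    { pageNumber: 2, text: "Thank you" },
  ],
};
```

Calling `render(sample, "html")` escapes `<` and `&` in the page text, which matters when extracted content is placed into a knowledge base page.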
Installation
To integrate this utility into your OpenClaw agent, execute the following command in your terminal:
clawhub install openclaw/skills/skills/michael-laffin/pdf-text-extractor
Ensure that your environment allows for local file access, as the skill operates by reading source documents directly from your specified directories.
Use Cases
- Automated Invoicing: Efficiently extract line items, dates, and total amounts from scanned vendor invoices to feed into your accounting software.
- Document Archiving: Digitize legacy physical documents by converting them into searchable, structured Markdown or HTML files for your internal knowledge base.
- Data Analysis: Quickly aggregate content from hundreds of PDFs for corpus analysis, metadata logging, or trend tracking across large document sets.
- Accessibility: Convert image-only PDFs into readable, text-based formats for screen readers and search indexing.
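As an illustration of the invoicing use case, the sketch below pulls a total amount out of text that has already been extracted. The `findTotal` function and the line pattern it matches are assumptions made for this example; real invoices vary widely, and the skill itself may expose totals differently.

```typescript
// Hypothetical post-processing helper; not part of the skill's API.
interface InvoiceTotal {
  amount: number; // parsed numeric value
  raw: string;    // the amount exactly as it appeared in the text
}

// Find a line such as "Total: $1,234.56" or "TOTAL 99.00".
// The pattern is anchored to the start of a line, so "Subtotal" is not matched.
function findTotal(text: string): InvoiceTotal | null {
  const pattern = /^\s*total(?:\s+due)?\s*:?\s*[$€£]?\s*([\d,]+(?:\.\d{2})?)\s*$/im;
  const match = pattern.exec(text);
  if (match === null) return null;
  const raw = match[1] ?? "";
  // Strip thousands separators before converting to a number.
  return { amount: Number(raw.replace(/,/g, "")), raw };
}

// Illustrative invoice text, not output from the skill.
const invoiceText = [
  "Invoice #17",
  "Subtotal: $1,100.00",
  "Tax: $134.56",
  "Total: $1,234.56",
].join("\n");

const total = findTotal(invoiceText);
```

A pattern like this is only a starting point: OCR output from scanned invoices may contain misread characters, so results are worth validating before they reach accounting software.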
Example Prompts
- "Please scan the invoice located at ./invoices/january.pdf, extract the total amount as JSON, and save the metadata for my records."
- "Go through the folder ./reports, process all PDFs using high-quality OCR, and generate a combined Markdown summary for each file."
- "Extract all text from the manual at ./docs/manual.pdf and format it as HTML, preserving the original headings and link structures."
Tips & Limitations
- OCR Quality: For best results with scanned documents, use the ocrQuality: 'high' option, though note this will increase processing time.
- Language Settings: Always specify the language code (e.g., 'eng', 'fra') if you are working with non-English documents to improve character recognition accuracy.
- File Size: While the skill handles batch processing efficiently, processing extremely large multi-hundred-page documents may require significant system memory; consider splitting these files if you encounter performance bottlenecks.
- Fallback Logic: The tool automatically attempts text extraction first; use the ocr: true flag explicitly if you know your document contains flattened images or scans to skip unnecessary processing steps.
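The fallback behavior and options described in these tips can be sketched as follows. Only `ocr` and `ocrQuality: 'high'` are named on this page; the `language` key, the `'standard'` quality level, the defaults, and the `chooseStrategy` and `resolveOptions` functions are all assumptions made for illustration and may not match the skill's real interface.

```typescript
// Hypothetical options object; only `ocr` and `ocrQuality: 'high'` come from the tips.
interface ExtractOptions {
  ocr?: boolean;                    // force OCR and skip the text-layer attempt
  ocrQuality?: "standard" | "high"; // 'standard' is an assumed default level
  language?: string;                // assumed key name; codes such as 'eng' or 'fra'
}

// Hypothetical per-page input: the embedded text layer, empty for scanned pages.
interface PageInput {
  textLayer: string;
}

type Strategy = "text-layer" | "ocr";

// Try the embedded text first and fall back to OCR when the page has none.
// Setting `ocr: true` skips the text-layer attempt entirely.
function chooseStrategy(page: PageInput, options: ExtractOptions = {}): Strategy {
  if (options.ocr === true) return "ocr";
  return page.textLayer.trim().length > 0 ? "text-layer" : "ocr";
}

// Fill in assumed defaults for any option the caller left out.
function resolveOptions(options: ExtractOptions = {}): Required<ExtractOptions> {
  return {
    ocr: options.ocr ?? false,
    ocrQuality: options.ocrQuality ?? "standard",
    language: options.language ?? "eng",
  };
}

// A scanned French document: force OCR, request high quality, set the language.
const scannedFrench = resolveOptions({ ocr: true, ocrQuality: "high", language: "fra" });
```

Forcing OCR up front avoids a wasted text-layer pass on documents known to be scans, at the cost of slower processing on pages that did have selectable text.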
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-michael-laffin-pdf-text-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: file-read
Related Skills
price-tracker
Monitor product prices across Amazon, eBay, Walmart, and Best Buy to identify arbitrage opportunities and profit margins. Use when finding products to flip, monitoring competitor pricing, tracking price history, identifying arbitrage opportunities, or setting automated price alerts.
seo-article-gen
SEO-optimized article generator with automatic affiliate link integration. Generate high-ranking content with keyword research, structured data, and monetization built-in.
affiliate-master
Full-stack affiliate marketing automation for OpenClaw agents. Generate, track, and optimize affiliate links with FTC-compliant disclosures and multi-network support.
product-description-generator
Generate SEO-optimized product descriptions for e-commerce platforms (Amazon, Shopify, eBay, Etsy). Create compelling, conversion-focused copy with keywords, features, benefits, and calls-to-action. Use when creating product listings, optimizing existing descriptions, or generating bulk product copy.
review-summarizer
Scrape, analyze, and summarize product reviews from multiple platforms (Amazon, Google, Yelp, TripAdvisor). Extract key insights, sentiment analysis, pros/cons, and recommendations. Use when researching products for arbitrage, creating affiliate content, or making purchasing decisions.