pdf-ocr-extractor
Extract text from image-based or scanned PDFs using Tesseract OCR.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/bilicen700/pdf-ocr-extraction
What This Skill Does
The pdf-ocr-extractor skill extracts text from documents that lack a native text layer. Many PDFs, especially those produced by physical scanners or mobile capture apps, contain only page images rather than searchable characters. This skill bridges that gap with the Tesseract OCR engine: it renders each PDF page to a high-resolution image using pypdfium2, passes the image to pytesseract for text recognition, and concatenates the per-page results into a single searchable string. Because it runs entirely locally, it is free to use and requires no cloud API tokens, which makes it suitable for sensitive or private documents.
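The render, recognize, and concatenate steps described above can be sketched roughly as follows. This is an illustrative sketch, not the skill's actual script: the function names `ocr_pdf` and `join_pages`, the `scale` default, and the choice to skip empty pages are all assumptions. It needs the pypdfium2 and pytesseract packages plus a system Tesseract install.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Concatenate per-page OCR output, dropping pages that came back empty."""
    cleaned = [text.strip() for text in page_texts]
    return "\n\n".join(text for text in cleaned if text)


def ocr_pdf(pdf_path: str, lang: str = "chi_sim+eng", scale: float = 3.0) -> str:
    """Render each page to an image and run Tesseract on it (hypothetical helper)."""
    # Third-party imports are local so join_pages stays usable without them.
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(pdf_path)
    try:
        texts = []
        for index in range(len(pdf)):
            page = pdf[index]
            # scale=1.0 corresponds to 72 DPI, so 3.0 renders at about 216 DPI.
            image = page.render(scale=scale).to_pil()
            texts.append(pytesseract.image_to_string(image, lang=lang))
        return join_pages(texts)
    finally:
        pdf.close()
```

A call such as `ocr_pdf("./archive.pdf", lang="eng")` would return the recognized text as one string. A higher `scale` usually helps with small print, at the cost of slower processing.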
Installation
To integrate this skill into your environment, use the OpenClaw command-line interface. Run the following command in your terminal:
clawhub install openclaw/skills/skills/bilicen700/pdf-ocr-extraction
Ensure that the Tesseract binary is installed on your host system (e.g., sudo apt-get install tesseract-ocr). Additionally, confirm that the necessary language packs (like eng or chi_sim) are installed on your OS, as the script relies on the system-level installation of these data files.
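One way to verify these prerequisites before running the skill is a small pre-flight check. The sketch below is an assumption about how such a check might look, not part of the skill itself; the helper names are hypothetical. It relies only on the standard library and on the `tesseract --list-langs` command.

```python
import shutil
import subprocess
from typing import List, Set


def parse_list_langs(output: str) -> Set[str]:
    """Extract language codes from `tesseract --list-langs` output."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if line and not line.startswith("List of available languages"):
            langs.add(line)
    return langs


def missing_languages(requested: str, installed: Set[str]) -> List[str]:
    """Return the codes in a 'chi_sim+eng'-style string that are not installed."""
    return [code for code in requested.split("+") if code not in installed]


def check_tesseract(requested: str = "chi_sim+eng") -> List[str]:
    """Raise if the binary is absent; otherwise list missing language packs."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    return missing_languages(requested, parse_list_langs(result.stdout or result.stderr))
```

An empty list from `check_tesseract()` means every requested language pack was found.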
Use Cases
This skill is perfect for digitizing historical archives, extracting data from scanned receipts, or converting legacy paper forms into machine-readable text. It is particularly useful for researchers dealing with academic PDFs that were scanned as images and for professionals handling non-searchable legal or administrative paperwork.
Example Prompts
- "Please extract all the text from the scanned invoice located at /home/user/docs/invoice_001.pdf using the OCR tool."
- "I have a 50-page scanned document at ./archive.pdf. Can you perform OCR on it and summarize the contents for me?"
- "Run the PDF OCR extractor on /mnt/data/scanned_report.pdf and save the resulting text to a new file named output.txt."
Tips & Limitations
- Performance: Processing speed depends heavily on the resolution of your images and the number of pages. For very large PDFs, consider splitting the file into smaller chunks before running the skill.
- Language Support: The script currently defaults to chi_sim+eng. You may need to adjust the language parameter in the script to match the languages in your source documents.
- Cleanup: The tool writes temporary files to /tmp/ and cleans them up automatically; ensure your environment allows write access to this directory.
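Two of the tips above lend themselves to a short illustration: processing a large PDF in page-range chunks, and confirming that the temporary directory is writable. The helpers below are hypothetical and use only the standard library; the default chunk size of 10 is an arbitrary assumption, and the skill itself may handle both concerns differently.

```python
import tempfile
from typing import Iterator, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) page-index ranges, with stop exclusive."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, total_pages, chunk_size):
        yield start, min(start + chunk_size, total_pages)


def temp_dir_is_writable(directory: str = "/tmp") -> bool:
    """Return True if a scratch file can be created (and removed) in directory."""
    try:
        # The file is deleted automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory):
            return True
    except OSError:
        return False
```

For the 50-page document in the example prompts, `page_chunks(50, 20)` yields the ranges (0, 20), (20, 40), and (40, 50), each of which could be processed separately.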
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-bilicen700-pdf-ocr-extraction": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Flags: file-read, file-write, code-execution