What This Skill Does

The pdf-ocr-extractor skill extracts text from PDFs that lack a native text layer. Many PDFs, especially those produced by physical scanners or mobile capture apps, contain only page images rather than searchable characters. The skill renders each page to a high-resolution image with pypdfium2, runs that image through the Tesseract OCR engine via pytesseract, and concatenates the recognized text into a single searchable string. Because it runs locally, it is free to use and needs no cloud API tokens, which makes it suitable for sensitive or private documents.
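The render, recognize, and concatenate steps described above can be sketched roughly as follows. This is a minimal illustration and not the skill's actual source: the function names ocr_pdf and join_pages, the 300 DPI default, and the blank-line page separator are assumptions made for the example. The OCR libraries are imported lazily so that the text-joining helper can be used on its own.

```python
from typing import Iterable, List


def join_pages(page_texts: Iterable[str]) -> str:
    """Concatenate per-page OCR output into one string, skipping empty pages."""
    cleaned = [text.strip() for text in page_texts]
    return "\n\n".join(text for text in cleaned if text)


def ocr_pdf(pdf_path: str, lang: str = "chi_sim+eng", dpi: int = 300) -> str:
    """Render each page with pypdfium2 and recognize it with pytesseract.

    Hypothetical helper: the name, defaults, and structure are assumptions.
    """
    # Imported here so join_pages() works without the OCR stack installed.
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(pdf_path)
    page_texts: List[str] = []
    try:
        for index in range(len(pdf)):
            page = pdf[index]
            # PDF user space is 72 units per inch, so scale = dpi / 72.
            image = page.render(scale=dpi / 72).to_pil()
            page_texts.append(pytesseract.image_to_string(image, lang=lang))
    finally:
        pdf.close()
    return join_pages(page_texts)
```

Calling ocr_pdf("/home/user/docs/invoice_001.pdf") would return the recognized text as one string, provided Tesseract and the requested language data are installed.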

Installation

To integrate this skill into your environment, use the OpenClaw command-line interface. Run the following command in your terminal:

clawhub install openclaw/skills/skills/bilicen700/pdf-ocr-extraction

Ensure that the Tesseract binary is installed on the host system (e.g., sudo apt-get install tesseract-ocr). Also confirm that the language packs you need (such as eng or chi_sim) are installed at the OS level, because the script relies on the system-installed Tesseract data files.
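One way to verify these prerequisites before running the skill is a small pre-flight check like the sketch below. It assumes the tesseract executable is on PATH and relies on its --list-langs option. The helper names (parse_lang_list, missing_languages, check_tesseract) are hypothetical and not part of the skill.

```python
import shutil
import subprocess
from typing import List, Set


def parse_lang_list(output: str) -> Set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # The first line is a header such as 'List of available languages ... (2):'.
    return {line for line in lines if not line.lower().startswith("list of")}


def missing_languages(lang_spec: str, available: Set[str]) -> List[str]:
    """Return the codes in a spec like 'chi_sim+eng' that are not installed."""
    return [code for code in lang_spec.split("+") if code not in available]


def check_tesseract(lang_spec: str = "chi_sim+eng") -> List[str]:
    """Raise if the binary is absent; otherwise list missing language packs."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Depending on the version, the list may go to stdout or stderr.
    available = parse_lang_list(result.stdout + "\n" + result.stderr)
    return missing_languages(lang_spec, available)
```

An empty list from check_tesseract() means every requested language pack was found.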

Use Cases

This skill is well suited to digitizing historical archives, extracting data from scanned receipts, and converting legacy paper forms into machine-readable text. It is particularly useful for researchers working with academic PDFs that were scanned as images and for professionals handling non-searchable legal or administrative paperwork.

Example Prompts

"Please extract all the text from the scanned invoice located at /home/user/docs/invoice_001.pdf using the OCR tool."
"I have a 50-page scanned document at ./archive.pdf. Can you perform OCR on it and summarize the contents for me?"
"Run the PDF OCR extractor on /mnt/data/scanned_report.pdf and save the resulting text to a new file named output.txt."

Tips & Limitations

Performance: Processing speed depends heavily on the resolution of your images and the number of pages. For very large PDFs, consider splitting the file into smaller chunks before running the skill.
Language Support: The script currently defaults to chi_sim+eng (Simplified Chinese plus English). Adjust the language parameter in the script to match the languages in your source documents.
Cleanup: The tool is designed to write temporary files to /tmp/ and will automatically clean them up; ensure your environment allows write access to this directory.
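For the performance tip above, one option is to process a large PDF in fixed-size page ranges rather than all at once. The sketch below only computes the ranges. The function name page_chunks and the default of 10 pages per chunk are assumptions for illustration, not settings of the skill.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, chunk_size: int = 10) -> List[Tuple[int, int]]:
    """Split a page count into (start, stop) index ranges, stop exclusive."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


# A 50-page scan split into chunks of 20 pages.
for start, stop in page_chunks(50, chunk_size=20):
    print(f"pages {start + 1}-{stop}")
```

Each range can then be rendered and recognized separately, which keeps memory use bounded and lets a failed chunk be retried without redoing the whole document.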
