What This Skill Does

The PDF OCR using Gemini LLM skill lets OpenClaw agents run optical character recognition (OCR) on PDF documents using Google Gemini's multimodal vision capabilities. Traditional OCR tools often struggle with complex layouts, skewed scans, or handwritten notes; this skill instead processes each page of a PDF as an image so that Gemini can interpret text, tables, and formatting. The tool splits multi-page PDFs into individual pages, uploads them to the Google API, and returns the extracted content as readable text or structured JSON for further processing.
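The flow described above (split into pages, recognize each page, combine the results) can be sketched roughly as follows. This is an illustrative sketch only, not the skill's actual code: `ocr_pages`, `recognize`, `to_plain_text`, `to_json`, and the JSON layout are hypothetical names chosen here, and the `recognize` callable stands in for whatever step sends a page image to Gemini.

```python
import json
from typing import Callable, Iterable, List, Optional


def ocr_pages(
    pages: Iterable[bytes],
    recognize: Callable[[bytes], str],
    max_pages: Optional[int] = None,
) -> List[dict]:
    """Run a per-page recognizer over page images and collect the text.

    `recognize` is a hypothetical stand-in for the call that sends one
    page image to the vision model and returns its text.
    """
    results = []
    for number, image in enumerate(pages, start=1):
        if max_pages is not None and number > max_pages:
            break  # stop early once the page limit is reached
        results.append({"page": number, "text": recognize(image)})
    return results


def to_plain_text(results: List[dict]) -> str:
    """Join per-page text into one readable string."""
    return "\n\n".join(r["text"] for r in results)


def to_json(results: List[dict]) -> str:
    """Serialize per-page results; this layout is an assumption."""
    return json.dumps({"pages": results}, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    # Fake recognizer so the sketch runs without any network access.
    fake_pages = [b"page-1-image", b"page-2-image", b"page-3-image"]
    fake_recognize = lambda image: f"text from {image.decode()}"
    results = ocr_pages(fake_pages, fake_recognize, max_pages=2)
    print(to_plain_text(results))
    print(to_json(results))
```

Passing the recognizer in as a callable keeps the page loop testable offline; a real recognizer would wrap the Gemini request.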

Installation

To integrate this skill into your environment, navigate to your OpenClaw directory and run:

    clawhub install openclaw/skills/skills/ashtonizmev/geminipdfocr

Once installed, set up your local workspace by creating a virtual environment inside the skill folder and installing the dependencies:

    cd geminipdfocr && python -m venv venv && source venv/bin/activate && pip install -r requirements.txt

Finally, make sure GOOGLE_API_KEY is exported in your environment variables so that the API requests are authorized.
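Before running the skill, it can help to confirm that the key is actually visible to Python. The snippet below is a minimal sketch under that assumption; `require_api_key` is a hypothetical helper written for this example and is not part of the skill.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GOOGLE_API_KEY from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("GOOGLE_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GOOGLE_API_KEY is not set; export it before running the skill."
        )
    return key


if __name__ == "__main__":
    try:
        require_api_key()
        print("GOOGLE_API_KEY found")
    except RuntimeError as err:
        print(err)
```

Accepting an optional mapping instead of reading `os.environ` directly makes the check easy to exercise without touching the real environment.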

Use Cases

This skill is ideal for digitizing physical paperwork, processing legacy invoices, extracting data from scanned reports, or converting non-selectable text PDFs into actionable machine-readable formats. It is particularly effective for documents containing mixed elements like diagrams, handwritten annotations, and standard text blocks that would otherwise require manual entry.

Example Prompts

"OpenClaw, perform OCR on the scanned invoice located at /documents/invoices/inv_2023_09.pdf and save the results as a JSON file."
"Extract the text from the first five pages of /downloads/research_paper.pdf to help me summarize the findings."
"Run an OCR scan on /uploads/handwritten_notes.pdf and give me the output in a clean, plain text format."

Tips & Limitations

To manage costs and processing time, use the --max-pages flag when testing or working with exceptionally large documents. Remember that this tool sends file content to an external API; do not process highly sensitive or private information without ensuring compliance with your internal data security policies. For best results, ensure your PDF files are not password-protected before attempting to process them. Use the --json flag if you intend to pipe the output into other programmatic workflows or data analysis tools for post-processing.
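As a rough pre-flight check for the password-protection tip, one option is to look for an `/Encrypt` entry in the file, which encrypted PDFs carry in their trailer. This is a heuristic sketch, not something the skill is known to do: `looks_encrypted` and `file_looks_encrypted` are hypothetical helpers, and a plain byte search can give false positives if the same token happens to appear elsewhere in the file.

```python
from pathlib import Path


def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Heuristic: encrypted PDFs reference an /Encrypt dictionary."""
    return b"/Encrypt" in pdf_bytes


def file_looks_encrypted(path: str) -> bool:
    """Read a PDF from disk and apply the same heuristic."""
    return looks_encrypted(Path(path).read_bytes())


if __name__ == "__main__":
    # Tiny hand-written trailer fragments, used only to exercise the check.
    protected = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>"
    unprotected = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>"
    print(looks_encrypted(protected))
    print(looks_encrypted(unprotected))
```

A file flagged this way would need its password removed with a PDF tool of your choice before it is passed to the skill.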
