Extract PDF Text

Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ivangdavila/extract-pdf-text

When to Use

Use this skill when an agent needs to extract text from PDFs. PyMuPDF (fitz) provides fast local extraction and works with text-based documents, scanned pages (via OCR), forms, and complex layouts.

Quick Reference

Topic	File
Code examples	`examples.md`
OCR setup	`ocr.md`
Troubleshooting	`troubleshooting.md`

Core Rules

1. Install PyMuPDF First

pip install PyMuPDF

Import as fitz (historical name):

import fitz  # PyMuPDF

2. Basic Text Extraction

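import fitz  # PyMuPDF

# The context manager closes the document even if extraction fails.
with fitz.open("document.pdf") as doc:
    # Join the per-page text instead of growing a string in a loop.
    text = "".join(page.get_text() for page in doc)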

3. Pick the Right Method

PDF Type	Method
Text-based	`page.get_text()` — fast, accurate
Scanned	OCR with pytesseract — slower
Mixed	Check each page, use OCR when needed

4. Check for Text Before OCR

def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
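
Rules 3 and 4 can be combined into a per-page dispatcher. The following is a minimal sketch, assuming `pytesseract` and Pillow are installed and the Tesseract binary is on the PATH. The names `ocr_page` and `extract_page_text` and the `min_chars` and `dpi` parameters are illustrative, not part of PyMuPDF.

```python
import io


def needs_ocr(page, min_chars=50):
    """Return True when the page has almost no text layer (likely a scan)."""
    return len(page.get_text().strip()) < min_chars


def ocr_page(page, dpi=300):
    """Render the page to an image and run Tesseract on it."""
    # Imported lazily so text-only PDFs work without the OCR dependencies.
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # rasterize the page
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    return pytesseract.image_to_string(image)


def extract_page_text(page, ocr=ocr_page):
    """Use the text layer when present, otherwise fall back to OCR."""
    return ocr(page) if needs_ocr(page) else page.get_text()
```

Passing the OCR function as a parameter keeps the check cheap for text-based pages and makes it easy to swap in a different OCR backend.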

5. Handle Errors Gracefully

try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising; check needs_pass and authenticate.
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")

Extraction Traps

Trap	What Happens	Fix
OCR on text PDF	Slow + worse accuracy	Check `get_text()` first
Forget to close doc	Memory leak	Use `with` or `doc.close()`
Assume page order	Wrong reading flow	Use `sort=True` in get_text()
Ignore encoding	Garbled characters	`get_text()` returns Unicode `str`; write files with `encoding="utf-8"`
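
To illustrate the reading-order trap: `page.get_text("blocks")` returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)` in the order they are stored in the file, which need not match the visual order. The sketch below sorts such tuples top to bottom, then left to right, which is roughly what `sort=True` does. `in_reading_order` is a hypothetical helper and the sample tuples are made up.

```python
def in_reading_order(blocks):
    """Sort block tuples by top edge (y0), then left edge (x0)."""
    return sorted(blocks, key=lambda b: (b[1], b[0]))


# Made-up blocks stored out of visual order: footer first, title last.
blocks = [
    (72, 700, 300, 720, "Footer", 0, 0),
    (72, 100, 300, 140, "Body", 1, 0),
    (72, 50, 300, 80, "Title", 2, 0),
]

print([b[4] for b in in_reading_order(blocks)])  # ['Title', 'Body', 'Footer']
```

In practice, passing `sort=True` is simpler. Multi-column pages may still need column-aware handling, since a plain top-to-bottom sort interleaves the columns.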

Scope

This skill provides instructions for using PyMuPDF to extract PDF text.

This skill ONLY:

Gives code examples for PyMuPDF
Explains OCR setup when needed
Troubleshoots common issues

This skill NEVER:

Accesses files without user request
Sends data externally
Modifies original PDFs

Security & Privacy

All processing is local:

PyMuPDF runs entirely on your machine
No external API calls
No data leaves your system

Output Formats

Plain Text

text = page.get_text()

Structured (dict)

blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
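
Span-level font sizes make it possible to pick out headings. A minimal sketch, assuming headings are simply spans set at or above a chosen size. `find_headings` and the `min_size` threshold are illustrative, and the sample dictionary only mimics the shape of `page.get_text("dict")`.

```python
def find_headings(page_dict, min_size=14.0):
    """Collect the text of spans whose font size is at least min_size."""
    headings = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if text and span["size"] >= min_size:
                    headings.append(text)
    return headings


# Hand-written sample in the shape of page.get_text("dict").
sample = {
    "blocks": [
        {"type": 0, "lines": [{"spans": [{"text": "Introduction", "size": 18.0}]}]},
        {"type": 0, "lines": [{"spans": [{"text": "Body text.", "size": 10.5}]}]},
        {"type": 1},  # image block, no "lines" key
    ]
}

print(find_headings(sample))  # ['Introduction']
```

With a real page, the call would be `find_headings(page.get_text("dict"))`; a suitable threshold depends on the document's typography.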

JSON

import json
data = page.get_text("json")
parsed = json.loads(data)

Full Example

import fitz
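
The sketch below ties the core rules together under the assumptions already stated in this document (PyMuPDF imported as `fitz`, a 50-character threshold for scanned pages). It is not the author's original example. `extract_text`, `extract_pdf_text`, and the optional `ocr` callback are hypothetical names.

```python
def extract_text(doc, ocr=None, min_chars=50):
    """Extract text page by page from an open document.

    `doc` is an open fitz.Document (any iterable of pages works).
    `ocr` is an optional callable taking a page and returning a string;
    it is used only for pages with almost no text layer.
    """
    parts = []
    for page in doc:
        text = page.get_text(sort=True)  # sort=True for natural reading order
        if ocr is not None and len(text.strip()) < min_chars:
            text = ocr(page)
        parts.append(text)
    return "\f".join(parts)  # form feed between pages


def extract_pdf_text(path, password=None, ocr=None):
    """Open a PDF, authenticate if needed, and return its text."""
    # Imported here so extract_text can be used and tested without PyMuPDF.
    import fitz

    with fitz.open(path) as doc:  # closed automatically on exit
        if doc.needs_pass and not doc.authenticate(password or ""):
            raise ValueError("PDF is encrypted and the password was rejected")
        return extract_text(doc, ocr=ocr)
```

A typical call would be `text = extract_pdf_text("document.pdf")`. A corrupted file still raises `fitz.FileDataError` from `fitz.open`, so callers can handle it as in rule 5.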

Metadata

Author: @ivangdavila
Stars: 2102
Updated: 2026-03-06

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ivangdavila-extract-pdf-text": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.
