Extract PDF Text
Extract text from PDF files using PyMuPDF. Parse tables, forms, and complex layouts. Supports OCR for scanned documents.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/ivangdavila/extract-pdf-text
When to Use
Agent needs to extract text from PDFs. Use PyMuPDF (fitz) for fast local extraction. Works with text-based documents, scanned pages with OCR, forms, and complex layouts.
Quick Reference
| Topic | File |
|---|---|
| Code examples | examples.md |
| OCR setup | ocr.md |
| Troubleshooting | troubleshooting.md |
Core Rules
1. Install PyMuPDF First
pip install PyMuPDF
Import as fitz (historical name):
import fitz # PyMuPDF
2. Basic Text Extraction
import fitz
doc = fitz.open("document.pdf")
text = ""
for page in doc:
    text += page.get_text()
doc.close()
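The same loop can be written with a `with` block, which closes the document even if extraction raises partway through. This is a minimal sketch, assuming `fitz.open` is used as a context manager; `join_pages` and `extract_text` are hypothetical helper names, not part of PyMuPDF.

```python
def join_pages(pages, sep="\n"):
    """Join the text of page-like objects (anything with a get_text() method)."""
    # str.join avoids rebuilding the whole string on every += in a loop
    return sep.join(page.get_text() for page in pages)


def extract_text(path):
    """Open a PDF, extract the text of every page, and close the file."""
    import fitz  # PyMuPDF; imported here so join_pages works without it installed

    with fitz.open(path) as doc:
        return join_pages(doc)
```

Typical use would be `text = extract_text("document.pdf")`.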
3. Pick the Right Method
| PDF Type | Method |
|---|---|
| Text-based | page.get_text() — fast, accurate |
| Scanned | OCR with pytesseract — slower |
| Mixed | Check each page, use OCR when needed |
4. Check for Text Before OCR
def needs_ocr(page):
    text = page.get_text().strip()
    return len(text) < 50  # Likely scanned if very little text
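For mixed documents, the check above can drive a per-page fallback. This is a sketch, assuming pytesseract and Pillow are installed along with the Tesseract binary; `extract_page_text` and `ocr_with_tesseract` are hypothetical helper names, and the 300 dpi render resolution is an illustrative choice.

```python
def extract_page_text(page, ocr_fn, min_chars=50):
    """Return the embedded text layer, or fall back to ocr_fn(page) if it is too short."""
    text = page.get_text().strip()
    if len(text) >= min_chars:
        return text
    return ocr_fn(page)


def ocr_with_tesseract(page, dpi=300):
    """Render a PyMuPDF page to an image and run Tesseract on it."""
    # Imported lazily so the routing logic above has no hard OCR dependency
    import pytesseract
    from PIL import Image

    pix = page.get_pixmap(dpi=dpi)  # RGB pixmap without alpha by default
    img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    return pytesseract.image_to_string(img)
```

Passing the OCR step in as a function keeps the fast path free of OCR imports and makes it easy to swap in a different engine.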
5. Handle Errors Gracefully
try:
    doc = fitz.open(path)
except fitz.FileDataError:
    print("Invalid or corrupted PDF")
else:
    # Encrypted PDFs open without raising: check needs_pass, then authenticate
    if doc.needs_pass and not doc.authenticate("secret"):
        print("Wrong password")
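Password handling can be kept separate from opening the file. This is a minimal sketch, assuming encrypted documents are unlocked after opening through the document's `needs_pass` attribute and `authenticate()` method; `unlock` is a hypothetical helper that accepts any object exposing those two members.

```python
def unlock(doc, password=None):
    """Return True if the document is readable, authenticating first if needed."""
    if not doc.needs_pass:
        return True  # not encrypted, or already authenticated
    if password is None:
        return False  # encrypted and no password supplied
    # authenticate() returns a non-zero value on success and 0 on failure
    return bool(doc.authenticate(password))
```

A caller might skip the file, or ask the user for a password, when `unlock(doc)` returns `False`.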
Extraction Traps
| Trap | What Happens | Fix |
|---|---|---|
| OCR on text PDF | Slow + worse accuracy | Check get_text() first |
| Forget to close doc | Memory leak | Use a with block or doc.close() |
| Assume blocks come in reading order | Wrong reading flow | Use sort=True in get_text() |
| Ignore encoding | Garbled characters | get_text() returns Unicode str; write files with encoding="utf-8" |
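The reading-order trap can be illustrated by hand. The sketch below assumes `page.get_text("blocks")` yields tuples laid out as `(x0, y0, x1, y1, text, block_no, block_type)` and orders text blocks top to bottom, then left to right, which is roughly what `sort=True` requests. `blocks_in_reading_order` is a hypothetical helper name.

```python
def blocks_in_reading_order(blocks):
    """Return the text of text blocks sorted by vertical, then horizontal position."""
    text_blocks = [b for b in blocks if b[6] == 0]  # block_type 0 = text, 1 = image
    ordered = sorted(text_blocks, key=lambda b: (b[1], b[0]))  # (y0, x0)
    return [b[4].strip() for b in ordered]
```

A plain top-to-bottom sort interleaves the columns of a multi-column page, so such layouts may need the blocks grouped by column first.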
Scope
This skill provides instructions for using PyMuPDF to extract PDF text.
This skill ONLY:
- Gives code examples for PyMuPDF
- Explains OCR setup when needed
- Troubleshoots common issues
This skill NEVER:
- Accesses files without user request
- Sends data externally
- Modifies original PDFs
Security & Privacy
All processing is local:
- PyMuPDF runs entirely on your machine
- No external API calls
- No data leaves your system
Output Formats
Plain Text
text = page.get_text()
Structured (dict)
blocks = page.get_text("dict")["blocks"]
for b in blocks:
    if b["type"] == 0:  # text block
        for line in b["lines"]:
            for span in line["spans"]:
                print(span["text"], span["size"])
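The span sizes in the dict output can be used to pick out likely headings. This is a sketch that works on the dictionary returned by `page.get_text("dict")`; `spans_with_size` is a hypothetical helper, and the size threshold is something to tune per document.

```python
def spans_with_size(page_dict, min_size):
    """Collect the text of spans whose font size is at least min_size."""
    found = []
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks, which carry no "lines"
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                text = span["text"].strip()
                if text and span["size"] >= min_size:
                    found.append(text)
    return found
```

For example, `spans_with_size(page.get_text("dict"), 14)` would return the spans set at 14 pt or larger.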
JSON
import json
data = page.get_text("json")
parsed = json.loads(data)
Full Example
import fitz
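The full example is cut off after the import in this copy. As a stand-in, and not the author's original, here is a minimal end-to-end sketch that extracts every page in sorted order and writes the result as UTF-8. `format_pages`, `pdf_to_text_file`, and the page separator format are all hypothetical.

```python
from pathlib import Path


def format_pages(page_texts):
    """Join page texts, prefixing each with a numbered separator line."""
    parts = []
    for number, text in enumerate(page_texts, start=1):
        parts.append(f"--- page {number} ---\n{text.strip()}")
    return "\n\n".join(parts)


def pdf_to_text_file(pdf_path, out_path):
    """Extract text from pdf_path into out_path and return the page count."""
    import fitz  # PyMuPDF; imported here so format_pages works without it installed

    with fitz.open(pdf_path) as doc:  # closed automatically on exit
        texts = [page.get_text(sort=True) for page in doc]
    Path(out_path).write_text(format_pages(texts), encoding="utf-8")
    return len(texts)
```

Typical use would be `pdf_to_text_file("document.pdf", "document.txt")`. The source PDF is only read, never modified.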
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-ivangdavila-extract-pdf-text": {
      "enabled": true,
      "auto_update": true
    }
  }
}