pdf-miner
Extract text and tables from PDF files with robust support for global market data formats (currencies, percentages, units). Use when: (1) User asks to read/extract content from a PDF file, (2) User needs text or tables from industry reports, research papers, or financial documents, (3) web_fetch or scrapling fail on a PDF. Supports: keyword search, metrics extraction, table of contents detection, PDF diff/comparison, LLM chunk splitting, batch processing, header/footer cleaning. NOT for: OCR on scanned image-based PDFs, editing/merging PDFs, or creating new PDFs.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/baichenwzj/pdf-miner
PDF Miner Skill
Extract text and tables from PDF files using pdfplumber, with handling for global market data formats (currencies, percentages, units).
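As a rough illustration of the text-and-table extraction described above, here is a minimal sketch built on pdfplumber's `open`, `extract_text`, and `extract_tables` calls. The helper names (`table_to_markdown`, `render_pages`, `extract_pdf`) and the Markdown layout are assumptions made for this example; they are not taken from `scripts/extract_pdf.py`, whose output format may differ.

```python
from typing import Iterable, List, Optional

# A table as pdfplumber returns it: rows of cells, where a cell may be None.
Table = List[List[Optional[str]]]


def table_to_markdown(table: Table) -> str:
    """Render one extracted table as Markdown; None cells become empty strings."""
    rows = [
        [(cell or "").replace("\n", " ").strip() for cell in row]
        for row in table
        if row
    ]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = [
        "| " + " | ".join(rows[0]) + " |",
        "| " + " | ".join(["---"] * width) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)


def render_pages(pages: Iterable) -> str:
    """Build Markdown from objects exposing extract_text() and extract_tables()."""
    parts = []
    for number, page in enumerate(pages, start=1):
        parts.append(f"## Page {number}")
        text = page.extract_text() or ""  # tolerate None for pages without text
        if text.strip():
            parts.append(text.strip())
        for table in page.extract_tables():
            rendered = table_to_markdown(table)
            if rendered:
                parts.append(rendered)
    return "\n\n".join(parts)


def extract_pdf(path: str) -> str:
    """Hypothetical entry point: open a PDF and render every page."""
    import pdfplumber  # imported lazily so the helpers above work without it

    with pdfplumber.open(path) as pdf:
        return render_pages(pdf.pages)
```

Calling `extract_pdf("input.pdf")` would return one Markdown string with a section per page. A page with no text layer contributes only its heading, which is one way to spot scanned pages.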
Prerequisites
python -m pip install pdfplumber
For OCR capabilities (scanned/image PDFs), also install:
python -m pip install pymupdf openai
Initial Setup for OCR
Before using --ocr, you must provide a vision API credential. There are three ways:
- Environment variables (recommended for temporary use):
  export OCR_API_KEY="your-openrouter-api-key"
  export OCR_MODEL="qwen/qwen3.6-plus:free" # optional
  export OCR_BASE_URL="https://openrouter.ai/api/v1" # optional
- Config file (persistent, skill-specific):
  Create skills/skills/pdf-miner/config.json with:
  {
    "vision_api_key": "your-openrouter-api-key",
    "vision_model": "qwen/qwen3.6-plus:free",
    "vision_base_url": "https://openrouter.ai/api/v1"
  }
- Command-line arguments (override per invocation):
  python scripts/extract_pdf.py scanned.pdf --ocr --ocr-api-key "sk-..." --ocr-model "stepfun/step-3.5-flash:free"
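The three credential sources above imply a lookup order. The sketch below shows one plausible way to resolve them. Only "command-line arguments override" is stated in this document. The rest is an assumption for illustration: the order of environment variables before the config file, the fallback values (copied from the examples above), and the function name `resolve_ocr_settings`. The real script may behave differently.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Assumed fallbacks, copied from the example values in this document.
DEFAULTS = {
    "api_key": None,
    "model": "qwen/qwen3.6-plus:free",
    "base_url": "https://openrouter.ai/api/v1",
}
# Names taken from the environment-variable and config-file examples.
ENV_NAMES = {"api_key": "OCR_API_KEY", "model": "OCR_MODEL", "base_url": "OCR_BASE_URL"}
CONFIG_NAMES = {
    "api_key": "vision_api_key",
    "model": "vision_model",
    "base_url": "vision_base_url",
}


def resolve_ocr_settings(
    cli: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
    config_path: Optional[Path] = None,
) -> dict:
    """Pick each setting from CLI, then environment, then config file, then defaults."""
    config = {}
    if config_path is not None and config_path.is_file():
        config = json.loads(config_path.read_text(encoding="utf-8"))

    settings = {}
    for field in DEFAULTS:
        settings[field] = (
            cli.get(field)
            or env.get(ENV_NAMES[field])
            or config.get(CONFIG_NAMES[field])
            or DEFAULTS[field]
        )

    if not settings["api_key"]:
        raise ValueError(
            "No OCR API key found: pass --ocr-api-key, set OCR_API_KEY, "
            "or add vision_api_key to config.json"
        )
    return settings
```

For example, `resolve_ocr_settings({"model": "stepfun/step-3.5-flash:free"}, config_path=Path("config.json"))` would take the model from the command line and fall back to the environment or the config file for the key.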
Usage
Run commands from this skill directory.
Basic Extraction
# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf
# Output to custom path
python scripts/extract_pdf.py input.pdf output.md
# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20
# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
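The `--pages 1-5,10,15-20` syntax above mixes single pages and inclusive ranges. A minimal sketch of how such a specification could be expanded into page numbers follows. The function name `parse_page_spec` and its validation rules are assumptions for illustration, not the script's actual parser.

```python
from typing import List


def parse_page_spec(spec: str) -> List[int]:
    """Expand a spec such as '1-5,10,15-20' into sorted, unique 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)
```

When these numbers index pdfplumber's `pdf.pages` list, subtract 1, since that list is 0-based while the numbers written on the command line are 1-based.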
Advanced Modes
# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"
# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"
# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity (default: 3 entries per page required)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2
# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff
# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks # separate files
# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers
# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
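The `--chunk` mode above caps each chunk at 8000 characters unless `--max-chars` says otherwise. This document does not describe how the script chooses split points. The sketch below assumes a simple strategy: pack whole paragraphs into a chunk until the limit is reached, and hard-split only a paragraph that exceeds the limit on its own. The function name `split_into_chunks` is hypothetical.

```python
from typing import List


def split_into_chunks(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into chunks of at most max_chars, preferring paragraph boundaries."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")

    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # A paragraph longer than the limit is cut into fixed-size pieces.
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the open chunk
        else:
            chunks.append(current)  # close the open chunk and start a new one
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Splitting on paragraph boundaries keeps sentences and table rows intact where possible, which tends to matter more for downstream LLM prompts than hitting the character limit exactly.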
OCR for Scanned/Image PDFs (Automatic by Default)
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-baichenwzj-pdf-miner": {
      "enabled": true,
      "auto_update": true
    }
  }
}