PDF Miner Skill

Extract text and tables from PDF files using pdfplumber (global market formats).

Prerequisites

python -m pip install pdfplumber

For OCR capabilities (scanned/image PDFs), also install:

python -m pip install pymupdf openai

Initial Setup for OCR

Before using --ocr, provide a vision API credential in one of three ways:

Environment variables (recommended for temporary use):

export OCR_API_KEY="your-openrouter-api-key"
export OCR_MODEL="qwen/qwen3.6-plus:free"   # optional
export OCR_BASE_URL="https://openrouter.ai/api/v1"   # optional

Config file (persistent, skill-specific):
Create skills/skills/pdf-miner/config.json with:

{
  "vision_api_key": "your-openrouter-api-key",
  "vision_model": "qwen/qwen3.6-plus:free",
  "vision_base_url": "https://openrouter.ai/api/v1"
}

Command-line arguments (override per invocation):

python scripts/extract_pdf.py scanned.pdf --ocr --ocr-api-key "sk-..." --ocr-model "stepfun/step-3.5-flash:free"
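How the three sources combine is not spelled out above beyond command-line arguments overriding per invocation. The following is a minimal sketch of one plausible resolution order (command line, then environment, then config file, then defaults). The function name resolve_ocr_settings, the env-before-config ordering, and the use of the example values as defaults are assumptions for illustration, not the script's confirmed behavior.

```python
import json
import os
from pathlib import Path

# Fallback values copied from the examples above. Whether extract_pdf.py
# uses these same defaults is an assumption.
DEFAULTS = {
    "api_key": None,
    "model": "qwen/qwen3.6-plus:free",
    "base_url": "https://openrouter.ai/api/v1",
}

# Names as documented above for each source.
ENV_NAMES = {"api_key": "OCR_API_KEY", "model": "OCR_MODEL", "base_url": "OCR_BASE_URL"}
CONFIG_KEYS = {"api_key": "vision_api_key", "model": "vision_model", "base_url": "vision_base_url"}


def resolve_ocr_settings(cli_args=None, env=None, config_path="config.json"):
    """Hypothetical helper: merge CLI, environment, and config-file settings.

    Assumed precedence: CLI > environment > config file > defaults.
    """
    cli_args = cli_args or {}
    env = os.environ if env is None else env

    # A missing config file is treated as an empty one.
    path = Path(config_path)
    file_cfg = json.loads(path.read_text(encoding="utf-8")) if path.is_file() else {}

    settings = {}
    for name, default in DEFAULTS.items():
        settings[name] = (
            cli_args.get(name)
            or env.get(ENV_NAMES[name])
            or file_cfg.get(CONFIG_KEYS[name])
            or default
        )

    if not settings["api_key"]:
        raise ValueError("No OCR API key found in CLI arguments, environment, or config file")
    return settings
```

For example, resolve_ocr_settings({"model": "stepfun/step-3.5-flash:free"}) would take the model from the command line and look up the key in OCR_API_KEY or config.json.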

Usage

Run commands from this skill directory.

Basic Extraction

# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf

# Output to custom path
python scripts/extract_pdf.py input.pdf output.md

# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20

# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
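The --pages value shown above ("1-5,10,15-20") mixes single pages and inclusive ranges. As a rough sketch of how such a spec can be expanded, the hypothetical parse_page_spec below assumes 1-based page numbers, inclusive ranges, and silent clipping to the document length. None of these details is confirmed for extract_pdf.py.

```python
def parse_page_spec(spec, page_count=None):
    """Expand a spec like "1-5,10,15-20" into a sorted list of page numbers.

    Assumptions: pages are 1-based, ranges are inclusive, and pages beyond
    page_count (when given) are dropped rather than reported as errors.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
        else:
            start = end = int(part)
        if start < 1 or end < start:
            raise ValueError(f"Invalid page range: {part!r}")
        pages.update(range(start, end + 1))

    if page_count is not None:
        pages = {p for p in pages if p <= page_count}
    return sorted(pages)


if __name__ == "__main__":
    print(parse_page_spec("1-5,10,15-20"))
```

When using the result with pdfplumber directly, note that pdf.pages is a zero-based list, so page n is pdf.pages[n - 1].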

Advanced Modes

# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"

# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"

# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity: minimum number of TOC entries a page must contain (default: 3)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2

# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff

# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk             # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks   # separate files

# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers

# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
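To illustrate what the --chunk mode above produces, here is a minimal sketch of character-bounded chunking. The function name chunk_text and the strategy (pack whole paragraphs, hard-split only a paragraph that exceeds the limit by itself) are assumptions. The script may split on different boundaries.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of at most max_chars characters.

    Assumed strategy: keep paragraphs (separated by blank lines) together
    where possible, and hard-split any single paragraph that is too long.
    """
    if max_chars < 1:
        raise ValueError("max_chars must be positive")

    chunks = []
    current = ""
    for para in text.split("\n\n"):
        if not para.strip():
            continue  # skip empty paragraphs left by repeated blank lines

        # A paragraph longer than the limit cannot be packed; cut it directly.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]

        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para

    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n\n" + "x" * 50
    for i, chunk in enumerate(chunk_text(sample, max_chars=40), start=1):
        print(i, len(chunk))
```

Splitting on paragraph boundaries keeps sentences intact for downstream LLM prompts, at the cost of chunks that are often shorter than the limit.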
