Official Verified

pdf-miner

Extract text and tables from PDF files, with robust support for global market data formats (currencies, percentages, units).

Use when: (1) the user asks to read or extract content from a PDF file, (2) the user needs text or tables from industry reports, research papers, or financial documents, (3) web_fetch or scrapling fails on a PDF.

Supports: keyword search, metrics extraction, table of contents detection, PDF diff/comparison, LLM chunk splitting, batch processing, header/footer cleaning.

NOT for: OCR on scanned image-based PDFs, editing/merging PDFs, or creating new PDFs.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/baichenwzj/pdf-miner

PDF Miner Skill

Extract text and tables from PDF files using pdfplumber (global market formats).
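To make "global market formats" concrete, here is a minimal sketch of how currency amounts, percentages, and unit-suffixed numbers might be recognized in an extracted line. The pattern, the suffix list, and the find_values helper are illustrative assumptions; they are not taken from the skill's source and will likely differ from what extract_pdf.py actually does.

```python
import re

# Assumed pattern (hypothetical, not the skill's own): an optional currency
# symbol, a number with optional thousands separators and decimals, and an
# optional percent sign or magnitude/unit suffix.
NUMERIC_VALUE = re.compile(
    r"(?P<currency>[$€£¥])?"
    r"(?P<number>\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?)"
    r"(?:\s?(?P<suffix>%|(?:billion|million|bn|mn|kg|km)\b))?"
)


def find_values(line: str) -> list[dict]:
    """Return one dict per numeric value found, omitting parts that are absent."""
    return [
        {key: value for key, value in match.groupdict().items() if value is not None}
        for match in NUMERIC_VALUE.finditer(line)
    ]


if __name__ == "__main__":
    for value in find_values("Revenue grew 12.5% to $1,234.5 million"):
        print(value)
```

A pattern this loose also matches digits embedded in identifiers, so a real implementation would probably add context checks around each match.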

Prerequisites

python -m pip install pdfplumber

For OCR capabilities (scanned/image PDFs), also install:

python -m pip install pymupdf openai

Initial Setup for OCR

Before using --ocr, you must provide a vision API credential. There are three ways:

  1. Environment variables (recommended for temporary use):

    export OCR_API_KEY="your-openrouter-api-key"
    export OCR_MODEL="qwen/qwen3.6-plus:free"   # optional
    export OCR_BASE_URL="https://openrouter.ai/api/v1"   # optional
    
  2. Config file (persistent, skill-specific):
    Create skills/skills/pdf-miner/config.json with:

    {
      "vision_api_key": "your-openrouter-api-key",
      "vision_model": "qwen/qwen3.6-plus:free",
      "vision_base_url": "https://openrouter.ai/api/v1"
    }
    
  3. Command-line arguments (override per invocation):

    python scripts/extract_pdf.py scanned.pdf --ocr --ocr-api-key "sk-..." --ocr-model "stepfun/step-3.5-flash:free"
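
The three credential sources above can be thought of as layers. The sketch below shows one way they might be merged, assuming command-line values win, then environment variables, then config.json. Only the command-line override is stated in this document; the relative order of the environment and the config file is an assumption, and resolve_ocr_settings is a hypothetical helper, not a function from the skill.

```python
import json
import os
from pathlib import Path
from typing import Mapping, Optional

# Maps each setting to its environment variable and config.json key,
# using the names listed in the setup instructions.
SETTINGS = {
    "api_key": ("OCR_API_KEY", "vision_api_key"),
    "model": ("OCR_MODEL", "vision_model"),
    "base_url": ("OCR_BASE_URL", "vision_base_url"),
}


def resolve_ocr_settings(
    cli: Mapping[str, Optional[str]],
    env: Mapping[str, str] = os.environ,
    config_path: Optional[Path] = None,
) -> dict:
    """Hypothetical helper: merge OCR settings from the three sources.

    Assumed precedence: CLI value, then environment variable, then config file.
    Settings found in none of the sources resolve to None.
    """
    file_cfg = {}
    if config_path is not None and config_path.is_file():
        file_cfg = json.loads(config_path.read_text(encoding="utf-8"))

    resolved = {}
    for name, (env_var, cfg_key) in SETTINGS.items():
        resolved[name] = cli.get(name) or env.get(env_var) or file_cfg.get(cfg_key)
    return resolved
```

Passing env and config_path explicitly keeps the function easy to exercise without touching the real environment.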
    

Usage

Run commands from this skill directory.

Basic Extraction

# Full extraction (text + tables)
python scripts/extract_pdf.py input.pdf

# Output to custom path
python scripts/extract_pdf.py input.pdf output.md

# Specific pages
python scripts/extract_pdf.py input.pdf --pages 1-5,10,15-20

# Text or tables only
python scripts/extract_pdf.py input.pdf --text-only
python scripts/extract_pdf.py input.pdf --tables-only
python scripts/extract_pdf.py input.pdf --tables-only --json
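
The --pages value shown above mixes single pages and ranges. As a rough sketch of how such a spec could be expanded, the hypothetical parse_page_spec helper below assumes pages are 1-based and ranges are inclusive; neither detail is confirmed by this document.

```python
def parse_page_spec(spec: str) -> list[int]:
    """Hypothetical helper: expand '1-5,10,15-20' into sorted page numbers.

    Assumes 1-based page numbers and inclusive ranges.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    if any(page < 1 for page in pages):
        raise ValueError("page numbers start at 1")
    return sorted(pages)
```

Collecting into a set means overlapping ranges such as 1-5,3-7 yield each page once.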

Advanced Modes

# Search: find pages containing keywords with context
python scripts/extract_pdf.py report.pdf --search "Vietnam export penetration"

# Metrics: extract lines with keywords + numeric values
python scripts/extract_pdf.py report.pdf --metrics "market size growth export penetration"

# TOC: extract table of contents / chapter structure (robust, multi-format)
python scripts/extract_pdf.py report.pdf --toc
# Optionally adjust sensitivity (default: 3 entries per page required)
python scripts/extract_pdf.py report.pdf --toc --toc-min-entries 2

# Diff: compare two PDFs, show pages unique to each
python scripts/extract_pdf.py old_version.pdf new_version.pdf --diff

# Chunk: split output into LLM-friendly chunks
python scripts/extract_pdf.py report.pdf --chunk             # single file, 8000 chars each
python scripts/extract_pdf.py report.pdf --chunk --max-chars 4000
python scripts/extract_pdf.py report.pdf --chunk --output-dir ./chunks   # separate files

# Clean headers/footers
python scripts/extract_pdf.py report.pdf --clean-headers

# Batch: process multiple PDFs
python scripts/extract_pdf.py file1.pdf file2.pdf file3.pdf --output-dir ./extracted
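
For readers curious about what --chunk and --max-chars imply, here is a minimal sketch of character-bounded chunking. It assumes chunks are packed from whole paragraphs (separated by blank lines) and that an oversized paragraph is cut at the limit; the script's real splitting rules are not described here, and split_into_chunks is a hypothetical name.

```python
def split_into_chunks(text: str, max_chars: int = 8000) -> list[str]:
    """Hypothetical helper: pack paragraphs into chunks of at most max_chars."""
    if max_chars < 1:
        raise ValueError("max_chars must be positive")
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if not paragraph.strip():
            continue  # skip empty paragraphs
        # Hard-split any single paragraph that is longer than the limit.
        pieces = [
            paragraph[i:i + max_chars]
            for i in range(0, len(paragraph), max_chars)
        ]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Breaking on paragraph boundaries first keeps related sentences together, which tends to matter more for LLM input than filling every chunk to the limit.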

OCR for Scanned/Image PDFs (Automatic by Default)

Metadata

Stars: 4473
Views: 0
Updated: 2026-05-01
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-baichenwzj-pdf-miner": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.