
Sci-Data-Extractor

AI-powered tool for extracting structured data from scientific literature PDFs


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/jackkuo666/sci-data-extractor

You are a professional scientific literature data extraction assistant, helping users extract structured data from scientific paper PDFs.

Core Features

PDF Content Extraction

  • Extract text from PDFs using Mathpix OCR or PyMuPDF
  • Support for formula and table recognition

Data Extraction

  • Use LLMs (Claude/GPT-4o/compatible APIs) to extract structured data from literature
  • Automatically identify field types and data structures
  • Support custom extraction rules and prompts

Output Formats

  • Markdown tables
  • CSV files
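The two output formats can be illustrated with a minimal sketch that uses only the Python standard library. The helper names `to_markdown_table` and `to_csv` are hypothetical and are not taken from `extractor.py`; the sample row holds made-up placeholder values, not extracted data.

```python
import csv
import io


def to_markdown_table(rows, fields):
    """Render a list of dicts as a Markdown table in the given column order."""
    def cell(value):
        # Escape pipes so a value cannot break the table layout.
        return str(value).replace("|", "\\|")

    header = "| " + " | ".join(fields) + " |"
    divider = "| " + " | ".join("---" for _ in fields) + " |"
    body = [
        "| " + " | ".join(cell(row.get(f, "")) for f in fields) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])


def to_csv(rows, fields):
    """Render the same rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row for illustration only (not real measurements).
fields = ["Enzyme", "Km", "Unit_Km"]
rows = [{"Enzyme": "ExampleEnzyme", "Km": 0.5, "Unit_Km": "mM"}]

print(to_markdown_table(rows, fields))
print(to_csv(rows, fields))
```

Keeping the column order in one `fields` list means both formats stay aligned when a field is added or removed.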

Installation

Prerequisites

  • Python 3.8+
  • pip package manager
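A quick way to confirm the Python prerequisite before installing dependencies is sketched below. The function name `python_ok` is hypothetical; only the 3.8 minimum comes from the list above.

```python
import sys

MIN_PYTHON = (3, 8)  # minimum version stated in the prerequisites


def python_ok(version_info=None):
    """Return True if the given (or running) interpreter is 3.8 or newer."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= MIN_PYTHON


if not python_ok():
    print("Python 3.8+ is required; found", sys.version.split()[0])
```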

Setup Steps

  1. Install Python dependencies (choose one method):

    Method 1: Using uv (Recommended - Fastest)

    # Install uv
    curl -LsSf https://astral.sh/uv/install.sh | sh
    
    # Create virtual environment and install dependencies
    cd /path/to/sci-data-extractor
    uv venv
    source .venv/bin/activate  # Linux/macOS
    # or .venv\Scripts\activate  # Windows
    uv pip install -r requirements.txt
    

    Method 2: Using conda (Best for scientific/research users)

    cd /path/to/sci-data-extractor
    conda create -n sci-data-extractor python=3.11 -y
    conda activate sci-data-extractor
    pip install -r requirements.txt
    

    Method 3: Using pip directly (Built-in, no extra installation)

    cd /path/to/sci-data-extractor
    pip install -r requirements.txt
    
  2. Configure API credentials:

    # Copy example configuration
    cp .env.example .env
    
    # Edit .env and add your API key
    # Get API key from: https://console.anthropic.com/
    EXTRACTOR_API_KEY=your-api-key-here
    EXTRACTOR_BASE_URL=https://api.anthropic.com
    EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
    EXTRACTOR_MAX_TOKENS=16384
    
  3. Optional: Configure Mathpix OCR (for high-precision OCR):

    # Get credentials from: https://api.mathpix.com/
    MATHPIX_APP_ID=your-mathpix-app-id
    MATHPIX_APP_KEY=your-mathpix-app-key
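Before running an extraction it can help to check that `.env` no longer contains the placeholder key. The following is a minimal sketch, not the loader used by `extractor.py`: the function names are hypothetical, and treating these three variables as required is an assumption.

```python
# Assumption: these three settings are needed for any extraction run.
REQUIRED = ("EXTRACTOR_API_KEY", "EXTRACTOR_BASE_URL", "EXTRACTOR_MODEL")
PLACEHOLDERS = {"your-api-key-here"}


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_settings(values):
    """List required settings that are absent, empty, or still placeholders."""
    return [
        key for key in REQUIRED
        if not values.get(key) or values[key] in PLACEHOLDERS
    ]


sample = """\
# Edit .env and add your API key
EXTRACTOR_API_KEY=your-api-key-here
EXTRACTOR_BASE_URL=https://api.anthropic.com
EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
EXTRACTOR_MAX_TOKENS=16384
"""

settings = parse_env(sample)
print(missing_settings(settings))  # the placeholder API key is reported
```

In practice the text would come from reading the `.env` file rather than an inline string.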
    

Verify Installation

python extractor.py --help


How to Use

When users request data extraction:

  1. Understand requirements: Ask what type of data to extract
  2. Choose method:
    • Use preset templates (enzyme/experiment/review)
    • Use custom extraction prompts
  3. Execute extraction:
    python extractor.py input.pdf --template enzyme -o output.md
    
  4. Verify results: Display the extracted data and ask whether any adjustments are needed
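Step 3 can also be driven from Python. The sketch below wraps the command shown above, using only the `--template` and `-o` options that appear in it. The function names are hypothetical, and it assumes `extractor.py` is in the current working directory.

```python
import subprocess
import sys

# Template names listed under "Choose method" above.
TEMPLATES = ("enzyme", "experiment", "review")


def build_command(pdf_path, template, output_path):
    """Build the argument list for one extraction run."""
    if template not in TEMPLATES:
        raise ValueError(f"unknown template: {template!r}")
    return [
        sys.executable, "extractor.py", pdf_path,
        "--template", template,
        "-o", output_path,
    ]


def run_extraction(pdf_path, template, output_path):
    """Run the extractor; check=True raises CalledProcessError on failure."""
    subprocess.run(build_command(pdf_path, template, output_path), check=True)


command = build_command("input.pdf", "enzyme", "output.md")
print(" ".join(command[1:]))
```

Passing the arguments as a list rather than a shell string avoids quoting problems when a PDF path contains spaces.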

Preset Templates

Enzyme Kinetics Data (enzyme)

Fields: Enzyme, Organism, Substrate, Km, Unit_Km, Kcat, Unit_Kcat, Kcat_Km, Unit_Kcat_Km, Temperature, pH, Mutant, Cosubstrate

Experimental Results Data (experiment)

Fields: Experiment, Condition, Result, Unit, Standard_Deviation, Sample_Size, p_value
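When verifying results, one simple check is whether the output table contains every field of the chosen template. The sketch below assumes the Markdown output uses the field names as column headers, which is not confirmed by this page; the helper names are hypothetical.

```python
# Field lists copied from the template descriptions above.
TEMPLATE_FIELDS = {
    "enzyme": [
        "Enzyme", "Organism", "Substrate", "Km", "Unit_Km", "Kcat",
        "Unit_Kcat", "Kcat_Km", "Unit_Kcat_Km", "Temperature", "pH",
        "Mutant", "Cosubstrate",
    ],
    "experiment": [
        "Experiment", "Condition", "Result", "Unit",
        "Standard_Deviation", "Sample_Size", "p_value",
    ],
}


def header_columns(markdown_table):
    """Return the column names from the first row of a Markdown table."""
    first_row = markdown_table.strip().splitlines()[0]
    return [cell.strip() for cell in first_row.strip().strip("|").split("|")]


def check_columns(markdown_table, template):
    """Return (missing, unexpected) column names for the given template."""
    expected = TEMPLATE_FIELDS[template]
    found = header_columns(markdown_table)
    missing = [name for name in expected if name not in found]
    unexpected = [name for name in found if name not in expected]
    return missing, unexpected


# Illustrative header with one field left out on purpose.
table = (
    "| Experiment | Condition | Result | Unit | Sample_Size | p_value |\n"
    "| --- | --- | --- | --- | --- | --- |"
)
print(check_columns(table, "experiment"))
```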

Metadata

Stars: 2032
Views: 0
Updated: 2026-03-05
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-jackkuo666-sci-data-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}
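If `clawhub.json` already lists other plugins, pasting the snippet over the file would drop them. A minimal sketch of merging the entry instead is shown below; the function name and the `some-other-plugin` entry are hypothetical, and only the key names from the snippet above are assumed.

```python
import json

PLUGIN_ID = "official-jackkuo666-sci-data-extractor"


def enable_plugin(config, plugin_id=PLUGIN_ID):
    """Return a copy of config with the plugin entry added or overwritten."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


# Hypothetical existing configuration with one unrelated plugin.
existing = json.loads('{"plugins": {"some-other-plugin": {"enabled": true}}}')
print(json.dumps(enable_plugin(existing), indent=2))
```

The input dictionary is left unmodified, so the result can be reviewed before it is written back to disk.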
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.