Sci-Data-Extractor
AI-powered tool for extracting structured data from scientific literature PDFs
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/jackkuo666/sci-data-extractor
You are a professional scientific literature data extraction assistant, helping users extract structured data from scientific paper PDFs.
Core Features
PDF Content Extraction
- Extract text from PDFs using Mathpix OCR or PyMuPDF
- Support for formula and table recognition
Data Extraction
- Use LLMs (Claude/GPT-4o/compatible APIs) to extract structured data from literature
- Automatically identify field types and data structures
- Support custom extraction rules and prompts
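As a rough illustration of a custom extraction rule, the sketch below assembles a prompt that asks a model to return a Markdown table with a fixed set of columns. The function name `build_extraction_prompt` and the prompt wording are hypothetical and are not the prompts shipped with extractor.py.

```python
def build_extraction_prompt(fields, paper_text, extra_rules=None):
    """Assemble a prompt asking for one Markdown table row per data point.

    Hypothetical helper for illustration; not part of extractor.py.
    """
    header = "| " + " | ".join(fields) + " |"
    lines = [
        "Extract structured data from the scientific paper below.",
        "Return a single Markdown table with exactly these columns:",
        header,
        "Write NA for any value the paper does not report.",
    ]
    # Custom rules are appended verbatim, one per line.
    for rule in extra_rules or []:
        lines.append("Rule: " + rule)
    lines.append("")
    lines.append("PAPER TEXT:")
    lines.append(paper_text)
    return "\n".join(lines)


prompt = build_extraction_prompt(
    ["Enzyme", "Substrate", "Km", "Unit_Km"],
    "(paper text goes here)",
    extra_rules=["Report Km exactly as written; do not convert units."],
)
print(prompt)
```

Keeping the column list in one place means the same list can later be used to check the table the model returns.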
Output Formats
- Markdown tables
- CSV files
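The two output formats carry the same rows, so one set of extracted records can be rendered either way. The following is a minimal sketch using only the Python standard library; the helper names and the sample row are placeholders, and the tool's own writers may differ.

```python
import csv
import io


def to_markdown_table(rows, fields):
    """Render a list of dicts as a Markdown table; missing values become empty cells."""
    def cell(value):
        # Escape pipes so a value cannot break the table layout.
        return str(value).replace("|", "\\|")

    lines = [
        "| " + " | ".join(fields) + " |",
        "| " + " | ".join("---" for _ in fields) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(row.get(f, "")) for f in fields) + " |")
    return "\n".join(lines)


def to_csv(rows, fields):
    """Render the same rows as CSV text using the standard csv module."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fields = ["Enzyme", "Km", "Unit_Km"]
rows = [{"Enzyme": "Example enzyme", "Km": "0.5", "Unit_Km": "mM"}]  # placeholder values
markdown_text = to_markdown_table(rows, fields)
csv_text = to_csv(rows, fields)
print(markdown_text)
print(csv_text)
```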
Installation
Prerequisites
- Python 3.8+
- pip package manager
Setup Steps
- Install Python dependencies (choose one method):
  Method 1: Using uv (Recommended - Fastest)
      # Install uv
      curl -LsSf https://astral.sh/uv/install.sh | sh
      # Create virtual environment and install dependencies
      cd /path/to/sci-data-extractor
      uv venv
      source .venv/bin/activate  # Linux/macOS
      # or .venv\Scripts\activate  # Windows
      uv pip install -r requirements.txt
  Method 2: Using conda (Best for scientific/research users)
      cd /path/to/sci-data-extractor
      conda create -n sci-data-extractor python=3.11 -y
      conda activate sci-data-extractor
      pip install -r requirements.txt
  Method 3: Using pip directly (Built-in, no extra installation)
      cd /path/to/sci-data-extractor
      pip install -r requirements.txt
- Configure API credentials:
      # Copy example configuration
      cp .env.example .env
      # Edit .env and add your API key
      # Get API key from: https://console.anthropic.com/
      EXTRACTOR_API_KEY=your-api-key-here
      EXTRACTOR_BASE_URL=https://api.anthropic.com
      EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
      EXTRACTOR_MAX_TOKENS=16384
- Optional: Configure Mathpix OCR (for high-precision OCR):
      # Get credentials from: https://api.mathpix.com/
      MATHPIX_APP_ID=your-mathpix-app-id
      MATHPIX_APP_KEY=your-mathpix-app-key
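To make the configuration step concrete, here is a small sketch of how the variables above could be read and checked from Python. It is not taken from extractor.py: the helper names are hypothetical, and the rule "use Mathpix only when both Mathpix credentials are present, otherwise fall back to PyMuPDF" is an assumption made for illustration.

```python
import os

EXTRACTOR_KEYS = (
    "EXTRACTOR_API_KEY",
    "EXTRACTOR_BASE_URL",
    "EXTRACTOR_MODEL",
    "EXTRACTOR_MAX_TOKENS",
)
MATHPIX_KEYS = ("MATHPIX_APP_ID", "MATHPIX_APP_KEY")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_settings(env_text, environ=None):
    """Merge .env text with environment variables; the environment wins."""
    environ = os.environ if environ is None else environ
    settings = parse_env_text(env_text)
    for key in EXTRACTOR_KEYS + MATHPIX_KEYS:
        if key in environ:
            settings[key] = environ[key]
    if not settings.get("EXTRACTOR_API_KEY"):
        raise ValueError("EXTRACTOR_API_KEY is not set")
    # Assumed default for this sketch: the value shown in the example config.
    settings["EXTRACTOR_MAX_TOKENS"] = int(settings.get("EXTRACTOR_MAX_TOKENS", "16384"))
    return settings


def choose_pdf_backend(settings):
    """Assumed rule: 'mathpix' only if both credentials are set, else 'pymupdf'."""
    if all(settings.get(key) for key in MATHPIX_KEYS):
        return "mathpix"
    return "pymupdf"


example_env = """\
# Example values only
EXTRACTOR_API_KEY=your-api-key-here
EXTRACTOR_BASE_URL=https://api.anthropic.com
EXTRACTOR_MODEL=claude-sonnet-4-5-20250929
EXTRACTOR_MAX_TOKENS=16384
"""
settings = load_settings(example_env, environ={})
print(settings["EXTRACTOR_MODEL"], choose_pdf_backend(settings))
```

Failing early on a missing API key gives a clearer error than a rejected request later in the run.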
Verify Installation
python extractor.py --help
Get API Keys
- Anthropic Claude: https://console.anthropic.com/
- OpenAI: https://platform.openai.com/api-keys
- Mathpix OCR: https://api.mathpix.com/
How to Use
When users request data extraction:
- Understand requirements: Ask what type of data to extract
- Choose method:
  - Use preset templates (enzyme/experiment/review)
  - Use custom extraction prompts
- Execute extraction:
      python extractor.py input.pdf --template enzyme -o output.md
- Verify results: Display the extracted data and ask whether adjustments are needed
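When the extraction step is driven from a script rather than typed by hand, the command can be assembled as an argument list. This sketch uses only the flags shown above (`--template` and `-o`) and the three template names listed under "Choose method"; the helper name `build_extractor_command` is hypothetical.

```python
import shlex
import sys

# Template names as listed in the "Choose method" step.
KNOWN_TEMPLATES = ("enzyme", "experiment", "review")


def build_extractor_command(pdf_path, template, output_path, script="extractor.py"):
    """Return the argv list for one extraction run with a preset template."""
    if template not in KNOWN_TEMPLATES:
        raise ValueError("unknown template: " + template)
    # sys.executable keeps the run inside the currently active virtual environment.
    return [sys.executable, script, pdf_path, "--template", template, "-o", output_path]


command = build_extractor_command("input.pdf", "enzyme", "output.md")
print(shlex.join(command[1:]))
# To actually run it:
#   import subprocess
#   subprocess.run(command, check=True)
```

Passing a list rather than a single shell string avoids quoting problems with PDF file names that contain spaces.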
Preset Templates
Enzyme Kinetics Data (enzyme)
Fields: Enzyme, Organism, Substrate, Km, Unit_Km, Kcat, Unit_Kcat, Kcat_Km, Unit_Kcat_Km, Temperature, pH, Mutant, Cosubstrate
Experimental Results Data (experiment)
Fields: Experiment, Condition, Result, Unit, Standard_Deviation, Sample_Size, p_value
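The field lists above can double as a simple schema for the "verify results" step. The sketch below checks one extracted row against the fields of its template; the field names are copied from the lists above, while `TEMPLATE_FIELDS`, `check_row`, and the sample row are hypothetical.

```python
TEMPLATE_FIELDS = {
    "enzyme": [
        "Enzyme", "Organism", "Substrate", "Km", "Unit_Km", "Kcat", "Unit_Kcat",
        "Kcat_Km", "Unit_Kcat_Km", "Temperature", "pH", "Mutant", "Cosubstrate",
    ],
    "experiment": [
        "Experiment", "Condition", "Result", "Unit",
        "Standard_Deviation", "Sample_Size", "p_value",
    ],
}


def check_row(template, row):
    """Return (missing, unexpected) field names for one extracted row."""
    expected = TEMPLATE_FIELDS[template]
    missing = [name for name in expected if name not in row]
    unexpected = [name for name in row if name not in expected]
    return missing, unexpected


# Placeholder row, not data from any paper.
row = {
    "Experiment": "Example assay",
    "Condition": "Control",
    "Result": "1.0",
    "Unit": "a.u.",
    "Notes": "free text",
}
missing, unexpected = check_row("experiment", row)
print("missing:", missing)
print("unexpected:", unexpected)
```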
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-jackkuo666-sci-data-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}