Official | Verified | data analysis | Safety 4/5

data-extractor

Extract structured data from a wide range of document formats using the unstructured library for unified document processing

Why use this skill?

Use the data-extractor skill to parse PDFs, Word documents, HTML, and other formats into structured data, and to automate document processing workflows in OpenClaw.

What This Skill Does

The data-extractor skill for OpenClaw converts unstructured files into structured, machine-readable data. Built on the 'unstructured' library, it detects the file type automatically and converts each document into a standardized format. Whether the input is a multi-page PDF, an Excel spreadsheet containing tables, or an HTML page, the skill interprets the document's structure to isolate text, metadata, tables, and images. Downstream analysis can then work against one consistent interface, regardless of the source format.
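
As a rough illustration of what a standardized format buys you downstream, the sketch below groups extracted elements by type. It assumes each element has been serialized to a dict with "type", "text", and "metadata" keys, which is the general shape the unstructured library uses for its elements; the sample values and the group_by_type helper are hypothetical and not part of the skill.

```python
from collections import defaultdict

# Hypothetical sample: dicts shaped like serialized unstructured elements.
# The texts and page numbers are made up for illustration.
SAMPLE_ELEMENTS = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew in Q3.", "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Region Sales North 10 South 12", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "Costs were flat.", "metadata": {"page_number": 2}},
]


def group_by_type(elements):
    """Group element texts by their "type" field, keeping document order."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element["type"]].append(element["text"])
    return dict(grouped)


grouped = group_by_type(SAMPLE_ELEMENTS)
for element_type, texts in grouped.items():
    print(element_type, len(texts))
```

Because every source format is reduced to the same element shape, the same grouping code works whether the input was a PDF, a spreadsheet, or an HTML page.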

Installation

You can integrate this skill into your local OpenClaw environment by running the following command in your terminal:

clawhub install openclaw/skills/skills/lijie420461340/data-extractor

If you intend to process scanned images or image-heavy PDFs, make sure the dependencies that the underlying unstructured library needs for high-resolution OCR are installed and configured.

Use Cases

This skill suits data-heavy workflows such as:

Invoice data extraction: Extract line items from vendor PDFs directly into accounting software.
Academic research: Parse large volumes of research papers to pull specific tables or citations.
Content migration: Convert legacy HTML content or Word documents into a structured database format.
Email workflow automation: Automatically extract body text and attachment metadata from support or sales inquiries.
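
For the email workflow case, a downstream step might tally senders from the extracted metadata. The sketch below is a minimal example under stated assumptions: each record is a dict with a "sent_from" list of addresses (the unstructured library exposes a similarly named field for emails, but the exact shape of this skill's output is not confirmed here), and the top_senders helper and sample addresses are hypothetical.

```python
from collections import Counter


def top_senders(records, n=3):
    """Return the n most frequent senders as (address, count) pairs."""
    counts = Counter()
    for record in records:
        # Assumption: "sent_from" is a list of addresses, or missing/None.
        for sender in record.get("sent_from") or []:
            counts[sender.strip().lower()] += 1
    return counts.most_common(n)


# Hypothetical metadata records for a batch of exported emails.
records = [
    {"sent_from": ["a@example.com"]},
    {"sent_from": ["b@example.com"]},
    {"sent_from": ["A@example.com"]},
    {"sent_from": None},
    {"sent_from": ["c@example.com"]},
    {"sent_from": ["b@example.com"]},
    {"sent_from": ["a@example.com"]},
]

print(top_senders(records))
```

Addresses are lowercased before counting so that the same sender is not split across differently cased entries.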

Example Prompts

"Please scan this invoice PDF and extract the total amount, date, and vendor name into a JSON format."
"Read the attached annual report document and summarize all tables into a CSV file for my research."
"Process this batch of email exports and tell me who the top three senders are based on the metadata."

Tips & Limitations

To get the best results, specify your processing strategy (e.g., 'hi_res' for complex documents or 'fast' for simple text). Memory consumption can be high when processing very large PDF documents with many images. Make sure the files you provide are accessible to the agent. For highly distorted images, pre-process them or supply higher-quality scans to improve OCR accuracy.
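
The strategy choice can be sketched as a small helper. This is an illustration only: the choose_strategy heuristic, the list of image suffixes, and the extract_elements wrapper are hypothetical, and the skill may select strategies differently. The partition function and its strategy parameter come from the unstructured library, which must be installed for extract_elements to run.

```python
from pathlib import Path

# Assumption: image files always need OCR, so they get the slower strategy.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff"}


def choose_strategy(path, scanned=False):
    """Pick 'hi_res' for scanned or image inputs, 'fast' for plain text documents."""
    suffix = Path(path).suffix.lower()
    if scanned or suffix in IMAGE_SUFFIXES:
        return "hi_res"
    return "fast"


def extract_elements(path, scanned=False):
    """Partition a file with the chosen strategy (requires the unstructured package)."""
    # Imported lazily so choose_strategy stays usable without unstructured installed.
    from unstructured.partition.auto import partition

    return partition(filename=str(path), strategy=choose_strategy(path, scanned))


print(choose_strategy("report.pdf"))
print(choose_strategy("report.pdf", scanned=True))
```

Defaulting to 'fast' and opting in to 'hi_res' keeps memory use and runtime down for the common case of text-based documents.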

Metadata

Author: @lijie420461340

Stars: 1656

Updated: 2026-02-28

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-lijie420461340-data-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Related Skills

scrapebadger

Web scraping platform — Twitter/X data, Vinted marketplace, and general web scraping API

0xghostcasper 4473

Spreadsheet & Data Wrangling Master

Complete spreadsheet methodology — data cleanup, transformation, analysis, dashboards, automation, and reporting. Works with CSV, Excel, Google Sheets, or any tabular data. Use when the user needs to clean messy data, build reports, create dashboards, automate recurring spreadsheet tasks, or transform data between formats.

1kalin 4473

dataset-intake-auditor

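Checks fields, units, missing-value rates, outliers, and usability before a new dataset is onboarded. Use for data, dataset, and audit workflows; do not use for fabricating statistical results or replacing a formal data governance platform.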

52yuanchangxing 4473

DocPilot

Intelligent document processing expert that supports document parsing, information extraction, and document classification

ankylala 4473

olo-deal-memo

Investment memorandum generation for M&A — structured deal write-ups from the acquirer's perspective with data-backed analysis

aniebyl 4473