data-extractor
Extract structured data from any document format using unstructured - unified document processing
Why use this skill?
Use the data-extractor skill to parse PDFs, Word, HTML, and more into structured data. Easily automate document processing workflows in OpenClaw.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/lijie420461340/data-extractor
What This Skill Does
The data-extractor skill for OpenClaw converts unstructured files into structured, machine-readable data. It uses the 'unstructured' library to detect file types automatically and convert them into a standardized format. Whether the input is a multi-page PDF, an Excel spreadsheet containing tables, or a standard HTML page, the skill interprets the underlying document structure to isolate text, metadata, tables, and images. The result is a consistent representation of diverse documents that downstream analysis can rely on.
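As a rough sketch of what working with that normalized output can look like, the snippet below groups element dictionaries by type. It assumes each element has the shape produced by the unstructured library's `Element.to_dict()` (a `type`, a `text`, and a `metadata` mapping). The `group_by_type` helper and the sample records are hypothetical and are not part of the skill itself.

```python
from collections import defaultdict


def group_by_type(elements):
    """Group element dicts by their "type" field, keeping only the text."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[element.get("type", "Unknown")].append(element.get("text", ""))
    return dict(grouped)


# Hypothetical sample shaped like unstructured's element dictionaries.
sample_elements = [
    {"type": "Title", "text": "Invoice 0001", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Thank you for your order.", "metadata": {"page_number": 1}},
    {"type": "Table", "text": "Item Qty Price", "metadata": {"page_number": 2}},
]

grouped = group_by_type(sample_elements)
print(sorted(grouped))
```

With the library installed, real input for this helper could be built with `[el.to_dict() for el in partition(filename=path)]`, where `partition` comes from `unstructured.partition.auto`.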
Installation
You can integrate this skill into your local OpenClaw environment by running the following command in your terminal:
clawhub install openclaw/skills/skills/lijie420461340/data-extractor
Ensure that you have the necessary dependencies configured to allow the underlying unstructured library to perform high-resolution OCR tasks if you intend to process scanned images or image-heavy PDFs.
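One way to check the OCR prerequisites before a run is sketched below. It assumes the high-resolution path relies on the Tesseract and Poppler command-line tools (`tesseract` and `pdftoppm`), which is a common setup for the unstructured library; adjust the list for your environment. The `missing_tools` helper is hypothetical.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Assumed OCR-related executables; change this list to match your setup.
ocr_tools = ["tesseract", "pdftoppm"]
missing = missing_tools(ocr_tools)
if missing:
    print("Missing OCR dependencies:", ", ".join(missing))
else:
    print("All assumed OCR dependencies were found.")
```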
Use Cases
This skill suits data-heavy workflows such as:
- Automate invoice data extraction: Extract line items from vendor PDFs directly into accounting software.
- Academic research: Parse large volumes of research papers to pull specific tables or citations.
- Content migration: Scrape legacy HTML content or Word documents into a structured database format.
- Email workflow automation: Automatically extract body text and attachment metadata from support or sales inquiries.
Example Prompts
- "Please scan this invoice PDF and extract the total amount, date, and vendor name into a JSON format."
- "Read the attached annual report document and summarize all tables into a CSV file for my research."
- "Process this batch of email exports and tell me who the top three senders are based on the metadata."
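The last prompt amounts to a frequency count over email metadata. The sketch below assumes each metadata record exposes the sender under a `sent_from` key, the field name the unstructured library uses for email metadata, usually as a list of addresses. The `top_senders` helper and the sample addresses are hypothetical.

```python
from collections import Counter


def top_senders(metadata_records, n=3):
    """Return the n most frequent senders as (address, count) pairs."""
    counts = Counter()
    for record in metadata_records:
        senders = record.get("sent_from") or []
        # Accept a single address as well as a list of addresses.
        if isinstance(senders, str):
            senders = [senders]
        counts.update(senders)
    return counts.most_common(n)


# Hypothetical metadata records; the last one has no sender and is skipped.
records = [
    {"sent_from": ["alice@example.com"]},
    {"sent_from": ["bob@example.com"]},
    {"sent_from": ["alice@example.com"]},
    {"sent_from": "carol@example.com"},
    {"subject": "no sender recorded"},
]
print(top_senders(records))
```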
Tips & Limitations
To get the best results, specify your processing strategy (e.g., 'hi_res' for complex documents or 'fast' for simple text). Memory consumption can be high when processing very large PDF documents with many images. Make sure the files you provide are accessible to the agent. For highly distorted images, pre-process them or supply higher-quality scans to improve OCR accuracy.
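A minimal sketch of making the strategy explicit, assuming the skill forwards it to the unstructured library's `strategy` parameter. The `choose_strategy` and `extract` helpers and the extension list are hypothetical; only the strategy names 'hi_res' and 'fast' come from the text above.

```python
from pathlib import Path

# Assumption: image-like inputs need OCR, so they get the slower strategy.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tiff", ".bmp"}


def choose_strategy(path, scanned=False):
    """Pick 'hi_res' for scans and images, 'fast' for plain text documents."""
    suffix = Path(path).suffix.lower()
    if scanned or suffix in IMAGE_SUFFIXES:
        return "hi_res"
    return "fast"


def extract(path, scanned=False):
    """Partition a file with the chosen strategy (requires `unstructured`)."""
    # Imported lazily so choose_strategy works without the library installed.
    from unstructured.partition.auto import partition

    return partition(filename=str(path), strategy=choose_strategy(path, scanned))


print(choose_strategy("report.pdf", scanned=True))
print(choose_strategy("notes.html"))
```

Keeping the choice in one small function makes it easy to fall back to 'fast' for large batches where memory use is a concern.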
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-lijie420461340-data-extractor": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags
Flags: file-read
Related Skills
scrapebadger
Web scraping platform — Twitter/X data, Vinted marketplace, and general web scraping API
Spreadsheet & Data Wrangling Master
Complete spreadsheet methodology — data cleanup, transformation, analysis, dashboards, automation, and reporting. Works with CSV, Excel, Google Sheets, or any tabular data. Use when the user needs to clean messy data, build reports, create dashboards, automate recurring spreadsheet tasks, or transform data between formats.
dataset-intake-auditor
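Checks fields, units, missing-value rates, outliers, and usability before a new dataset is onboarded. Use for data, dataset, and audit workflows; do not use for fabricating statistical results or replacing a formal data governance platform.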
DocPilot
Intelligent document processing expert that supports document parsing, information extraction, and document classification.
olo-deal-memo
Investment memorandum generation for M&A — structured deal write-ups from the acquirer's perspective with data-backed analysis