What This Skill Does

The doc-extract-filter skill is a file processing utility that lets OpenClaw agents handle unstructured text at scale. It has two primary modes of operation: 'extract' and 'filter'. In 'extract' mode, the tool parses files in formats including PDF, DOCX, and TXT and returns their raw text. An optional OCR (Optical Character Recognition) feature lets agents process scanned PDFs whose text is not selectable. In 'filter' mode, the skill selects content according to user-defined inclusion and exclusion rules, each expressed as either an exact keyword match or a regular expression. A 'batch' mode handles multiple files or entire directories in a single job, with an option to merge the outputs into one JSON file for downstream processing.
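
The inclusion and exclusion rules described above can be illustrated with a minimal Python sketch. This is not the skill's actual implementation: the function name filter_lines and all of its parameters are hypothetical, chosen only to show how keyword and regex rules might combine. It assumes a unit is kept when it matches at least one inclusion rule and no exclusion rule.

```python
import re
from typing import Iterable, List


def filter_lines(
    lines: Iterable[str],
    include_keywords: Iterable[str] = (),
    include_patterns: Iterable[str] = (),
    exclude_keywords: Iterable[str] = (),
    exclude_patterns: Iterable[str] = (),
) -> List[str]:
    """Hypothetical helper: keep lines matching any inclusion rule and no exclusion rule."""
    inc_kw = list(include_keywords)
    exc_kw = list(exclude_keywords)
    # Compile regexes once so they are not re-parsed for every line.
    inc_re = [re.compile(p) for p in include_patterns]
    exc_re = [re.compile(p) for p in exclude_patterns]

    def matches(line, keywords, patterns):
        # Exact (case-sensitive) substring match for keywords, regex search for patterns.
        return any(k in line for k in keywords) or any(p.search(line) for p in patterns)

    kept = []
    for line in lines:
        # With no inclusion rules at all, every line is a candidate.
        included = (not inc_kw and not inc_re) or matches(line, inc_kw, inc_re)
        if included and not matches(line, exc_kw, exc_re):
            kept.append(line)
    return kept


sample = [
    "Invoice 1001: VAT 20.00",
    "Invoice 1002: Tax 5.00 (Refund)",
    "Invoice 1003: shipping 4.50",
    "ERROR 2024-01-05 disk full",
]
result = filter_lines(
    sample,
    include_keywords=["Tax", "VAT"],
    include_patterns=[r"^ERROR\b"],
    exclude_keywords=["Refund"],
)
print(result)
```

In this sketch the second invoice is dropped because the exclusion keyword takes precedence over the inclusion match; whether the skill resolves such conflicts the same way is an assumption.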

Installation

You can integrate this skill into your OpenClaw environment by running the following command in your terminal:

clawhub install openclaw/skills/skills/bigclawd/doc-extract-filter

If you intend to use the enable_ocr flag, make sure your environment has the dependencies required for PDF parsing and OCR.

Use Cases

This skill is highly effective for document-intensive workflows. Common use cases include:
1) Automated document auditing, where an agent extracts specific clause text from a directory of contracts.
2) Log file analysis, where regex filters isolate error messages from large technical logs.
3) Data migration prep, where unstructured reports are converted into structured, filtered JSON datasets.
4) Research automation, where an agent processes hundreds of PDFs to find mentions of specific entities while ignoring irrelevant boilerplate text.

Example Prompts

"Extract all text from the PDF at /documents/annual_report.pdf and convert it into a clean format, ensuring you enable OCR if the file is scanned."
"Look through the /data/invoices directory. Filter for any lines containing 'Tax' or 'VAT' while excluding any rows that contain 'Refund', and save the combined results to /data/output/summary.json."
"Process the file 'project_notes.txt'. Use a regex pattern to pull out all email addresses found within the document and present them in a list."
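
To give a sense of what the last prompt asks for, here is a minimal Python sketch of regex-based email extraction. The pattern and the helper name extract_emails are illustrative assumptions, not part of the skill; the pattern is deliberately simple and does not cover every address that the email standards allow.

```python
import re

# A pragmatic pattern for common addresses; an assumption for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Hypothetical helper: return unique addresses in order of first appearance."""
    seen = set()
    ordered = []
    for match in EMAIL_RE.findall(text):
        key = match.lower()  # treat addresses as case-insensitive when de-duplicating
        if key not in seen:
            seen.add(key)
            ordered.append(match)
    return ordered


notes = (
    "Contact alice@example.com or Bob <bob.smith@example.org>. "
    "Alice again: alice@example.com."
)
emails = extract_emails(notes)
print(emails)
```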

Tips & Limitations

For best results, always specify the 'filter_level' parameter: 'paragraph' suits qualitative document analysis, while 'line' suits log files and lists. Large-scale OCR is computationally expensive and can increase processing time significantly. When using regex filters, avoid patterns that are prone to catastrophic backtracking on very long text, such as nested quantifiers like (a+)+. Use the merge_results flag with a clean output directory so that existing data is not overwritten. Before starting a batch job, verify that every file path is accessible to the OpenClaw agent instance.
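
The difference between the two 'filter_level' settings can be sketched in Python as follows. The helper split_units is hypothetical, and the splitting rules are assumptions (a 'line' ends at a newline, a 'paragraph' ends at a blank line); the skill's own definitions may differ.

```python
import re
from typing import List


def split_units(text: str, filter_level: str = "line") -> List[str]:
    """Hypothetical helper: split text into the units that a filter is applied to."""
    if filter_level == "line":
        units = text.splitlines()
    elif filter_level == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        units = re.split(r"\n\s*\n", text)
    else:
        raise ValueError(f"unsupported filter_level: {filter_level!r}")
    # Drop empty units and surrounding whitespace.
    return [u.strip() for u in units if u.strip()]


doc = "Clause 1. Payment is due\nwithin 30 days.\n\nClause 2. Late fees apply."
by_line = split_units(doc, "line")
by_paragraph = split_units(doc, "paragraph")
print(len(by_line), len(by_paragraph))
```

Under these assumptions, a keyword match on "30 days" at the 'paragraph' level returns the whole of Clause 1, while the same match at the 'line' level returns only the second half of the sentence, which is why 'paragraph' tends to fit prose and 'line' fits logs.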
