Official Verified data analysis Safety 4/5

Pdf Ocr Layout

Skill by baokui

What This Skill Does

The Pdf Ocr Layout skill is a multimodal document parsing engine that turns static documents into structured, machine-readable data. It uses a multi-stage architecture: GLM-OCR for structural layout extraction, GLM-4.7 for logical textual reasoning, and GLM-4.6V for visual analysis. The skill extracts data tables into clean Markdown, saves charts and illustrations as separate image files, and interprets those tables and visuals within their original page context.
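One way to picture the multi-stage flow is as a routing step that sends each detected layout region to a second-stage model. The sketch below is illustrative only: the Region type, the region kinds, and the kind-to-model mapping are assumptions inferred from the description, not the skill's actual internals, and no model is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical layout region; the skill's real schema is not documented here."""
    page: int
    kind: str  # assumed values: "table", "figure", "text"


# Assumed mapping from region kind to the model named in the description.
MODEL_FOR_KIND = {
    "table": "GLM-4.7",    # logical textual reasoning over extracted tables
    "figure": "GLM-4.6V",  # visual analysis of charts and illustrations
}


def plan_analysis(regions):
    """Return (region, model) pairs for regions that need second-stage analysis."""
    return [(r, MODEL_FOR_KIND[r.kind]) for r in regions if r.kind in MODEL_FOR_KIND]


regions = [Region(1, "text"), Region(4, "table"), Region(4, "figure")]
plan = plan_analysis(regions)
for region, model in plan:
    print(f"page {region.page}: {region.kind} -> {model}")
```

Plain text regions are left out of the plan here on the assumption that the layout stage already yields their content.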

Installation

To install this skill, use the OpenClaw CLI tool from your terminal. Ensure you have the necessary environment permissions to download packages and access the source repository:

clawhub install openclaw/skills/skills/baokui/pdf-ocr-layout

Use Cases

Financial Reporting: Automatically extract and analyze tabular financial data from quarterly PDF reports, transforming raw digits into clean Markdown for spreadsheet import.
Technical Documentation: Convert dense engineering manuals into structured knowledge bases by separating diagrams and flowcharts from descriptive text.
Academic Research: Parse research papers to extract experimental charts, using multimodal analysis to summarize visual findings in natural language.
Compliance Auditing: Efficiently scan large batches of documents to locate and interpret specific table data or imagery required for regulatory compliance.
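For the spreadsheet-import use case, a Markdown table usually needs to become CSV first. The following is a minimal sketch using only the Python standard library; it assumes simple pipe-delimited tables without escaped pipes inside cells, and the sample figures are made up for illustration.

```python
import csv
import io


def markdown_table_to_csv(markdown: str) -> str:
    """Convert a simple pipe-delimited Markdown table to CSV text."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        # Drop the leading pipe and, if present, the trailing pipe.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |---|---:|
        if all("-" in cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample table, not output from the skill.
sample = """
| Quarter | Revenue |
|---------|--------:|
| Q1 | 1,200 |
| Q2 | 1,350 |
"""
csv_text = markdown_table_to_csv(sample)
print(csv_text)
```

The csv module quotes cells that contain commas, so values such as 1,200 survive the import as a single field.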

Example Prompts

"Open the document at /data/financials.pdf, extract all the quarterly growth tables into Markdown, and analyze the trends shown in the charts on page 4."
"Look at the technical report /data/specs.png, crop all the circuit diagram images, and provide a textual explanation of each diagram's function."
"Please parse the document in /data/report.pdf and perform a logical analysis on the main data table, focusing specifically on the year-over-year cost variations."

Tips & Limitations

Pre-Processing: For best results with scanned physical documents, ensure image resolution is at least 300 DPI to allow GLM-OCR to identify elements accurately.
Large Files: For PDFs that run to hundreds of pages, consider splitting the file into smaller chunks, as processing every page in a single run may exceed temporary memory buffers.
Dependencies: This skill relies on external GLM models; ensure your API keys or cloud environment configurations are correctly set up to communicate with the Zhipu AI inference services.
Output Management: The output_dir parameter is mandatory. Ensure your environment has write access to the target directory to save the cropped assets.
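The tips above can be turned into a small pre-flight check that runs before the skill is invoked. This is a sketch under stated assumptions: the helper names are hypothetical, the 100-page chunk size is an arbitrary example, and none of this is part of the skill itself.

```python
import os
import tempfile


def effective_dpi(width_px: int, width_in: float) -> float:
    """Estimate scan resolution from pixel width and physical page width in inches."""
    return width_px / width_in


def chunk_ranges(total_pages: int, chunk_size: int):
    """Return 1-based inclusive (first, last) page ranges covering the document."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def ensure_writable_dir(path: str) -> bool:
    """Create the directory if needed and confirm a file can be written in it."""
    try:
        os.makedirs(path, exist_ok=True)
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


# A US Letter page (8.5 in wide) scanned at 2550 px wide meets the 300 DPI guideline.
print(effective_dpi(2550, 8.5) >= 300)

# A 250-page PDF split into chunks of at most 100 pages (chunk size is an assumption).
print(chunk_ranges(250, 100))

# Check a candidate output_dir before handing it to the skill.
candidate = os.path.join(tempfile.mkdtemp(), "ocr_output")
print(ensure_writable_dir(candidate))
```

The page ranges can then be passed to whatever PDF splitting tool is already in use; the sketch deliberately stops short of splitting the file itself.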

Metadata

Author: @baokui

Stars: 4473

Updated: 2026-05-01

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-baokui-pdf-ocr-layout": {
      "enabled": true,
      "auto_update": true
    }
  }
}
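If clawhub.json already lists other plugins, pasting the snippet over the file would drop them. The sketch below merges the entry into an existing configuration instead. It assumes the file is plain JSON with a top-level "plugins" object as shown above; the "some-other-plugin" entry is a hypothetical placeholder.

```python
import json

# The entry from the snippet above.
SNIPPET = {
    "plugins": {
        "official-baokui-pdf-ocr-layout": {"enabled": True, "auto_update": True}
    }
}


def merge_plugins(config: dict, snippet: dict) -> dict:
    """Return a copy of config with the snippet's plugin entries added or replaced."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins.update(snippet.get("plugins", {}))
    merged["plugins"] = plugins
    return merged


# Hypothetical existing configuration with one unrelated plugin.
existing = {"plugins": {"some-other-plugin": {"enabled": True}}}
merged = merge_plugins(existing, SNIPPET)
print(json.dumps(merged, indent=2))
```

In practice the existing dictionary would come from json.load on the real file, and the merged result would be written back with json.dump.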

Tags (AI)

#ocr #pdf-parsing #data-extraction #multimodal #document-analysis

Safety Score: 4/5

Flags: file-write, file-read, external-api

Related Skills

pdf-process-mineru

PDF document parsing tool based on local MinerU, supports converting PDF to Markdown, JSON, and other machine-readable formats.

baokui 4473

llm-video-generator

Generate videos from text descriptions using the ZhipuAI CogVideoX-3 model. Supports text-to-video, image-to-video, and first/last frame-to-video generation. Automatically handles long videos (over 5s) by chaining multiple generation calls with last-frame continuation. Use when the user asks to create or generate a video from text, make a video, text-to-video, 文生视频 (text-to-video), 生成视频 (generate a video), 做个视频 (make a video), or makes any request involving converting text or images into a video. Supports configuring video content, style, resolution (up to 4K), frame rate (30/60fps), audio, and duration.

baokui 4473

wan-t2i

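Alibaba Cloud DashScope Wan2.6 text-to-image tool. Generates images using the Wan2.6-t2i model on the Alibaba Cloud Bailian platform. Triggered when the user needs AI image generation, text-to-image, or an image generated from text. Requires the DASHSCOPE_API_KEY environment variable (already configured in the system).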

baokui 4473

glm-v-model

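Skill for calling the Zhipu GLM-4V/4.6V vision models. Used for image and video understanding, multimodal dialogue, chart analysis, and similar tasks. Use this skill when the user mentions: image understanding, image recognition, vision models, GLM-4V, GLM-4.6V, multimodal analysis, describing images, chart analysis, or video understanding.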

baokui 4473