Pdf Ocr Layout
Skill by baokui
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/baokui/pdf-ocr-layout
What This Skill Does
The Pdf Ocr Layout skill is a multimodal document parsing engine that converts static documents into structured, machine-readable data. It uses a multi-stage architecture: GLM-OCR for structural layout extraction, GLM-4.7 for logical reasoning over text, and GLM-4.6V for visual analysis. The skill extracts data tables into clean Markdown, saves charts and illustrations as separate image files, and interprets those tables and visuals in the context of the page they came from.
Installation
To install this skill, use the OpenClaw CLI tool from your terminal. Ensure you have the necessary environment permissions to download packages and access the source repository:
clawhub install openclaw/skills/skills/baokui/pdf-ocr-layout
Use Cases
- Financial Reporting: Automatically extract and analyze tabular financial data from quarterly PDF reports, transforming raw digits into clean Markdown for spreadsheet import.
- Technical Documentation: Convert dense engineering manuals into structured knowledge bases by separating diagrams and flowcharts from descriptive text.
- Academic Research: Parse research papers to extract experimental charts, using multimodal analysis to summarize visual findings in natural language.
- Compliance Auditing: Efficiently scan large batches of documents to locate and interpret specific table data or imagery required for regulatory compliance.
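For the spreadsheet-import use case, the extracted Markdown table still has to be converted into something a spreadsheet can open. The following is a minimal sketch, assuming the skill emits ordinary pipe-delimited Markdown tables (the exact output format is not documented here). The function names and the sample table are illustrative only and are not part of the skill.

```python
import csv
import io


def markdown_table_to_rows(markdown: str) -> list[list[str]]:
    """Parse a pipe-delimited Markdown table into a list of rows.

    Assumption: every table row starts with "|" and cells contain no
    escaped pipes. Rows made up solely of dashes and colons are treated
    as the header separator and skipped.
    """
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if all(cell and set(cell) <= set("-: ") for cell in cells):
            continue  # header separator such as |---|:---:|
        rows.append(cells)
    return rows


def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample data, standing in for a table the skill might extract.
sample = """
| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1      | 120     | 4%     |
| Q2      | 135     | 12.5%  |
"""

table_rows = markdown_table_to_rows(sample)
csv_text = rows_to_csv(table_rows)
print(csv_text)
```

Going through the csv module rather than joining cells with commas keeps values that themselves contain commas or quotes correctly escaped.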
Example Prompts
- "Open the document at /data/financials.pdf, extract all the quarterly growth tables into Markdown, and analyze the trends shown in the charts on page 4."
- "Look at the technical report /data/specs.png, crop all the circuit diagram images, and provide a textual explanation of each diagram's function."
- "Please parse the document in /data/report.pdf and perform a logical analysis on the main data table, focusing specifically on the year-over-year cost variations."
Tips & Limitations
- Pre-Processing: For best results with scanned physical documents, ensure image resolution is at least 300 DPI to allow GLM-OCR to identify elements accurately.
- Large Files: For multi-hundred-page PDFs, consider splitting the file into smaller chunks, as processing every page may exceed temporary memory buffers.
- Dependencies: This skill relies on external GLM models; ensure your API keys or cloud environment configurations are correctly set up to communicate with the Zhipu AI inference services.
- Output Management: The output_dir parameter is mandatory. Ensure your environment has write access to the target directory to save the cropped assets.
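Several of the tips above can be checked before a document is submitted. The sketch below shows one possible set of pre-flight helpers: a scan-resolution check against the 300 DPI guideline, a page-range splitter for large PDFs, and a write-access probe for the output directory. All function names are hypothetical, and the chunk size of 100 pages is an arbitrary assumption, since no page limit is documented. Actually splitting the PDF would need a PDF library of your choice and is not shown.

```python
import tempfile


def effective_dpi(width_px: int, height_px: int,
                  width_in: float, height_in: float) -> float:
    """Return the lower of the horizontal and vertical scan resolutions."""
    return min(width_px / width_in, height_px / height_in)


def chunk_page_ranges(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into inclusive (first, last) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


def is_writable_dir(path: str) -> bool:
    """Probe write access by creating a temporary file inside path."""
    try:
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        # Covers a missing directory as well as insufficient permissions.
        return False


# A US Letter page (8.5 x 11 in) scanned at 2550 x 3300 pixels.
dpi = effective_dpi(2550, 3300, 8.5, 11.0)
print(f"scan resolution: {dpi:.0f} DPI, meets guideline: {dpi >= 300}")

# A 250-page document split into chunks of at most 100 pages (assumed size).
print(chunk_page_ranges(250, 100))

# Check the directory you intend to pass as output_dir.
print(is_writable_dir(tempfile.gettempdir()))
```

Probing with a real temporary file tends to be more reliable than inspecting permission bits, because it also catches read-only mounts and missing directories.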
Metadata
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-baokui-pdf-ocr-layout": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Tags (AI)
Flags: file-write, file-read, external-api
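If clawhub.json already lists other plugins, the plugin entry given earlier has to be merged into the existing "plugins" object instead of replacing the file. The sketch below shows one way to do that with the standard json module. The helper names and the "some-other-plugin" entry are hypothetical, and the location of clawhub.json is not stated here, so the path is left to the caller.

```python
import json
from pathlib import Path

# The plugin entry from the configuration fragment in this listing.
PLUGIN_ENTRY = {
    "official-baokui-pdf-ocr-layout": {"enabled": True, "auto_update": True}
}


def enable_plugin(config: dict, entry: dict) -> dict:
    """Return a copy of config with entry merged into its "plugins" object."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins.update(entry)
    merged["plugins"] = plugins
    return merged


def update_config_file(path: Path, entry: dict) -> None:
    """Read path if it exists, merge entry, and write the result back."""
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    merged = enable_plugin(config, entry)
    path.write_text(json.dumps(merged, indent=2) + "\n", encoding="utf-8")


# Hypothetical existing configuration with one unrelated plugin.
existing = {"plugins": {"some-other-plugin": {"enabled": False}}}
updated = enable_plugin(existing, PLUGIN_ENTRY)
print(json.dumps(updated, indent=2))
```

Calling update_config_file(Path("clawhub.json"), PLUGIN_ENTRY) would apply the same merge on disk, assuming the file sits in the current directory.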
Related Skills
pdf-process-mineru
PDF document parsing tool based on local MinerU, supports converting PDF to Markdown, JSON, and other machine-readable formats.
llm-video-generator
Generate videos from text descriptions using ZhipuAI CogVideoX-3 model. Supports text-to-video, image-to-video, and first/last frame-to-video generation. Automatically handles long videos (over 5s) by chaining multiple generation calls with last-frame continuation. Use when the user asks to create/generate a video from text, make a video, text-to-video, 文生视频, 生成视频, 做个视频, or any request involving converting text/images into a video. Supports configuring video content, style, resolution (up to 4K), frame rate (30/60fps), audio, and duration.
wan-t2i
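Alibaba Cloud DashScope Wan2.6 text-to-image tool. Generates images using the Wan2.6-t2i model on Alibaba Cloud's Bailian platform. Triggered when the user needs AI image generation, text-to-image, or an image generated from a text description. Requires the DASHSCOPE_API_KEY environment variable (already configured in the system).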
glm-v-model
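Skill for calling the Zhipu GLM-4V/4.6V vision models. Used for image and video understanding, multimodal dialogue, chart analysis, and similar tasks. Use this skill when the user mentions image understanding, image recognition, vision models, GLM-4V, GLM-4.6V, multimodal analysis, describing a picture, chart analysis, or video understanding.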