pdf-product-catalog
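Automatically extracts product information from PDF product catalogs (mold drawings) and generates a structured knowledge base and data for filling Excel sheets.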
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/cjboy007/ssa-pdf-catalog
PDF Product Catalog Extraction Skill
Skill name: pdf-product-catalog
Version: 1.0
Date: 2026-03-23
Author: IRON 💪
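📋 Skill Description
Automatically extracts product information from PDF product catalogs (mold drawings) and generates a structured knowledge base and data for filling Excel sheets.
Use cases:
- Batch processing of product catalog PDFs
- Extracting information from mold drawings
- Building a product knowledge base
- Filling Excel sheets with product data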
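🎯 Core Features
- PDF text extraction: pdftotext (for vector PDFs), with Docling OCR as a fallback for image-based PDFs
- Key field recognition: mold number, packaging spec, customer item name, length
- Error exclusion: detects packaging specs (BJ0599-XXXX) so they are not mistaken for mold numbers
- Knowledge base generation: Markdown entries plus structured JSON data
- Excel filling: automatically fills in the SKU → mold number mapping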
📁 File Structure
pdf-product-catalog/
├── SKILL.md # Skill documentation (this file)
├── scripts/
│ └── extract.py # Main extraction script
├── examples/
│ ├── sample_input.json # Sample input
│ └── sample_output.md # Sample output
└── output/ # Output directory (created at runtime)
🔧 Usage
1️⃣ Basic usage
python3 skills/pdf-product-catalog/scripts/extract.py \
--pdf-dir "/path/to/pdf/files" \
--output-dir "/path/to/output"
2️⃣ All options
python3 scripts/extract.py \
--pdf-dir "/path/to/pdfs" \
--output-dir "/path/to/output" \
--excel-path "/path/to/excel.xlsx" \
--ocr-threshold 300 \
--verbose
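3️⃣ Parameters
| Parameter | Required | Description | Default |
|---|---|---|---|
| `--pdf-dir` | ✅ | Directory containing the PDF files | - |
| `--output-dir` | ✅ | Output directory | - |
| `--excel-path` | ❌ | Path to the Excel file to fill | None |
| `--ocr-threshold` | ❌ | Enable OCR when the extracted text has fewer characters than this | 300 |
| `--verbose` | ❌ | Verbose output mode | False |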
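The table above maps naturally onto an `argparse` definition. The following is a minimal sketch of how such a parser could look, not the actual contents of `extract.py`; the function name `build_parser` and the help strings are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(description="Extract product data from PDF catalogs")
    parser.add_argument("--pdf-dir", required=True, help="Directory containing the PDF files")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    parser.add_argument("--excel-path", default=None, help="Excel file to fill (optional)")
    parser.add_argument("--ocr-threshold", type=int, default=300,
                        help="Enable OCR when extracted text has fewer characters than this")
    parser.add_argument("--verbose", action="store_true", help="Verbose output mode")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list so the example runs without a real command line.
    args = build_parser().parse_args(["--pdf-dir", "pdfs", "--output-dir", "out"])
    print(args.pdf_dir, args.output_dir, args.excel_path, args.ocr_threshold, args.verbose)
```

Using `action="store_true"` for `--verbose` gives the documented default of False without requiring a value on the command line.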
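📊 Extraction Pipeline
Step 1: PDF text extraction
import subprocess
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Prefer pdftotext: fast and accurate for vector PDFs
result = subprocess.run(['pdftotext', pdf_path, '-'], capture_output=True, text=True)
text = result.stdout
# If the text is too short (< 300 characters by default), fall back to OCR
if len(text) < ocr_threshold:
    # Render the PDF to PNG, then run Docling OCR.
    # pdftoppm treats its last argument as an output prefix and appends the page
    # number and extension, so pass converter.convert() the file it actually wrote.
    subprocess.run(['pdftoppm', '-png', '-r', '300', pdf_path, img_path])
    ocr_result = converter.convert(img_path)
    text = ocr_result.document.export_to_markdown()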
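The fallback decision in Step 1 can be isolated from the external tools so that it is easy to test. Below is a minimal sketch under that assumption: `extract_text`, `vector_extract`, and `ocr_extract` are hypothetical names, and the two callables stand in for the pdftotext and Docling calls shown above.

```python
from typing import Callable, Tuple


def extract_text(pdf_path: str,
                 vector_extract: Callable[[str], str],
                 ocr_extract: Callable[[str], str],
                 ocr_threshold: int = 300) -> Tuple[str, str]:
    """Return (text, method); OCR runs only when the vector text is below the threshold."""
    text = vector_extract(pdf_path)
    if len(text) >= ocr_threshold:
        return text, "pdftotext"
    # Too little text: the PDF is probably image-based, so use the OCR path instead.
    return ocr_extract(pdf_path), "ocr"


if __name__ == "__main__":
    # Stub extractors replace the real tools so the example runs anywhere.
    text, method = extract_text(
        "drawing.pdf",
        vector_extract=lambda path: "short",
        ocr_extract=lambda path: "MODEL NO. HD-1234A",
        ocr_threshold=10,
    )
    print(method, text)
```

Passing the extractors in as callables keeps the threshold logic independent of whether pdftotext or Docling is installed on the machine running the tests.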
Step 2: Key field extraction
import re
# 1. Mold number (MODEL NO.): highest priority
model_match = re.search(r'MODEL\s+NO\.?\s*[:\|\s\n]*([A-Z]{2,}-\d+[A-Z]?)', text, re.IGNORECASE)
# 2. Packaging spec (Package No.): BJ0599-XXXX format
pkg_matches = re.findall(r'(BJ0599-\d{4})', text)
# 3. Customer item name (CUSTOMER ITEM)
ci_match = re.search(r'CUSTOMER ITEM\s*\n([A-Za-z0-9\-]+)', text, re.IGNORECASE)
# 4. Length (LENGTH); findall returns (value, unit) tuples because the pattern has two groups
length_matches = re.findall(r'(\d{2,4})\s*\+\d*\s*-\d*\s*(mm)?', text)
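To see how the four patterns behave together, the sketch below wraps them in one function and runs them on a made-up text fragment. The function name `extract_fields`, the returned dictionary keys, the de-duplication of packaging specs, and the sample text are all assumptions for illustration; only the regular expressions come from the step above.

```python
import re
from typing import Any, Dict

MODEL_RE = re.compile(r'MODEL\s+NO\.?\s*[:\|\s\n]*([A-Z]{2,}-\d+[A-Z]?)', re.IGNORECASE)
PACKAGE_RE = re.compile(r'(BJ0599-\d{4})')
CUSTOMER_ITEM_RE = re.compile(r'CUSTOMER ITEM\s*\n([A-Za-z0-9\-]+)', re.IGNORECASE)
LENGTH_RE = re.compile(r'(\d{2,4})\s*\+\d*\s*-\d*\s*(mm)?')


def extract_fields(text: str) -> Dict[str, Any]:
    """Apply the four patterns and return the captured values (illustrative sketch)."""
    model = MODEL_RE.search(text)
    item = CUSTOMER_ITEM_RE.search(text)
    return {
        "model_no": model.group(1) if model else None,
        # De-duplicate packaging specs that appear more than once (an assumption).
        "package_specs": sorted(set(PACKAGE_RE.findall(text))),
        "customer_item": item.group(1) if item else None,
        # findall yields (value, unit) tuples; keep only the numeric value.
        "lengths": [value for value, _unit in LENGTH_RE.findall(text)],
    }


if __name__ == "__main__":
    # Hypothetical text, shaped like what pdftotext might return for one drawing.
    sample = (
        "MODEL NO.: HD-1234A\n"
        "Package No. BJ0599-0012\n"
        "CUSTOMER ITEM\n"
        "CBL-77\n"
        "LENGTH 1500 +30 -0 mm\n"
    )
    print(extract_fields(sample))
```

Returning `None` for missing single-valued fields lets the later exclusion rules distinguish "not found" from "found but rejected".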
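Step 3: Error exclusion rules
# ❌ Rule 1: BJ0599-XXXX is a packaging spec, not a mold number
if model_no and model_no.startswith('BJ0599-'):
    model_no = None  # extract again
# ❌ Rule 2: the customer item name is not the mold number
if model_no and customer_item == model_no:
    # The mold number may be inside an image; re-extract it with OCR
    model_no = extract_with_ocr()
# ❌ Rule 3: strings that are too short are not mold numbers
if model_no and len(model_no) < 5:
    model_no = None  # e.g. "TP" needs manual confirmation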
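The three rules can also be expressed as one pure function that reports why a candidate was rejected. This is a sketch under stated assumptions: `validate_model_no` and the reason strings are hypothetical, and instead of calling OCR directly for Rule 2 it returns a reason so the caller can decide whether to re-extract.

```python
from typing import Optional, Tuple


def validate_model_no(model_no: Optional[str],
                      customer_item: Optional[str] = None
                      ) -> Tuple[Optional[str], Optional[str]]:
    """Return (model_no, reason); model_no is None when a rule rejects the candidate."""
    if not model_no:
        return None, "missing"
    # Rule 1: BJ0599-XXXX is a packaging spec, not a mold number.
    if model_no.startswith("BJ0599-"):
        return None, "packaging spec"
    # Rule 2: a value identical to the customer item name is not the mold number.
    if customer_item is not None and model_no == customer_item:
        return None, "same as customer item"
    # Rule 3: very short strings such as "TP" need manual confirmation.
    if len(model_no) < 5:
        return None, "too short"
    return model_no, None


if __name__ == "__main__":
    for candidate in ("HD-1234A", "BJ0599-0012", "TP"):
        print(candidate, validate_model_no(candidate, customer_item="CBL-77"))
```

Checking for a missing value first means the later rules never call string methods on `None`.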
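Step 4: Knowledge base generation
# Markdown entry
product_md = f"""# {pdf_file} Product Category Entry
## Basic Information
- **Mold number (MODEL NO.):** {model_no}
- **Packaging spec:** {', '.join(package_specs)}
- **Customer item name:** {', '.join(customer_items)}
- **Length:** {', '.join(lengths)}
## Product Category
(Detailed product information...)
"""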
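The feature list also mentions structured JSON output alongside the Markdown entry. The source does not show that part, so the following is only a sketch of what a JSON record could look like; `build_record`, `to_json`, and every key name are assumptions, not the skill's actual schema.

```python
import json
from typing import Any, Dict, List, Optional


def build_record(pdf_file: str,
                 model_no: Optional[str],
                 package_specs: List[str],
                 customer_items: List[str],
                 lengths: List[str]) -> Dict[str, Any]:
    """Collect the extracted fields into one dictionary (hypothetical schema)."""
    return {
        "source_pdf": pdf_file,
        "model_no": model_no,
        "package_specs": package_specs,
        "customer_items": customer_items,
        "lengths": lengths,
        # Flag records with no mold number so they can be reviewed manually.
        "needs_review": model_no is None,
    }


def to_json(record: Dict[str, Any]) -> str:
    """Serialize a record; ensure_ascii=False keeps any non-ASCII product names readable."""
    return json.dumps(record, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    record = build_record("drawing.pdf", "HD-1234A", ["BJ0599-0012"], ["CBL-77"], ["1500"])
    print(to_json(record))
```

Keeping the record as a plain dictionary makes it straightforward to feed the same data into both the Markdown template and the JSON file.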
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-cjboy007-ssa-pdf-catalog": {
"enabled": true,
"auto_update": true
}
}
}