pdf-product-catalog

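Automatically extracts product information from PDF product catalogs (mold drawings) and generates a structured knowledge base and data for filling Excel sheets.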

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/cjboy007/ssa-pdf-catalog
PDF Product Catalog Extraction Skill

Skill name: pdf-product-catalog
Version: 1.0
Date: 2026-03-23
Author: IRON 💪


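📋 Skill Description

Automatically extracts product information from PDF product catalogs (mold drawings) and generates a structured knowledge base and data for filling Excel sheets.

Use cases:

  • Batch processing of product catalog PDFs
  • Extracting information from mold drawings
  • Building a product knowledge base
  • Filling product data into Excel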

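🎯 Core Features

  1. PDF text extraction: pdftotext for vector PDFs, with Docling OCR as a fallback for image-based PDFs
  2. Key field recognition: mold number (MODEL NO.), package spec, customer item name, length
  3. Error exclusion: detects package specs (BJ0599-XXXX) that were mistaken for mold numbers and excludes them
  4. Knowledge base generation: Markdown entries plus structured JSON data
  5. Excel filling: automatically fills in the SKU → mold number mapping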

📁 File Structure

pdf-product-catalog/
├── SKILL.md              # Skill documentation (this file)
├── scripts/
│   └── extract.py        # Main extraction script
├── examples/
│   ├── sample_input.json # Sample input
│   └── sample_output.md  # Sample output
└── output/               # Output directory (created at runtime)

🔧 Usage

1️⃣ Basic usage

python3 skills/pdf-product-catalog/scripts/extract.py \
  --pdf-dir "/path/to/pdf/files" \
  --output-dir "/path/to/output"

2️⃣ All parameters

python3 scripts/extract.py \
  --pdf-dir "/path/to/pdfs" \
  --output-dir "/path/to/output" \
  --excel-path "/path/to/excel.xlsx" \
  --ocr-threshold 300 \
  --verbose

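3️⃣ Parameter reference

Parameter        Required  Description                                                        Default
--pdf-dir        Yes       Directory containing the PDF files                                 -
--output-dir     Yes       Output directory                                                   -
--excel-path     No        Path to the Excel file to fill                                     None
--ocr-threshold  No        Enable OCR when the extracted text has fewer characters than this  300
--verbose        No        Verbose output                                                     False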
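
The parameter table above can be mirrored with the standard-library argparse module. This is only a sketch of how such a command line might be declared, not the actual contents of scripts/extract.py; the function name build_parser and the help strings are hypothetical, and marking --pdf-dir and --output-dir as required is an assumption based on their missing defaults.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags listed in the parameter reference (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        prog="extract.py",
        description="Extract product information from PDF product catalogs.",
    )
    # Assumed to be required because the table lists no default for them.
    parser.add_argument("--pdf-dir", required=True, help="Directory containing the PDF files")
    parser.add_argument("--output-dir", required=True, help="Output directory")
    # Optional flags with the defaults given in the table.
    parser.add_argument("--excel-path", default=None, help="Path to the Excel file to fill")
    parser.add_argument(
        "--ocr-threshold",
        type=int,
        default=300,
        help="Enable OCR when the extracted text has fewer characters than this",
    )
    parser.add_argument("--verbose", action="store_true", help="Verbose output")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(
    ["--pdf-dir", "/path/to/pdfs", "--output-dir", "/path/to/output"]
)
```

With only the two directory flags supplied, args.ocr_threshold falls back to 300, args.excel_path to None, and args.verbose to False, matching the defaults in the table.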

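📊 Extraction Pipeline

Step 1: PDF text extraction

import glob
import subprocess

# Prefer pdftotext: fast and accurate for vector (text-based) PDFs
result = subprocess.run(['pdftotext', pdf_path, '-'], capture_output=True, text=True)
text = result.stdout

# If the text is too short (fewer than ocr_threshold characters, 300 by default), fall back to OCR
if len(text) < ocr_threshold:
    # Render pages to 300 dpi PNGs; pdftoppm treats its last argument as an output
    # prefix and writes <img_prefix>-<page>.png, one file per page
    subprocess.run(['pdftoppm', '-png', '-r', '300', pdf_path, img_prefix], check=True)
    pages = []
    for img_path in sorted(glob.glob(f'{img_prefix}-*.png')):
        ocr_result = converter.convert(img_path)  # converter: assumed to be a Docling DocumentConverter
        pages.append(ocr_result.document.export_to_markdown())
    text = '\n\n'.join(pages)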

Step 2: Key field extraction

import re

# 1. Mold number (MODEL NO.): highest priority
model_match = re.search(r'MODEL\s+NO\.?\s*[:\|\s\n]*([A-Z]{2,}-\d+[A-Z]?)', text, re.IGNORECASE)

# 2. Package spec (Package No.): BJ0599-XXXX format
pkg_matches = re.findall(r'(BJ0599-\d{4})', text)

# 3. Customer item name (CUSTOMER ITEM)
ci_match = re.search(r'CUSTOMER ITEM\s*\n([A-Za-z0-9\-]+)', text, re.IGNORECASE)

# 4. Length (LENGTH): the unit group is non-capturing so findall returns plain strings
length_matches = re.findall(r'(\d{2,4})\s*\+\d*\s*-\d*\s*(?:mm)?', text)
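
To make the Step 2 patterns concrete, the sketch below applies them to a small invented text fragment. The sample text, the function name extract_fields, and the returned dictionary keys are all hypothetical; real drawings may lay these fields out differently. The length pattern here uses a non-capturing group for the unit so that re.findall returns plain strings rather than tuples.

```python
import re
from typing import Any, Dict

# Invented sample resembling text extracted from a drawing (not real data).
SAMPLE_TEXT = (
    "MODEL NO.: AB-1234C\n"
    "Package No. BJ0599-0012\n"
    "CUSTOMER ITEM\n"
    "X100-RED\n"
    "LENGTH 1500 +30 -0 mm\n"
)


def extract_fields(text: str) -> Dict[str, Any]:
    """Apply the four Step 2 patterns and collect the results (hypothetical helper)."""
    model_match = re.search(
        r'MODEL\s+NO\.?\s*[:\|\s\n]*([A-Z]{2,}-\d+[A-Z]?)', text, re.IGNORECASE
    )
    pkg_matches = re.findall(r'(BJ0599-\d{4})', text)
    ci_match = re.search(r'CUSTOMER ITEM\s*\n([A-Za-z0-9\-]+)', text, re.IGNORECASE)
    # Nominal length followed by a +/- tolerance, e.g. "1500 +30 -0 mm".
    length_matches = re.findall(r'(\d{2,4})\s*\+\d*\s*-\d*\s*(?:mm)?', text)
    return {
        "model_no": model_match.group(1) if model_match else None,
        "package_specs": pkg_matches,
        "customer_item": ci_match.group(1) if ci_match else None,
        "lengths": length_matches,
    }


fields = extract_fields(SAMPLE_TEXT)
```

For this sample, fields holds the mold number "AB-1234C", the package spec "BJ0599-0012", the customer item "X100-RED", and the single length "1500". Fields that do not appear in the text come back as None or an empty list instead of raising an error.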

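Step 3: Error exclusion rules

# ❌ Exclusion rule 1: BJ0599-XXXX is a package spec, not a mold number
if model_no and model_no.startswith('BJ0599-'):
    model_no = None  # extract again

# ❌ Exclusion rule 2: the customer item name is not a mold number
if model_no and customer_item == model_no:
    # The mold number may be inside an image, so re-extract it with OCR
    model_no = extract_with_ocr()

# ❌ Exclusion rule 3: a string that is too short is not a mold number
if model_no and len(model_no) < 5:
    model_no = None  # e.g. "TP" needs manual confirmation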
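
The three exclusion rules can also be folded into one small function that is easy to test in isolation. This is a sketch under assumptions, not the skill's actual code: validate_model_no and the two constants are hypothetical names, and for rule 2 the function simply returns None so the caller can decide whether to retry with OCR.

```python
from typing import Optional

PACKAGE_SPEC_PREFIX = "BJ0599-"  # package specs share this prefix
MIN_MODEL_NO_LENGTH = 5          # shorter candidates (e.g. "TP") are rejected


def validate_model_no(
    model_no: Optional[str], customer_item: Optional[str] = None
) -> Optional[str]:
    """Return model_no if it passes all three exclusion rules, otherwise None."""
    if not model_no:
        return None
    # Rule 1: a package spec is not a mold number.
    if model_no.startswith(PACKAGE_SPEC_PREFIX):
        return None
    # Rule 2: the customer item name is not a mold number.
    if customer_item is not None and model_no == customer_item:
        return None
    # Rule 3: a string that is too short is not a mold number.
    if len(model_no) < MIN_MODEL_NO_LENGTH:
        return None
    return model_no


accepted = validate_model_no("AB-1234C", customer_item="X100-RED")
rejected = validate_model_no("BJ0599-0012")
```

Returning None for every rejection keeps the rules order-independent and avoids calling string methods on a value that an earlier rule has already cleared.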

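Step 4: Knowledge base generation

# Markdown entry
product_md = f"""# {pdf_file} Product Category Entry

## Basic Information
- **Mold number (MODEL NO.):** {model_no}
- **Package spec:** {', '.join(package_specs)}
- **Customer item name:** {', '.join(customer_items)}
- **Length:** {', '.join(lengths)}

## Product Categories
(detailed product information...)
"""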

Metadata

Author: @cjboy007
Stars: 3562
Views: 0
Updated: 2026-03-29
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-cjboy007-ssa-pdf-catalog": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.

Related Skills

logistics

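Logistics management skill that provides bill of lading generation, customs declaration document generation, shipment tracking, and related features. Supports OKKI customer data synchronization and automated document processing.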

cjboy007 3562

okki-email-sync

Synchronize email activities and quotation events with OKKI CRM as follow-up trail records. Automatically matches emails to CRM customers via domain lookup and vector search, creates trail records (email type=102, quotation type=101), and deduplicates entries. Requires OKKI CRM API access and optional vector search setup. Use when you need to automatically log email communications and quotation events in your CRM.

cjboy007 3562

follow-up-engine

Automated customer follow-up scheduling and execution engine for B2B sales. Generates personalized follow-up email drafts based on customer stage, last contact date, and follow-up strategy. Integrates with CRM systems (configurable) to sync follow-up records. Use when you need to automate outbound sales follow-ups, schedule reminders, or generate follow-up email content for dormant leads.

cjboy007 3562

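报价单工作流 (Quotation Workflow)

Automatically generates quotations (Excel/Word/HTML/PDF), with built-in data validation that keeps sample data out of generated quotations. Supports OKKI CRM.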

cjboy007 3562

auto-evolution

Multi-agent auto-evolution system — orchestrate review-execute-audit loops with 4 roles (Coordinator, Reviewer, Executor, Auditor). A single coordinator agent drives the loop by spawning sub-agents for review, execution, and audit. Break goals into subtasks, auto-iterate with dual quality gates, and auto-package results. Use when: user wants autonomous task execution with built-in quality assurance.

cjboy007 3562