Official Verified

ocr2markdown

Document OCR and parsing — converts PDF/images to Markdown on remote L4 GPU via Modal. Trigger when user says: OCR, PDF to markdown, parse PDF, extract text from PDF, 文档识别, PDF转Markdown, 扫描件识别. Takes local PDF/image files and returns Markdown with layout, tables, formulas, and OCR preserved.

skill-install — Terminal

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ardentillumina/ocr2markdown

Download Source Code (.zip)

OCR2Markdown

MinerU document parsing pipeline on Modal L4 GPU — PDF/image → Markdown with layout, tables, formulas, and OCR preserved.

Pipeline code is bundled at ./src/ocr2markdown.py. After npx skills add, runs from any directory.

Workflow

1. Prepare slug and identify files

Slug = task identifier (volume directory name). Use user-provided value, or generate ocr_YYYYMMDD_HHMMSS if none given.

Directory input? Scan for PDF (.pdf), list with index, ask user to confirm selection.

Specific files? Use directly, no listing needed.

2. Upload to volume

Ensure both volumes exist (idempotent):

modal volume create speech2srt-data 2>/dev/null || true
modal volume create speech2srt-models 2>/dev/null || true

Upload each file:

modal volume put speech2srt-data <local_file> <slug>/upload/

Modal put auto-creates remote directories.

3. Run pipeline

modal run ./src/ocr2markdown.py --slug <slug>

Pipeline finds all .pdf files in <slug>/upload/ on the volume and processes them one by one.

Ctrl+C? Stop cleanly, report progress. Re-run with same slug — already-processed PDFs are skipped automatically.

4. Download results

Output directory structure:

<slug>/output/<stem>_ocr/
├── <stem>.md              # Main markdown output
├── images/                # Extracted images referenced by markdown
└── *.pdf, *.json          # Auxiliary outputs (layout, model data)

Download the entire output folder:

modal volume get speech2srt-data <slug>/output/ <local_destination>/

Preserve original directory tree.

5. Report

Done. Parsed N PDF(s) in Xs

Results:
  - <output_dir>/<stem>_ocr/<stem>.md
  - <output_dir>/<stem>_ocr/images/

Setup

Before first run, verify:

Python 3.9+ — python -V
Modal CLI — modal config show:
- token_id null → modal setup to authenticate
- command not found → pip install modal then modal setup
Modal free tier — L4 GPU is $0.80/hr. New accounts get $30 free credits/month (~37 hours of L4). No setup needed beyond authentication.

Performance

PDF Size	Pages	Time/PDF
~40-55 MB	varies	~110-130s each

Pipeline auto-skips PDFs with existing output.

Read Full Documentation on GitHub

Metadata

Author@ardentillumina

Stars4473

Updated2026-05-01

View Author Profile

AI Skill Finder

Not sure this is the right skill?

Describe what you want to build — we'll match you to the best skill from 16,000+ options.

Find the right skill

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ardentillumina-ocr2markdown": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Safety NoteClawKit audits metadata but not runtime behavior. Use with caution.

Related Skills

speech-transcribe

3x Faster than Whisper, Speech-to-text transcription with sentence-level timestamps on remote (FREE) L4 GPU. Trigger when user says: transcribe, speech to text, STT, speech recognition, 转录, 语音转文字. Takes local audio/video files and returns .txt (plain text) and .srt (subtitles).

ardentillumina 4473

speech-isolate

Vocal isolation / background music removal on remote (FREE) L4 GPU. Trigger when user says: isolate vocals, remove background music, extract voice, 提取人声, 去除背景音乐, vocal separation. Takes local audio/video files and returns isolated vocals.

ardentillumina 4473

speech-denoise

Speech enhancement / vocal denoising on remote (FREE) L4 GPU. Trigger when user says: denoise, remove noise, clean up audio, 去噪, 降噪, enhance audio. Takes local audio/video files and returns noise-reduced speech audio.

ardentillumina 4473