ClawKit Reliability Toolkit
Official Verified

vision-tagger

Tag and annotate images using Apple Vision framework (macOS only). Detects faces, bodies, hands, text (OCR), barcodes, objects, scene labels, and saliency regions. Use for image analysis, photo tagging, posture monitoring, or any task requiring computer vision on images.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/sagarjhaa/vision-tagger
Or install manually using the instructions below.

Vision Tagger

macOS-native image analysis using Apple's Vision framework. All processing is local — no cloud APIs, no API keys needed.

Requirements

  • macOS 12+ (Monterey or later)
  • Xcode Command Line Tools
  • Python 3 with Pillow

Setup (one-time)

# Install Xcode CLI tools if needed
xcode-select --install

# Install Pillow
pip3 install Pillow

# Compile the Swift binary
cd scripts/
swiftc -O -o image_tagger image_tagger.swift

Usage

Analyze image → JSON

./scripts/image_tagger /path/to/photo.jpg

Output includes:

  • faces — bounding boxes, roll/yaw/pitch, landmarks (eyes, nose, mouth)
  • bodies — 18 skeleton joints with confidence scores
  • hands — 21 joints per hand (left/right)
  • text — OCR results with bounding boxes
  • labels — scene classification (desk, outdoor, clothing, etc.)
  • barcodes — QR codes, UPC, etc.
  • saliency — attention and objectness regions
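As a sketch of working with this output, here is a hedged example that filters scene labels by confidence. The dictionary below is illustrative sample data matching the documented schema, not real tagger output:

```python
def confident_labels(result, threshold=0.5):
    """Return label names whose confidence meets the threshold."""
    return [entry["label"] for entry in result.get("labels", [])
            if entry["confidence"] >= threshold]

# Illustrative sample matching the schema documented above
result = {
    "labels": [
        {"label": "desk", "confidence": 0.85},
        {"label": "indoor", "confidence": 0.60},
        {"label": "keyboard", "confidence": 0.20},
    ],
}

print(confident_labels(result))  # ['desk', 'indoor']
```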

Annotate image with boxes

python3 scripts/annotate_image.py photo.jpg output.jpg

Draws colored boxes:

  • 🟢 Green: faces
  • 🟠 Orange: body skeleton
  • 🟣 Magenta: hands
  • 🔵 Cyan: text regions
  • 🟡 Yellow: rectangles/objects
  • Scene labels at bottom
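To annotate a whole folder, you can loop over images with `subprocess`. This is a sketch assuming `annotate_image.py` takes input and output paths exactly as shown above; the `dry_run` flag is a convenience added here for inspecting the commands without running them:

```python
import pathlib
import subprocess

def annotate_dir(src_dir, dst_dir, dry_run=False):
    """Run annotate_image.py over every .jpg in src_dir, writing to dst_dir.

    With dry_run=True, return the commands without executing them.
    """
    dst = pathlib.Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    commands = []
    for img in sorted(pathlib.Path(src_dir).glob("*.jpg")):
        cmd = ["python3", "scripts/annotate_image.py",
               str(img), str(dst / img.name)]
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)
    return commands
```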

Python integration

import subprocess, json

def analyze(path):
    # Run the compiled tagger; check=True surfaces failures as exceptions.
    r = subprocess.run(['./scripts/image_tagger', path],
                       capture_output=True, text=True, check=True)
    # Skip any log lines the binary prints before the JSON payload.
    return json.loads(r.stdout[r.stdout.find('{'):])

tags = analyze('photo.jpg')
print(tags['labels'])  # [{'label': 'desk', 'confidence': 0.85}, ...]
print(tags['faces'])   # [{'bbox': {...}, 'confidence': 0.99, 'yaw': 5.2}]
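For the posture-monitoring use case mentioned above, a rough check can compare shoulder heights from a `bodies` entry. This is a hedged sketch: joint names follow the example JSON in this README, and it assumes y increases upward (Vision's native convention); adjust if your binary emits different keys or a flipped axis:

```python
def shoulder_tilt(body, min_confidence=0.3):
    """Estimate shoulder tilt from one `bodies` entry.

    Returns the normalized vertical offset between shoulders (positive
    when the left shoulder sits higher), or None if either joint is
    missing or below min_confidence.
    """
    joints = body.get("joints", {})
    left = joints.get("left_shoulder")
    right = joints.get("right_shoulder")
    if not left or not right:
        return None
    if left["confidence"] < min_confidence or right["confidence"] < min_confidence:
        return None
    return left["y"] - right["y"]

body = {"joints": {"left_shoulder": {"x": 0.4, "y": 0.75, "confidence": 0.9},
                   "right_shoulder": {"x": 0.6, "y": 0.5, "confidence": 0.9}}}
print(shoulder_tilt(body))  # 0.25 (left shoulder higher)
```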

Example JSON Output

{
  "dimensions": {"width": 1920, "height": 1080},
  "faces": [{"bbox": {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}, "confidence": 0.99, "roll": -2, "yaw": 5}],
  "bodies": [{"joints": {"head_joint": {"x": 0.5, "y": 0.7, "confidence": 0.9}, "left_shoulder": {...}}, "confidence": 1}],
  "hands": [{"chirality": "left", "joints": {"VNHLKWRI": {"x": 0.4, "y": 0.3, "confidence": 0.85}}}],
  "text": [{"text": "HELLO", "confidence": 0.95, "bbox": {...}}],
  "labels": [{"label": "outdoor", "confidence": 0.88}, {"label": "sky", "confidence": 0.75}],
  "saliency": {"attentionBased": [{"x": 0.2, "y": 0.1, "width": 0.6, "height": 0.8}]}
}
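Note that the bounding boxes in this output are normalized to [0, 1], while `dimensions` gives the pixel size. A sketch for converting to pixel coordinates; whether the binary emits a top-left or Vision's native bottom-left origin depends on `image_tagger.swift`, so the helper takes a flag:

```python
def bbox_to_pixels(bbox, width, height, origin="top-left"):
    """Convert a normalized bbox to (left, top, right, bottom) in pixels.

    Vision reports normalized coordinates with a bottom-left origin;
    pass origin="bottom-left" if image_tagger emits them unconverted.
    """
    x, y, w, h = bbox["x"], bbox["y"], bbox["width"], bbox["height"]
    if origin == "bottom-left":
        y = 1.0 - y - h  # flip to top-left for PIL-style coordinates
    left, top = round(x * width), round(y * height)
    return left, top, left + round(w * width), top + round(h * height)

face = {"x": 0.3, "y": 0.4, "width": 0.15, "height": 0.2}
print(bbox_to_pixels(face, 1920, 1080))  # (576, 432, 864, 648)
```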

Detection Capabilities

  • Faces: bounding box, confidence, roll/yaw/pitch angles, 76-point landmarks
  • Bodies: 18 joints (head, neck, shoulders, elbows, wrists, hips, knees, ankles)
  • Hands: 21 joints per hand, left/right chirality
  • Text (OCR): recognized text with confidence and bounding boxes
  • Labels: 1000+ scene/object categories (clothing, furniture, outdoor, etc.)
  • Barcodes: QR, UPC, EAN, Code128, PDF417, Aztec, DataMatrix
  • Saliency: attention-based and objectness-based regions

Use Cases

  • Automated photo tagging and metadata generation
  • OCR: extracting text from screenshots and scans
  • Posture monitoring from body-pose joints
  • Barcode and QR code scanning

Metadata

  • Author: @sagarjhaa
  • Stars: 1133
  • Updated: 2026-02-18
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-sagarjhaa-vision-tagger": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.