Official Verified productivity Safety 3/5

screen-vision

macOS Local OCR & Automation Tool using Vision Framework. Zero token cost for screen understanding.

Why use this skill?

Screen-Vision adds a local macOS OCR tool to OpenClaw for screen reading and UI automation. Recognition runs on your machine, so reading the screen is fast and costs no tokens.

Install via CLI (Recommended)

clawhub install openclaw/skills/skills/ls18166407597-design/screen-vision

What This Skill Does

The screen-vision skill acts as the visual cortex for your OpenClaw agent. It uses the native macOS Vision framework to run Optical Character Recognition (OCR) directly on your local machine. Unlike cloud-based vision models, which require uploading screenshots to remote servers, this tool processes image data locally, so reading the screen costs no tokens and screenshots are not uploaded for recognition. It doesn't just read text: it also reports the [X, Y] coordinates of each piece of text it recognizes, which lets the AI locate button labels, status indicators, and input fields. Because it works from raw pixels rather than application internals, your agent can "see" and interact with any application, whether or not it provides an API or accessibility tree.
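
As a rough illustration of the coordinate mapping described above: the macOS Vision framework reports text bounding boxes in normalized coordinates (0 to 1) with the origin at the bottom-left of the image, while screen automation usually expects pixel coordinates with the origin at the top-left. The sketch below shows that conversion in Python. The NormalizedBox type and box_center_to_pixels function are hypothetical names for illustration; the listing does not document the skill's actual output format.

```python
from dataclasses import dataclass


@dataclass
class NormalizedBox:
    """Hypothetical Vision-style box: values in 0..1, origin at bottom-left."""
    x: float
    y: float
    width: float
    height: float


def box_center_to_pixels(box, image_width, image_height):
    """Return the (x, y) pixel center of a normalized box, top-left origin."""
    center_x = (box.x + box.width / 2) * image_width
    # Vision measures y upward from the bottom edge, so flip it for screen use.
    center_y = (1.0 - (box.y + box.height / 2)) * image_height
    return round(center_x), round(center_y)


if __name__ == "__main__":
    box = NormalizedBox(x=0.25, y=0.5, width=0.5, height=0.25)
    print(box_center_to_pixels(box, 1920, 1080))  # (960, 405)
```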

Installation

To install this skill, run the following command in your OpenClaw terminal: clawhub install openclaw/skills/skills/ls18166407597-design/screen-vision. After installation, grant the macOS permissions the tool needs. Open System Settings -> Privacy & Security -> Screen Recording and enable your terminal or the OpenClaw app so the tool can capture the screen. Then enable the same app under Privacy & Security -> Accessibility so the agent can send mouse clicks and keyboard input at the coordinates it retrieves.

Use Cases

This skill is well suited to automating legacy software, non-standard interfaces such as Electron-based desktop apps, and professional creative tools that lack automation interfaces. Use it to monitor real-time changes in dashboards, such as stock price movements, server health alerts, or progress bars in long-running renders. It is also useful for workflow orchestration, where the agent needs to move data from a non-copyable UI element into a spreadsheet or messaging app.

Example Prompts

"Scan the current window for the 'Export' button and click it once you find it."
"Monitor the progress bar in my rendering software and notify me in Slack when it reaches 100%."
"Read the balance displayed in my crypto desktop wallet and save it to a local text file."
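
To make the prompts above more concrete, here is a minimal Python sketch of the two lookups an agent might perform on OCR results: finding a label such as "Export" to get a click target, and reading a percentage from progress text. The observation format (a dict with text, x, y, width, and height in pixels) is an assumption for illustration only, and find_label and parse_progress are hypothetical helpers rather than part of the skill.

```python
import re


def find_label(observations, label):
    """Return the center (x, y) of the first exact, case-insensitive match."""
    wanted = label.strip().lower()
    for obs in observations:
        if obs["text"].strip().lower() == wanted:
            return (obs["x"] + obs["width"] // 2,
                    obs["y"] + obs["height"] // 2)
    return None  # label is not visible on screen


def parse_progress(observations):
    """Return the first percentage found in any recognized text, else None."""
    for obs in observations:
        match = re.search(r"(\d{1,3})\s*%", obs["text"])
        if match:
            return int(match.group(1))
    return None


if __name__ == "__main__":
    # Hypothetical OCR output for one screenshot.
    sample = [
        {"text": "Export", "x": 100, "y": 200, "width": 80, "height": 24},
        {"text": "Rendering 87%", "x": 40, "y": 500, "width": 160, "height": 20},
    ]
    print(find_label(sample, "export"))  # (140, 212)
    print(parse_progress(sample))        # 87
```

A monitoring loop would call parse_progress on each new screenshot and send its notification once the value reaches 100.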

Tips & Limitations

For best results, keep your display scaling at the default setting. Because this tool works from pixels in a screenshot, rapidly changing UI elements or animations may occasionally delay detection. Keep your desktop resolution constant while the agent is running automated tasks; changing it mid-run shifts the coordinates the agent has already read. The tool cannot "see" through overlapping windows, so keep the target application in the foreground during execution.
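
One common source of coordinate offsets is display scaling: on a Retina display a screenshot typically contains more pixels than the screen has points, while mouse events are usually addressed in points. The Python sketch below divides pixel coordinates by a scale factor to get point coordinates. The function name and the default factor of 2.0 are assumptions for illustration; the listing does not say whether the skill already performs this conversion.

```python
def pixels_to_points(x_px, y_px, scale_factor=2.0):
    """Convert screenshot pixel coordinates to screen points.

    scale_factor is the number of pixels per point (assumed 2.0 here,
    which is typical for Retina displays; use 1.0 for unscaled displays).
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    return x_px / scale_factor, y_px / scale_factor


if __name__ == "__main__":
    print(pixels_to_points(960, 405))       # (480.0, 202.5)
    print(pixels_to_points(960, 405, 1.0))  # (960.0, 405.0)
```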

Metadata

Author: @ls18166407597-design

Stars: 1601

Updated: 2026-02-27

Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-ls18166407597-design-screen-vision": {
      "enabled": true,
      "auto_update": true
    }
  }
}
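
If clawhub.json already lists other plugins, pasting the snippet by hand risks overwriting them. The Python sketch below merges the entry shown above into an existing configuration while leaving other plugin entries untouched. The plugin key and field names come from the snippet above; the enable_plugin and update_config_file helpers are hypothetical, and the file's location is left to the caller because the listing does not state it.

```python
import json
from pathlib import Path

PLUGIN_ID = "official-ls18166407597-design-screen-vision"


def enable_plugin(config, plugin_id=PLUGIN_ID):
    """Return a copy of config with the plugin enabled, keeping other plugins."""
    merged = dict(config)
    plugins = dict(merged.get("plugins", {}))
    plugins[plugin_id] = {"enabled": True, "auto_update": True}
    merged["plugins"] = plugins
    return merged


def update_config_file(path):
    """Read clawhub.json at the given path (if present), merge, and write back."""
    path = Path(path)
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(enable_plugin(config), indent=2) + "\n")


if __name__ == "__main__":
    existing = {"plugins": {"some-other-plugin": {"enabled": False}}}
    print(json.dumps(enable_plugin(existing), indent=2))
```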

Tags(AI)

#ocr #automation #macos #vision #screen-reading

Safety Score: 3/5

Flags: file-read, file-write, code-execution

Related Skills

tg-checkin

Generic Telegram Web automation for group check-ins. Supports multiple groups via configuration.

ls18166407597-design 1601

tg-history

Efficiently extract and read Telegram group chat history as text, bypassing screenshots/OCR for zero-token-waste.

ls18166407597-design 1601

model-usage

Use CodexBar CLI local cost usage to summarize per-model usage for Codex or Claude, including the current (most recent) model or a full model breakdown. Trigger when asked for model-level usage/cost data from codexbar, or when you need a scriptable per-model summary from codexbar cost JSON.

ls18166407597-design 1601