ClawKit Reliability Toolkit
Official Verified

visual-rpa

Visual RPA desktop automation skill. Use when the user asks to operate desktop apps: click icons, open applications, type text in input fields, click buttons, scroll pages, or send messages via WeChat or other apps. Uses screen capture and the Qwen vision model for pure visual positioning, with no DOM or accessibility APIs.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/neilhexiaoning-alt/visual-rpa-skill
Or copy the skill definition below:

Visual RPA Desktop Automation

Auto-execute all steps without waiting for user confirmation between steps.

Desktop automation via screen capture + Qwen vision model (Qwen-VL). No DOM or accessibility API needed.

How it works

  1. Capture screen -> thumbnail rough positioning
  2. Full-resolution crop -> precise coordinate refinement
  3. Execute mouse/keyboard action -> screenshot verification
  4. Compound instructions automatically decomposed into atomic steps
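Steps 1 and 2 above can be sketched as coordinate math: find a rough point on a downscaled thumbnail, map it back to full resolution, and crop a window around it for the precise second pass. This is an illustrative sketch only; the function names (`thumb_to_full`, `crop_box`) and the 960 px thumbnail width are assumptions, not the script's actual internals.

```python
def thumb_to_full(pt, screen_size, thumb_width=960):
    """Map a point found on the downscaled thumbnail back to
    full-resolution screen coordinates (stage 1 -> stage 2)."""
    scale = screen_size[0] / thumb_width
    return (round(pt[0] * scale), round(pt[1] * scale))

def crop_box(center, screen_size, half=200):
    """Clamp a crop window around the rough position so the
    precise second pass stays inside the screen bounds."""
    x, y = center
    w, h = screen_size
    left = max(0, min(x - half, w - 2 * half))
    top = max(0, min(y - half, h - 2 * half))
    return (left, top, left + 2 * half, top + 2 * half)
```

For a 1920x1080 screen and a 960 px thumbnail, a thumbnail hit at (480, 270) maps to (960, 540), and the crop window becomes (760, 340, 1160, 740).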

Usage

Use the exec tool to run the commands below. Script path: $env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py

Requires DASHSCOPE_API_KEY environment variable to be set.
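Since a missing key only surfaces once the script calls the model, a fail-fast check like the one below can help; `require_api_key` is a hypothetical helper, not part of visual_rpa.py.

```python
import os

def require_api_key(env="DASHSCOPE_API_KEY"):
    """Fail fast if the DashScope key is missing, instead of
    erroring later on the first model call."""
    key = os.environ.get(env)
    if not key:
        raise SystemExit(f"{env} is not set; export it before running visual_rpa.py")
    return key
```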

Single task

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click to open WeChat"

Compound task (auto-decomposed)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer chat, type hello in input box, click send"
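The decomposition of a compound task into atomic steps can be pictured as a splitter like the sketch below. The real skill presumably asks the model to decompose; splitting on commas is purely illustrative, and `decompose` is a hypothetical name.

```python
import re

def decompose(compound):
    """Naive sketch of compound-instruction decomposition:
    split on ASCII or Chinese commas into atomic steps."""
    return [s.strip() for s in re.split(r"[,\uff0c]", compound) if s.strip()]
```

For the command above, this would yield four atomic steps: "open WeChat", "open File Transfer chat", "type hello in input box", "click send".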

Multi-step task (manually specified)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type baidu.com in address bar and press enter" "type weather in search box" "click search button"

Skip verification (faster)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --no-verify --task "click to open Calculator"

Parameters

Parameter                 Description
--mode task               Batch task mode (required)
--mode interactive        Interactive mode (default)
--task "step1" "step2"    Task instructions, supports multiple
--no-verify               Skip post-action verification
--model MODEL             Vision model name (default: qwen-vl-max-latest)
--api-key KEY             API key (defaults to DASHSCOPE_API_KEY env var)

Supported actions

Action        Example instructions
Click         "click start menu", "click Chrome icon"
Double click  "double click Recycle Bin on desktop"
Right click   "right click on desktop blank area"
Type text     "type weather in search box", "type hello in input box"
Hotkey        "press Ctrl+C"
Scroll        "scroll down the page"
Wait          "wait for page to load"
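One way to picture how instructions map onto the actions above is a keyword classifier like the sketch below. This is an assumption for illustration: the actual skill likely lets the vision model pick the action, and `classify` with its pattern table is hypothetical. Note the more specific patterns ("double click", "right click") must be tried before the bare "click".

```python
import re

# Order matters: specific phrases before the generic "click" fallback.
ACTION_PATTERNS = [
    ("hotkey",       r"\bpress\b"),
    ("type",         r"\btype\b"),
    ("double_click", r"\bdouble click\b"),
    ("right_click",  r"\bright click\b"),
    ("scroll",       r"\bscroll\b"),
    ("wait",         r"\bwait\b"),
    ("click",        r"\bclick\b"),
]

def classify(instruction):
    """Map an English instruction to an action type from the table."""
    text = instruction.lower()
    for action, pattern in ACTION_PATTERNS:
        if re.search(pattern, text):
            return action
    return "click"  # default to a plain click when nothing matches
```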

Instruction tips

  • Be specific: "click WeChat icon on taskbar" works better than "open WeChat"
  • Instructions can be in Chinese or English; the model understands both
  • Complex operations can be written as compound instructions; the system auto-decomposes them
  • For text input, say "type XXX in YYY"; the system auto-detects it as an input action

Output format

  [OK] Step 0: click to open WeChat
       click @ (375,1591)
  [OK] Step 1: click File Transfer Assistant in WeChat
       click @ (154,97)
  [FAIL] Step 2: type hello in input box
       type @ (300,1364)
  2/3 succeeded
  • OK = action succeeded and verified
  • FAIL = action failed or verification failed; each step is retried automatically up to 3 times
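The act-verify-retry loop implied by this output can be sketched as below; `run_step` and its callbacks are stand-ins for the real action and screenshot-verification calls, not the script's actual API.

```python
def run_step(do_action, verify, retries=3):
    """Execute an action, verify via screenshot, and retry
    up to `retries` times before giving up."""
    for attempt in range(retries):
        do_action()
        if verify():
            return True   # reported as [OK]
    return False          # reported as [FAIL] after all retries
```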

Metadata

Stars: 1335
Views: 1
Updated: 2026-02-23
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-neilhexiaoning-alt-visual-rpa-skill": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Safety Note: ClawKit audits metadata but not runtime behavior. Use with caution.