Visual RPA Desktop Automation

Auto-execute all steps without waiting for user confirmation between steps.

Automates the desktop using screen captures and a Qwen vision model (Qwen-VL). No DOM or accessibility API is needed.

How it works

Capture the screen -> locate the target roughly on a thumbnail
Crop the full-resolution image around that location -> refine to precise coordinates
Execute the mouse/keyboard action -> verify the result with a new screenshot
Compound instructions are automatically decomposed into atomic steps
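
The coarse-to-fine positioning described above comes down to coordinate arithmetic, sketched below. This is only an illustration, not the script's actual code: the function names, the 512-pixel crop size, and the assumption that the model returns pixel coordinates are all hypothetical.

```python
def thumb_to_full(pt, thumb_size, full_size):
    """Scale a point found on the thumbnail up to full-resolution pixels."""
    tx, ty = pt
    tw, th = thumb_size
    fw, fh = full_size
    return round(tx * fw / tw), round(ty * fh / th)


def crop_box(center, full_size, crop=512):
    """Return (left, top, right, bottom) of a crop around center, clamped to the screen."""
    cx, cy = center
    fw, fh = full_size
    w, h = min(crop, fw), min(crop, fh)  # crop size is an assumed value
    left = min(max(cx - w // 2, 0), fw - w)
    top = min(max(cy - h // 2, 0), fh - h)
    return left, top, left + w, top + h


def crop_to_full(pt, box):
    """Translate a point located inside the crop back to screen coordinates."""
    return box[0] + pt[0], box[1] + pt[1]


# Example: a 2560x1600 screen and a 640x400 thumbnail (assumed sizes).
rough = thumb_to_full((94, 398), (640, 400), (2560, 1600))  # (376, 1592)
box = crop_box(rough, (2560, 1600))                         # (120, 1088, 632, 1600)
precise = crop_to_full((255, 503), box)                     # (375, 1591)
```

The rough estimate only has to land near the target. The second pass works on unscaled pixels, so the final click position is not limited by the thumbnail's resolution.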

Usage

Use the exec tool to run commands. Script path: $env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py

The DASHSCOPE_API_KEY environment variable must be set.

Single task

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click to open WeChat"

Compound task (auto-decomposed)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "open WeChat, open File Transfer chat, type hello in input box, click send"

Multi-step task (manually specified)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --task "click Chrome browser" "type baidu.com in address bar and press enter" "type weather in search box" "click search button"

Skip verification (faster)

python "$env:TAXBOT_ROOT/skills/visual-rpa/scripts/visual_rpa.py" --mode task --no-verify --task "click to open Calculator"
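
When the commands above are launched from another Python program rather than typed in a shell, the argument list can be assembled as in the sketch below. It uses only the flags shown in this document. The helper names `build_command` and `missing_env` are hypothetical and not part of the skill.

```python
import os
import sys
from pathlib import Path


def build_command(steps, root, verify=True, python=sys.executable):
    """Build the argv list for a batch run; each step becomes one --task value."""
    if not steps:
        raise ValueError("at least one step is required")
    script = Path(root) / "skills" / "visual-rpa" / "scripts" / "visual_rpa.py"
    cmd = [python, str(script), "--mode", "task"]
    if not verify:
        cmd.append("--no-verify")  # skip post-action verification
    cmd += ["--task", *steps]
    return cmd


def missing_env(env=os.environ):
    """Return the names of required environment variables that are unset or empty."""
    required = ("TAXBOT_ROOT", "DASHSCOPE_API_KEY")
    return [name for name in required if not env.get(name)]
```

A caller could first check that `missing_env()` returns an empty list, then pass `build_command([...], os.environ["TAXBOT_ROOT"])` to `subprocess.run`. Passing the steps as a list avoids shell quoting problems when an instruction contains spaces or quotes.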

Parameters

Parameter	Description
`--mode task`	Batch task mode (required when passing instructions with `--task`)
`--mode interactive`	Interactive mode (used when `--mode` is omitted)
`--task "step1" "step2"`	Task instructions; accepts one or more quoted steps
`--no-verify`	Skip post-action verification
`--model MODEL`	Vision model name (default: qwen-vl-max-latest)
`--api-key KEY`	API key (defaults to the DASHSCOPE_API_KEY environment variable)
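
For readers who want to see how these parameters fit together, the table can be mirrored with `argparse` roughly as follows. This is a guess at an equivalent interface built from the table alone, not the script's real parser. `make_parser` and `resolve_api_key` are hypothetical names.

```python
import argparse
import os


def make_parser():
    """Build a parser that mirrors the documented flags and defaults."""
    p = argparse.ArgumentParser(prog="visual_rpa.py")
    p.add_argument("--mode", choices=["task", "interactive"], default="interactive")
    p.add_argument("--task", nargs="+", default=[], metavar="STEP",
                   help="one or more task instructions")
    p.add_argument("--no-verify", action="store_true",
                   help="skip post-action verification")
    p.add_argument("--model", default="qwen-vl-max-latest")
    p.add_argument("--api-key", default=None)
    return p


def resolve_api_key(args, env=os.environ):
    """Prefer --api-key; otherwise fall back to DASHSCOPE_API_KEY."""
    return args.api_key or env.get("DASHSCOPE_API_KEY")


args = make_parser().parse_args(
    ["--mode", "task", "--no-verify", "--task", "click Chrome browser", "click search button"]
)
```

With `nargs="+"`, every quoted string after `--task` is collected as a separate step, which matches the multi-step usage shown earlier.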

Supported actions

Action	Example instructions
Click	"click start menu", "click Chrome icon"
Double click	"double click Recycle Bin on desktop"
Right click	"right click on desktop blank area"
Type text	"type weather in search box", "type hello in input box"
Hotkey	"press Ctrl+C"
Scroll	"scroll down the page"
Wait	"wait for page to load"

Instruction tips

Be specific: "click WeChat icon on taskbar" is better than "open WeChat"
Instructions can be in Chinese or English; the model understands both
Complex operations can be written as compound instructions, which the system decomposes automatically
For text input, say "type XXX in YYY"; the system detects this as an input action
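
To make the "type XXX in YYY" convention concrete, the sketch below splits such an instruction into the text to enter and the target field. It is only a regular-expression illustration of the phrasing. How the system actually detects input actions is not documented here, and `parse_type_instruction` is a hypothetical name.

```python
import re

# "type <text> in <target>", case-insensitive; "into" is also accepted.
TYPE_PATTERN = re.compile(
    r"^\s*type\s+(?P<text>.+?)\s+in(?:to)?\s+(?P<target>.+?)\s*$",
    re.IGNORECASE,
)


def parse_type_instruction(instruction):
    """Return the text and target of a typing instruction, or None if it does not match."""
    m = TYPE_PATTERN.match(instruction)
    if m is None:
        return None
    return {"action": "type", "text": m.group("text"), "target": m.group("target")}


parsed = parse_type_instruction("type hello in input box")
# parsed == {"action": "type", "text": "hello", "target": "input box"}
```

Instructions that do not follow the pattern, such as "click start menu", return `None` in this sketch.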

Output format

  [OK] Step 0: click to open WeChat
       click @ (375,1591)
  [OK] Step 1: click File Transfer Assistant in WeChat
       click @ (154,97)
  [FAIL] Step 2: type hello in input box
       type @ (300,1364)
  2/3 succeeded

OK = the action succeeded and was verified
FAIL = the action or its verification failed; failed steps are retried automatically up to 3 times
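
If the report needs to be consumed by another program, it can be parsed line by line as sketched below. The patterns are inferred solely from the sample output above, so the real output may contain lines this parser ignores. `parse_report` is a hypothetical helper, not part of the skill.

```python
import re

STEP_RE = re.compile(r"^\s*\[(OK|FAIL)\]\s+Step\s+(\d+):\s*(.*)$")
ACTION_RE = re.compile(r"^\s*(\w+)\s*@\s*\((\d+),\s*(\d+)\)\s*$")
SUMMARY_RE = re.compile(r"^\s*(\d+)/(\d+)\s+succeeded\s*$")


def parse_report(text):
    """Return (steps, summary); summary is (succeeded, total) or None if absent."""
    steps, summary = [], None
    for line in text.splitlines():
        if m := STEP_RE.match(line):
            steps.append({
                "ok": m.group(1) == "OK",
                "index": int(m.group(2)),
                "instruction": m.group(3),
                "action": None,
                "pos": None,
            })
        elif (m := ACTION_RE.match(line)) and steps:
            # An action line belongs to the most recent step line.
            steps[-1]["action"] = m.group(1)
            steps[-1]["pos"] = (int(m.group(2)), int(m.group(3)))
        elif m := SUMMARY_RE.match(line):
            summary = (int(m.group(1)), int(m.group(2)))
    return steps, summary


sample = """\
  [OK] Step 0: click to open WeChat
       click @ (375,1591)
  [FAIL] Step 1: type hello in input box
       type @ (300,1364)
  1/2 succeeded
"""
steps, summary = parse_report(sample)
failed = [s["instruction"] for s in steps if not s["ok"]]
```

Here `failed` holds the instructions that were still failing when the report was printed, which a caller could rephrase more specifically and resubmit.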

visual-rpa

Install via CLI (Recommended)