Android Device Automation
Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, screenshots, and more. Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen. Powered by Midscene.js (https://midscenejs.com).
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/quanru/midscene-android-automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
Automate Android devices using npx @midscene/android@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
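The rules above amount to a simple synchronous loop. As a minimal sketch (the run_midscene helper and the 180-second timeout are illustrative assumptions, not part of the Midscene CLI):

```python
import subprocess

def run_midscene(args, runner=subprocess.run, timeout=180):
    """Run exactly one midscene CLI command and block until it finishes.

    Blocking is the point: the output (for example, a screenshot path)
    must be read before deciding the next action, so commands never run
    in the background or in parallel. The generous timeout accounts for
    AI inference, which can exceed a minute for complex act commands.
    """
    result = runner(
        ["npx", "@midscene/android@1", *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

The runner parameter exists only so the sketch can be exercised without a connected device; in real use, the default subprocess.run is what enforces the one-command-at-a-time rule.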
Prerequisites
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen3-VL
MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
MIDSCENE_MODEL_NAME="qwen/qwen3-vl-235b-a22b-instruct"
MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
MIDSCENE_MODEL_FAMILY="qwen3-vl"
Example: Doubao Seed 1.6
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-1-6-250615"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-vision"
Commonly used models: Doubao Seed 1.6, Qwen3-VL, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
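A quick way to detect the unconfigured case is to check for the variables listed above. A minimal sketch (the helper name is invented; note it only inspects the process environment, while Midscene can also load variables from a .env file):

```python
import os

REQUIRED_VARS = [
    "MIDSCENE_MODEL_API_KEY",
    "MIDSCENE_MODEL_NAME",
    "MIDSCENE_MODEL_BASE_URL",
    "MIDSCENE_MODEL_FAMILY",
]

def missing_midscene_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If the returned list is non-empty (and no .env file is present), ask the user to configure the model rather than proceeding.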
Commands
Connect to Device
npx @midscene/android@1 connect
npx @midscene/android@1 connect --deviceId emulator-5554
Take Screenshot
npx @midscene/android@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Perform Action
Act commands take natural-language instructions and perform taps, swipes, text input, and app launches on the device. Run one action at a time and allow extra time for complex instructions.
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-quanru-midscene-android-automation": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Vision-driven browser automation using Midscene Bridge mode. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. This mode connects to the user's desktop Chrome browser via the Midscene Chrome Extension, preserving cookies, sessions, and login state. Use this skill when the user wants to: - Browse, navigate, or open web pages in the user's own Chrome browser - Interact with pages that require login sessions, cookies, or existing browser state - Scrape, extract, or collect data from websites using the user's real browser - Fill out forms, click buttons, or interact with web elements - Verify, validate, or test frontend UI behavior - Take screenshots of web pages - Automate multi-step web workflows - Check website content or appearance Powered by Midscene.js (https://midscenejs.com)