Android Device Automation
Vision-driven Android device automation using Midscene. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. Control Android devices with natural language commands via ADB. Perform taps, swipes, text input, app launches, screenshots, and more. Trigger keywords: android, phone, mobile app, tap, swipe, install app, open app on phone, android device, mobile automation, adb, launch app, mobile screen. Powered by Midscene.js (https://midscenejs.com).
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/quanru/midscene-android-automation
CRITICAL RULES — VIOLATIONS WILL BREAK THE WORKFLOW:
- Never run midscene commands in the background. Each command must run synchronously so you can read its output (especially screenshots) before deciding the next action. Background execution breaks the screenshot-analyze-act loop.
- Run only one midscene command at a time. Wait for the previous command to finish, read the screenshot, then decide the next action. Never chain multiple commands together.
- Allow enough time for each command to complete. Midscene commands involve AI inference and screen interaction, which can take longer than typical shell commands. A typical command needs about 1 minute; complex act commands may need even longer.
Automate Android devices using npx @midscene/android@1. Each CLI command maps directly to an MCP tool — you (the AI agent) act as the brain, deciding which actions to take based on screenshots.
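The rules above amount to a simple synchronous loop. As a minimal sketch (the run_midscene helper and the 180-second timeout are illustrative assumptions, not part of the Midscene CLI):

```python
import subprocess

def run_midscene(args, runner=subprocess.run, timeout=180):
    """Run exactly one midscene CLI command and block until it finishes.

    Blocking is the point: the output (for example, a screenshot path)
    must be read before deciding the next action, so commands never run
    in the background or in parallel. The generous timeout accounts for
    AI inference, which can exceed a minute for complex act commands.
    """
    result = runner(
        ["npx", "@midscene/android@1", *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

The runner parameter exists only so the sketch can be exercised without a connected device; in real use, the default subprocess.run is what enforces the one-command-at-a-time rule.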
Prerequisites
Midscene requires models with strong visual grounding capabilities. The following environment variables must be configured — either as system environment variables or in a .env file in the current working directory (Midscene loads .env automatically):
MIDSCENE_MODEL_API_KEY="your-api-key"
MIDSCENE_MODEL_NAME="model-name"
MIDSCENE_MODEL_BASE_URL="https://..."
MIDSCENE_MODEL_FAMILY="family-identifier"
Example: Gemini (Gemini-3-Flash)
MIDSCENE_MODEL_API_KEY="your-google-api-key"
MIDSCENE_MODEL_NAME="gemini-3-flash"
MIDSCENE_MODEL_BASE_URL="https://generativelanguage.googleapis.com/v1beta/openai/"
MIDSCENE_MODEL_FAMILY="gemini"
Example: Qwen3-VL
MIDSCENE_MODEL_API_KEY="your-openrouter-api-key"
MIDSCENE_MODEL_NAME="qwen/qwen3-vl-235b-a22b-instruct"
MIDSCENE_MODEL_BASE_URL="https://openrouter.ai/api/v1"
MIDSCENE_MODEL_FAMILY="qwen3-vl"
Example: Doubao Seed 1.6
MIDSCENE_MODEL_API_KEY="your-doubao-api-key"
MIDSCENE_MODEL_NAME="doubao-seed-1-6-250615"
MIDSCENE_MODEL_BASE_URL="https://ark.cn-beijing.volces.com/api/v3"
MIDSCENE_MODEL_FAMILY="doubao-vision"
Commonly used models: Doubao Seed 1.6, Qwen3-VL, Zhipu GLM-4.6V, Gemini-3-Pro, Gemini-3-Flash.
If the model is not configured, ask the user to set it up. See Model Configuration for supported providers.
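A quick way to detect the unconfigured case is to check for the variables listed above. A minimal sketch (the helper name is invented; note it only inspects the process environment, while Midscene can also load variables from a .env file):

```python
import os

REQUIRED_VARS = [
    "MIDSCENE_MODEL_API_KEY",
    "MIDSCENE_MODEL_NAME",
    "MIDSCENE_MODEL_BASE_URL",
    "MIDSCENE_MODEL_FAMILY",
]

def missing_midscene_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

If the returned list is non-empty (and no .env file is present), ask the user to configure the model rather than proceeding.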
Commands
Connect to Device
npx @midscene/android@1 connect
npx @midscene/android@1 connect --deviceId emulator-5554
Take Screenshot
npx @midscene/android@1 take_screenshot
After taking a screenshot, read the saved image file to understand the current screen state before deciding the next action.
Perform Action
Act commands take natural-language instructions and perform taps, swipes, text input, and app launches on the device. Run one action at a time and allow extra time for complex instructions.
Paste this into your clawhub.json to enable this plugin.
{
  "plugins": {
    "official-quanru-midscene-android-automation": {
      "enabled": true,
      "auto_update": true
    }
  }
}
Vision-driven browser automation using Midscene Bridge mode. Operates entirely from screenshots — no DOM or accessibility labels required. Can interact with all visible elements on screen regardless of technology stack. This mode connects to the user's desktop Chrome browser via the Midscene Chrome Extension, preserving cookies, sessions, and login state. Use this skill when the user wants to: - Browse, navigate, or open web pages in the user's own Chrome browser - Interact with pages that require login sessions, cookies, or existing browser state - Scrape, extract, or collect data from websites using the user's real browser - Fill out forms, click buttons, or interact with web elements - Verify, validate, or test frontend UI behavior - Take screenshots of web pages - Automate multi-step web workflows - Check website content or appearance Powered by Midscene.js (https://midscenejs.com)