surfagent-perception
Agent vision for web pages — scene summaries, attention-ranked elements, annotated screenshots, and state diffing via SurfAgent's perception engine.
Install via CLI (Recommended)
clawhub install openclaw/skills/skills/agentossoftware/surfagent-perception

SurfAgent Perception — Agent Vision Skill
How to see, understand, and verify web pages through SurfAgent's perception engine.
What This Is
SurfAgent Perception gives you human-like page understanding in ~200 tokens instead of parsing a 50K-token DOM. Three MCP tools, one workflow loop.
Without perception: You get a raw DOM dump or an unannotated screenshot, and you have to work out what is on the page yourself.
With perception: You get a scene summary, ranked interactive elements, spatial clusters, viewport state, and optionally an annotated screenshot with numbered bounding boxes + a legend mapping each number to a ref.
Requires: SurfAgent daemon running (port 7201) with a managed Chrome instance (port 9222).
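Before calling any of the tools, it can help to confirm that both prerequisites are reachable. The following is a minimal sketch using only the Python standard library. The port numbers come from the requirement above; the host `127.0.0.1` and the helper names `port_open` and `check_prerequisites` are assumptions made for illustration, not part of SurfAgent.

```python
import socket

# Ports taken from the requirement above. The host is an assumption:
# this sketch presumes the daemon and Chrome run on the local machine.
DAEMON_PORT = 7201
CHROME_PORT = 9222


def port_open(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_prerequisites(host: str = "127.0.0.1") -> dict[str, bool]:
    """Probe both ports and report which prerequisites are reachable."""
    return {
        "surfagent_daemon": port_open(DAEMON_PORT, host),
        "managed_chrome": port_open(CHROME_PORT, host),
    }
```

Calling `check_prerequisites()` returns a dictionary with one boolean per prerequisite. An open port only shows that something is listening there, not that it is SurfAgent or the managed Chrome instance.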
The Three Tools
surf_perceive — Your Primary Eyes
The main tool. Call this to understand what's on screen.
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| `tabId` | string | active tab | Target a specific tab |
| `since` | string | — | State token from a previous call. Includes delta of what changed |
| `maxAnnotations` | number | 15 | How many elements to rank (1-50) |
| `annotate` | boolean | false | Include annotated screenshot with numbered bounding boxes |
Returns:
- Scene summary — one-liner + top 5 ranked actions + state notes (blockers, modals, forms, auth, scroll %)
- Viewport info — scroll position, fold split, document dimensions
- Top elements — attention-ranked, with refs for clicking/typing
- Clusters — semantic groups of related elements (nav cluster, form cluster, etc.)
- State token — pass as `since` next time to get delta
- Annotated screenshot + legend (if `annotate: true`)
When to use: Start of any page interaction. After navigation. When you need to understand the page before acting.
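As a rough illustration of the parameters above, here is a sketch that builds the JSON-RPC `tools/call` request an MCP client would send for `surf_perceive`. The envelope follows the general MCP convention. The function name `build_perceive_call` is hypothetical, and how the request reaches SurfAgent (transport, session setup) is not covered by this document and is left out.

```python
import json
from typing import Any, Optional


def build_perceive_call(
    request_id: int,
    tab_id: Optional[str] = None,
    since: Optional[str] = None,
    max_annotations: int = 15,
    annotate: bool = False,
) -> dict[str, Any]:
    """Build an MCP tools/call request for surf_perceive (hypothetical helper)."""
    # The parameter table gives 1-50 as the valid range for maxAnnotations.
    if not 1 <= max_annotations <= 50:
        raise ValueError("maxAnnotations must be between 1 and 50")

    arguments: dict[str, Any] = {
        "maxAnnotations": max_annotations,
        "annotate": annotate,
    }
    # Optional parameters are omitted so the documented defaults apply.
    if tab_id is not None:
        arguments["tabId"] = tab_id
    if since is not None:
        arguments["since"] = since

    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "surf_perceive", "arguments": arguments},
    }


# Example: first look at a page, asking for an annotated screenshot.
request = build_perceive_call(request_id=1, annotate=True)
payload = json.dumps(request)
```

Leaving `tabId` and `since` out of the arguments, rather than sending nulls, keeps the request aligned with the defaults in the table (active tab, no delta).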
surf_annotate — Quick Visual Reference
Lighter than surf_perceive. Just the annotated screenshot + legend, no scene analysis.
Parameters:
| Param | Type | Default | Description |
|---|---|---|---|
| `tabId` | string | active tab | Target a specific tab |
| `maxAnnotations` | number | 15 | How many elements to annotate (1-50) |
Returns:
- Annotated screenshot (base64 PNG) with numbered colored bounding boxes
- Legend mapping each number to element ref, role, location, and status
When to use: When you already know the page context but need to identify specific elements visually. Good for "which button do I click?" scenarios.
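To make the screenshot and legend concrete, the sketch below decodes a base64 PNG and resolves a box number to its element ref. The exact response shape is not given in this document, so the legend layout used here (a list of entries with `number` and `ref` keys) is an assumption, as are the helper names.

```python
import base64
import binascii

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_screenshot(b64_png: str) -> bytes:
    """Decode a base64 string and check that the result looks like a PNG."""
    try:
        data = base64.b64decode(b64_png, validate=True)
    except binascii.Error as exc:
        raise ValueError("screenshot is not valid base64") from exc
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG")
    return data


def ref_for_number(legend: list[dict], number: int) -> str:
    """Map a bounding-box number to its element ref.

    Assumes each legend entry is a dict with 'number' and 'ref' keys;
    the real field names may differ.
    """
    for entry in legend:
        if entry["number"] == number:
            return entry["ref"]
    raise KeyError(f"no legend entry numbered {number}")
```

The returned ref is what you would hand to a click or type action, so a lookup that fails loudly on an unknown number is safer than one that returns a default.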
surf_scene_diff — What Changed?
Compare current state to a previous state token. Answers: "did my action work?"
Parameters:
| Param | Type | Required | Description |
|---|---|---|---|
| `since` | string | ✅ | State token from a previous `surf_perceive` call |
| `tabId` | string | — | Target a specific tab |
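The three tools fit a perceive, act, verify loop: capture a state token, perform an action, then ask what changed. The sketch below shows that loop with the tool transport injected as a plain callable, so it makes no claim about how SurfAgent is actually reached. The `stateToken` response field and the function names are hypothetical; check the real response for the actual key.

```python
from typing import Any, Callable

# A callable that takes (tool_name, arguments) and returns the tool result.
# How it talks to SurfAgent is up to the caller; nothing here assumes a transport.
ToolCaller = Callable[[str, dict[str, Any]], dict[str, Any]]


def act_and_verify(
    call_tool: ToolCaller,
    act: Callable[[dict[str, Any]], None],
) -> dict[str, Any]:
    """Run one perceive -> act -> diff cycle and return the diff result."""
    # 1. Perceive: understand the page and capture a state token.
    scene = call_tool("surf_perceive", {})
    token = scene["stateToken"]  # hypothetical field name

    # 2. Act: the caller decides what to do based on the scene.
    act(scene)

    # 3. Verify: compare the current state against the saved token.
    return call_tool("surf_scene_diff", {"since": token})
```

Passing `call_tool` in as a parameter also makes the loop easy to exercise with a stub before pointing it at a live browser.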
Paste this into your clawhub.json to enable this plugin.
{
"plugins": {
"official-agentossoftware-surfagent-perception": {
"enabled": true,
"auto_update": true
}
}
}