ClawKit Reliability Toolkit
Official · Verified · AI Models

computer-vision-expert

SOTA Computer Vision Expert (2026). Specialized in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis.

Why use this skill?

Master SOTA computer vision with the computer-vision-expert skill. Deploy YOLO26, SAM 3, and VLM pipelines for real-time detection and 3D spatial intelligence.


Install via CLI (Recommended)

clawhub install openclaw/skills/skills/zorrong/computer-vision-expert
Or add the plugin to your clawhub.json manually (see "Add to Configuration" below).

What This Skill Does

The computer-vision-expert skill provides OpenClaw agents with state-of-the-art visual perception capabilities built on the 2026 industry-standard model suite. It serves as an architectural bridge between high-performance real-time detection (YOLO26) and semantic visual reasoning (SAM 3 and vision-language models). With it, an agent can perform complex computer vision tasks such as zero-shot segmentation, 3D scene reconstruction, and conversational image analysis without extensive custom model training. By relying on NMS-free detection architectures and vision-language integration, the skill simplifies deploying industrial-grade visual pipelines on edge devices, letting autonomous systems see and understand their environment with high accuracy.
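To make the "NMS-free" claim concrete, here is a minimal sketch of classic IoU-based non-maximum suppression, the post-processing step that NMS-free detectors (as described for YOLO26) eliminate. The function names and thresholds are illustrative, not part of any skill API.

```python
# Sketch: classic greedy NMS. NMS-free detectors skip this entire step,
# which removes a latency- and tuning-sensitive stage from the pipeline.

def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one object, plus one distinct object:
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # -> [0, 2]
```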

Installation

To integrate this skill into your agent, run the following command in your terminal: clawhub install openclaw/skills/skills/zorrong/computer-vision-expert

Use Cases

  • Autonomous Industrial Inspection: Deploy YOLO26 to rapidly detect defects on a conveyor belt, then run SAM 3 to generate precise masks for automated measurement.
  • Spatial Awareness & Robotics: Combine Depth Anything V2 with Visual SLAM for real-time obstacle avoidance and navigation in unknown indoor environments.
  • Automated Data Extraction: Use Qwen2-VL or PaliGemma 2 for visual question answering over complex engineering schematics or logistics manifests, converting visual data into structured JSON.
  • Edge Optimization: Move from heavy, research-focused models to efficient ONNX-exported architectures optimized for TensorRT and NPU deployment.
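The detect-then-segment handoff in the inspection use case can be sketched as plain box bookkeeping: detector boxes in center (cx, cy, w, h) format are converted to the (x1, y1, x2, y2) corner boxes that SAM-style promptable segmenters typically accept, clipped to the image bounds. The formats and names here are assumptions for illustration, not a fixed API of this skill.

```python
# Hypothetical glue between a YOLO-style detector and a promptable
# segmenter: convert center-format boxes to clipped corner-format prompts.

def to_box_prompts(detections, img_w, img_h):
    """Map (cx, cy, w, h) detections to (x1, y1, x2, y2) box prompts."""
    prompts = []
    for cx, cy, w, h in detections:
        x1 = max(0.0, cx - w / 2)
        y1 = max(0.0, cy - h / 2)
        x2 = min(float(img_w), cx + w / 2)
        y2 = min(float(img_h), cy + h / 2)
        prompts.append((x1, y1, x2, y2))
    return prompts

# A defect detected near the left edge gets clipped to the frame:
print(to_box_prompts([(10, 50, 40, 40)], 640, 480))
# -> [(0.0, 30.0, 30.0, 70.0)]
```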

Example Prompts

  1. "Analyze this live camera feed and identify all safety hazards; output the coordinates and a bounding box for each object detected by YOLO26."
  2. "Use SAM 3 to create a precise mask of the blue container in the foreground, then perform a 3D reconstruction of the area within that segment."
  3. "Look at these architectural blueprints and answer the following: How many structural load-bearing points are clearly visible and what is their approximate spatial relation to the main entry?"
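The structured output requested in prompt 1 might look like the sketch below: raw detections (label, confidence, box) packed into JSON a downstream agent can consume. The field names and schema are illustrative assumptions, not an output format guaranteed by the skill.

```python
# Hypothetical sketch: serialize detections into structured JSON.
import json

def detections_to_json(detections):
    """Pack (label, confidence, (x1, y1, x2, y2)) tuples into a JSON report."""
    records = [
        {"label": label, "confidence": round(conf, 3),
         "bbox": {"x1": x1, "y1": y1, "x2": x2, "y2": y2}}
        for label, conf, (x1, y1, x2, y2) in detections
    ]
    return json.dumps({"detections": records}, indent=2)

report = detections_to_json([("missing_guard_rail", 0.912, (34, 80, 210, 260))])
print(report)
```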

Tips & Limitations

  • Performance Strategy: Always pair YOLO26 with SAM 3; use the former for broad-sweep candidate detection and the latter for pixel-perfect refinement to maintain high FPS.
  • Resource Management: While YOLO26 is optimized for edge devices, SAM 3 and advanced VLMs can be compute-intensive. Ensure your local hardware or cloud inference endpoint supports CUDA acceleration for optimal latency.
  • Calibration: For accurate 3D reconstruction, ensure your camera extrinsic and intrinsic matrices are calibrated using the provided Sub-pixel Calibration utility before attempting depth estimation tasks.
  • Data Privacy: Be mindful of local regulations when using visual grounding models on personally identifiable information.
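The calibration tip above matters because 3D reconstruction hinges on the pinhole camera model: a pixel plus a metric depth value maps to a 3D camera-space point only if the intrinsics (fx, fy, cx, cy) are accurate. A minimal sketch, assuming standard pinhole geometry (the parameter values below are placeholders, not outputs of the skill's calibration utility):

```python
# Sketch: unproject a pixel (u, v) with metric depth into camera space
# using calibrated intrinsics. Miscalibrated fx/fy/cx/cy skew all points.

def unproject(u, v, depth, fx, fy, cx, cy):
    """Map pixel (u, v) at `depth` metres to a 3D point (X, Y, Z)."""
    X = (u - cx) * depth / fx
    Y = (v - cy) * depth / fy
    return (X, Y, depth)

# The principal point maps straight down the optical axis:
print(unproject(320, 240, 2.0, fx=600.0, fy=600.0, cx=320.0, cy=240.0))
# -> (0.0, 0.0, 2.0)
```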

Metadata

Author: @zorrong
Stars: 879
Views: 1
Updated: 2026-02-11
Add to Configuration

Paste this into your clawhub.json to enable this plugin.

{
  "plugins": {
    "official-zorrong-computer-vision-expert": {
      "enabled": true,
      "auto_update": true
    }
  }
}

Tags

#computer-vision #yolo26 #sam3 #vlm #robotics
Safety Score: 4/5

Flags: code-execution