返回 Skill 列表
extension
分类: 其它需要 API Key

Screen Vision

AI屏幕视觉与桌面控制技能,让AI智能体能够看到屏幕、理解界面元素,并自主执行鼠标和键盘操作。

person作者: guitu917hubclawhub

Screen Vision

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

  1. Take screenshotbash scripts/platform/screenshot.sh [output_path] [display]
  2. Analyze with AIpython3 scripts/vision/analyze.py --image <path> --task "<task>"
  3. Execute actionpython3 scripts/platform/execute.py --action <type> [options]
  4. Full task looppython3 scripts/core/run_task.py --task "<task>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

| Platform | Screenshot | Mouse/Keyboard | Notes | |----------|-----------|----------------|-------| | Linux | scrot | xdotool | Headless: XFCE4 + VNC | | macOS | screencapture | cliclick | Needs Accessibility permission | | Windows | pyautogui | pyautogui | No extra setup needed |

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

| Model | Provider | Cost/Task | Quality | |-------|----------|-----------|---------| | Qwen3-VL-32B | SiliconFlow | Low | ★★★★ | | GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ | | GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ | | GPT-5.4 CUA | OpenAI | High | ★★★★★ | | Llama 3.2 Vision | Ollama (local) | Free | ★★ |

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

  • click — Click at (x, y). Supports left/right/double-click.
  • type — Type text string.
  • key — Press a key (Return, Tab, Escape, etc.).
  • scroll — Scroll up or down.
  • drag — Drag from (x1,y1) to (x2,y2).
  • wait — Wait for screen to update.
  • done — Task complete.
  • failed — Cannot complete task.

Safety

  • Blocked: rm -rf, format disk, shutdown, drop database, etc.
  • Confirmation required: delete, sudo, payment-related operations
  • Limits: max 5 minutes, max 100 actions per task
  • Logging: all screenshots saved to /tmp/screen-vision/logs/
  • Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

| Variable | Default | Description | |----------|---------|-------------| | SV_VISION_API_KEY | — | Vision API key | | SV_VISION_BASE_URL | — | API endpoint (required) | | SV_VISION_MODEL | — | Vision model name (required) | | SV_DISPLAY | :1 | X11 display (Linux) | | SV_MAX_DURATION | 5 | Max task duration (min) | | SV_MAX_ACTIONS | 100 | Max actions per task | | SV_SCREENSHOT_INTERVAL | 1.0 | Seconds between screenshots |