Screen Vision

Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.

Quick Start

1. Setup (one-time)

Detect platform and install dependencies:

bash scripts/setup/setup-linux.sh --headless   # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop     # Linux with desktop
bash scripts/setup/setup-mac.sh                 # macOS
python scripts/setup/setup-win.py          # Windows

2. Configure API

Copy config.example.json to config.json and fill in your vision API credentials. You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.

{
  "vision": {
    "baseUrl": "https://api.siliconflow.cn/v1",
    "apiKey": "sk-your-key",
    "model": "Qwen/Qwen3-VL-32B"
  }
}

Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL. See references/API_CONFIG.md for all supported providers and detailed setup.

3. Usage

The skill operates through a screenshot-analyze-action loop:

Take screenshot → bash scripts/platform/screenshot.sh [output_path] [display]
Analyze with AI → python3 scripts/vision/analyze.py --image <path> --task "<task>"
Execute action → python3 scripts/platform/execute.py --action <type> [options]
Full task loop → python3 scripts/core/run_task.py --task "<task>"

Architecture

User task → run_task.py (orchestrator)
  ├── screenshot.sh (capture screen)
  ├── diff_check.py (detect changes, skip if unchanged → saves tokens)
  ├── analyze.py (send screenshot + task to vision API)
  ├── safety_check.py (block dangerous operations)
  ├── execute.py (xdotool/cliclick/pyautogui)
  └── loop until done or timeout

Platform Tools

| Platform | Screenshot | Mouse/Keyboard | Notes | |----------|-----------|----------------|-------| | Linux | scrot | xdotool | Headless: XFCE4 + VNC | | macOS | screencapture | cliclick | Needs Accessibility permission | | Windows | pyautogui | pyautogui | No extra setup needed |

See references/PLATFORM_GUIDE.md for platform-specific commands.

Vision Providers

Supports any OpenAI-compatible vision API. You choose the provider and model.

Recommended Models

| Model | Provider | Cost/Task | Quality | |-------|----------|-----------|---------| | Qwen3-VL-32B | SiliconFlow | Low | ★★★★ | | GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ | | GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ | | GPT-5.4 CUA | OpenAI | High | ★★★★★ | | Llama 3.2 Vision | Ollama (local) | Free | ★★ |

See references/API_CONFIG.md for per-provider configuration examples.

No defaults are hardcoded — you must configure your own API credentials before use.

Action Types

click — Click at (x, y). Supports left/right/double-click.
type — Type text string.
key — Press a key (Return, Tab, Escape, etc.).
scroll — Scroll up or down.
drag — Drag from (x1,y1) to (x2,y2).
wait — Wait for screen to update.
done — Task complete.
failed — Cannot complete task.

Safety

Blocked: rm -rf, format disk, shutdown, drop database, etc.
Confirmation required: delete, sudo, payment-related operations
Limits: max 5 minutes, max 100 actions per task
Logging: all screenshots saved to /tmp/screen-vision/logs/
Auto-stop on error or API failure

Examples

See references/EXAMPLES.md for usage examples.

Config

| Variable | Default | Description | |----------|---------|-------------| | SV_VISION_API_KEY | — | Vision API key | | SV_VISION_BASE_URL | — | API endpoint (required) | | SV_VISION_MODEL | — | Vision model name (required) | | SV_DISPLAY | :1 | X11 display (Linux) | | SV_MAX_DURATION | 5 | Max task duration (min) | | SV_MAX_ACTIONS | 100 | Max actions per task | | SV_SCREENSHOT_INTERVAL | 1.0 | Seconds between screenshots |