Screen Vision
Control the desktop visually: screenshot → AI vision analysis → execute actions → loop until done.
Quick Start
1. Setup (one-time)
Detect platform and install dependencies:
bash scripts/setup/setup-linux.sh --headless # Linux server (no desktop)
bash scripts/setup/setup-linux.sh --desktop # Linux with desktop
bash scripts/setup/setup-mac.sh # macOS
python scripts/setup/setup-win.py # Windows
2. Configure API
Copy config.example.json to config.json and fill in your vision API credentials.
You must set baseUrl, apiKey, and model — supports any OpenAI-compatible API.
{
"vision": {
"baseUrl": "https://api.siliconflow.cn/v1",
"apiKey": "sk-your-key",
"model": "Qwen/Qwen3-VL-32B"
}
}
Environment variables also work: SV_VISION_API_KEY, SV_VISION_BASE_URL, SV_VISION_MODEL.
See references/API_CONFIG.md for all supported providers and detailed setup.
3. Usage
The skill operates through a screenshot-analyze-action loop:
- Take screenshot →
bash scripts/platform/screenshot.sh [output_path] [display] - Analyze with AI →
python3 scripts/vision/analyze.py --image <path> --task "<task>" - Execute action →
python3 scripts/platform/execute.py --action <type> [options] - Full task loop →
python3 scripts/core/run_task.py --task "<task>"
Architecture
User task → run_task.py (orchestrator)
├── screenshot.sh (capture screen)
├── diff_check.py (detect changes, skip if unchanged → saves tokens)
├── analyze.py (send screenshot + task to vision API)
├── safety_check.py (block dangerous operations)
├── execute.py (xdotool/cliclick/pyautogui)
└── loop until done or timeout
Platform Tools
| Platform | Screenshot | Mouse/Keyboard | Notes | |----------|-----------|----------------|-------| | Linux | scrot | xdotool | Headless: XFCE4 + VNC | | macOS | screencapture | cliclick | Needs Accessibility permission | | Windows | pyautogui | pyautogui | No extra setup needed |
See references/PLATFORM_GUIDE.md for platform-specific commands.
Vision Providers
Supports any OpenAI-compatible vision API. You choose the provider and model.
Recommended Models
| Model | Provider | Cost/Task | Quality | |-------|----------|-----------|---------| | Qwen3-VL-32B | SiliconFlow | Low | ★★★★ | | GLM-4V-Plus | Zhipu BigModel | Low | ★★★★ | | GPT-5.4-Mini | OpenAI / relays | Medium | ★★★★★ | | GPT-5.4 CUA | OpenAI | High | ★★★★★ | | Llama 3.2 Vision | Ollama (local) | Free | ★★ |
See references/API_CONFIG.md for per-provider configuration examples.
No defaults are hardcoded — you must configure your own API credentials before use.
Action Types
click— Click at (x, y). Supports left/right/double-click.type— Type text string.key— Press a key (Return, Tab, Escape, etc.).scroll— Scroll up or down.drag— Drag from (x1,y1) to (x2,y2).wait— Wait for screen to update.done— Task complete.failed— Cannot complete task.
Safety
- Blocked: rm -rf, format disk, shutdown, drop database, etc.
- Confirmation required: delete, sudo, payment-related operations
- Limits: max 5 minutes, max 100 actions per task
- Logging: all screenshots saved to
/tmp/screen-vision/logs/ - Auto-stop on error or API failure
Examples
See references/EXAMPLES.md for usage examples.
Config
| Variable | Default | Description |
|----------|---------|-------------|
| SV_VISION_API_KEY | — | Vision API key |
| SV_VISION_BASE_URL | — | API endpoint (required) |
| SV_VISION_MODEL | — | Vision model name (required) |
| SV_DISPLAY | :1 | X11 display (Linux) |
| SV_MAX_DURATION | 5 | Max task duration (min) |
| SV_MAX_ACTIONS | 100 | Max actions per task |
| SV_SCREENSHOT_INTERVAL | 1.0 | Seconds between screenshots |
微信扫一扫