agent-desktop
CLI tool enabling AI agents to observe and control desktop applications via native OS accessibility trees.
Core principle: agent-desktop is NOT an AI agent. It is a tool that AI agents invoke. It outputs structured JSON with ref-based element identifiers. The observation-action loop lives in the calling agent.
Installation
npm install -g agent-desktop
# or
bun install -g --trust agent-desktop
Requires macOS 12+ with Accessibility permission granted to your terminal. Screen Recording permission is also required for screenshots.
Reference Files
Detailed documentation is split into focused reference files. Read them as needed:
| Reference | Contents |
|-----------|----------|
| references/commands-observation.md | snapshot, find, get, is, screenshot, list-surfaces — all flags, output examples |
| references/commands-interaction.md | click, type, set-value, select, toggle, scroll, drag, keyboard, mouse — choosing the right command |
| references/commands-system.md | launch, close, windows, clipboard, wait, batch, status, permissions, version |
| references/workflows.md | 12 common patterns: forms, menus, dialogs, scroll-find, drag-drop, async wait, anti-patterns |
| references/macos.md | macOS permissions/TCC, AX API internals, smart activation chain, surfaces, Notification Center, troubleshooting |
The Observe-Act Loop (Progressive Skeleton Traversal)
Use progressive skeleton traversal as the default approach. It cuts token consumption by 78-96% for dense apps because it explores the UI in two phases: a shallow skeleton overview, then targeted drill-downs into regions of interest.
1. SKELETON → agent-desktop snapshot --skeleton --app "App" -i --compact
   Parse the overview. Identify the region containing your target.
   Regions show children_count (e.g., "Sidebar" with children_count: 42).
   Named containers at truncation boundary have refs for drill-down.
   Keep the returned snapshot_id.
2. DRILL → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
   Expand the target region. Now you see its interactive elements.
3. ACT → agent-desktop click @e12 --snapshot <snapshot_id> (or type, select, toggle...)
4. VERIFY → agent-desktop snapshot --root @e3 --snapshot <snapshot_id> -i --compact
   Re-drill the same region to confirm the state change.
   Scoped invalidation: only @e3's subtree refs are replaced.
5. REPEAT → Continue drilling other regions or acting as needed.
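
The steps above can be sketched from the calling agent's side as a few argument builders. This is a minimal sketch, not part of agent-desktop: the function names (`skeleton_cmd`, `drill_cmd`, `click_cmd`, `run`) are hypothetical, and only flags shown in this document are used. Where `snapshot_id` sits inside `data` is not specified here, so extracting it is left to the caller.

```python
import json
import subprocess


def skeleton_cmd(app: str) -> list:
    """Step 1: shallow skeleton overview of an app."""
    return ["agent-desktop", "snapshot", "--skeleton", "--app", app, "-i", "--compact"]


def drill_cmd(root_ref: str, snapshot_id: str) -> list:
    """Steps 2 and 4: expand (or re-drill) one region of the skeleton."""
    return ["agent-desktop", "snapshot", "--root", root_ref,
            "--snapshot", snapshot_id, "-i", "--compact"]


def click_cmd(ref: str, snapshot_id: str) -> list:
    """Step 3: act on an element ref from a known snapshot."""
    return ["agent-desktop", "click", ref, "--snapshot", snapshot_id]


def run(argv: list) -> dict:
    """Run one command and decode the JSON envelope printed on stdout.

    Assumption: stdout always holds a single JSON envelope, so the exit
    code is not checked here.
    """
    proc = subprocess.run(argv, capture_output=True, text=True)
    return json.loads(proc.stdout)
```

Passing arguments as a list avoids shell quoting problems with app names such as "System Settings".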
When to skip skeleton and use full snapshot instead:
- Simple apps with few elements (Finder, Calculator, TextEdit)
- You already know the exact element name: use `find` instead
- Surface snapshots (menus, sheets, alerts): these are already focused
When skeleton shines:
- Dense Electron apps (Slack, VS Code, Discord, Notion)
- Any app where full snapshot exceeds ~50 refs
- Multi-region workflows (sidebar + main content + toolbar)
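
The rules of thumb above can be condensed into a small decision helper. This is a rough sketch under stated assumptions: `choose_snapshot_mode` and its return labels are hypothetical, and the threshold of 50 refs simply mirrors the approximate figure quoted above.

```python
from typing import Optional


def choose_snapshot_mode(is_surface: bool,
                         known_ref_count: Optional[int] = None,
                         threshold: int = 50) -> str:
    """Return "full" or "skeleton" following the guidance above.

    known_ref_count is the ref count of an earlier full snapshot of the
    same app, if the caller has one; None means "unknown".
    """
    if is_surface:
        # Menus, sheets and alerts are already focused.
        return "full"
    if known_ref_count is not None and known_ref_count <= threshold:
        # Simple app with few elements.
        return "full"
    # Dense or unknown apps: skeleton is the default approach.
    return "skeleton"
```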
Ref System
- Refs assigned depth-first: `@e1`, `@e2`, `@e3`...
- Only interactive elements get refs: button, textfield, checkbox, link, menuitem, tab, slider, combobox, treeitem, cell
- In skeleton mode, named/described containers at truncation boundary also get refs (drill-down targets with empty `available_actions`)
- Static text, groups, containers remain in tree for context but have no ref
- Refs are deterministic within a snapshot but NOT stable across snapshots if UI changed
- Every snapshot returns `snapshot_id`; ref-consuming commands accept `--snapshot <snapshot_id>`
- `last_refmap.json` is only a latest-snapshot inspection artifact. The command path uses snapshot-scoped storage.
- After any action that changes UI, re-drill the affected region or re-snapshot
- Scoped invalidation: re-drilling `--root @e3` only replaces refs from @e3's previous drill; refs from other regions and the skeleton itself are preserved
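
Scoped invalidation, as described above, can be pictured with a toy model that an agent might keep on its own side to track which refs are still usable. This is a conceptual sketch only: `RefStore` is a hypothetical name, the ref numbers are made up, and nothing here reflects how agent-desktop stores refs internally.

```python
class RefStore:
    """Tracks which drill produced each ref (None means the skeleton)."""

    def __init__(self, skeleton_refs):
        self.owner = {ref: None for ref in skeleton_refs}

    def drill(self, root, refs):
        # Drop only the refs produced by a previous drill of the same root.
        self.owner = {r: o for r, o in self.owner.items() if o != root}
        # Register the refs returned by the new drill.
        for ref in refs:
            self.owner[ref] = root

    def is_live(self, ref):
        return ref in self.owner


store = RefStore(["@e1", "@e3", "@e5"])
store.drill("@e3", ["@e10", "@e11"])   # first drill into @e3
store.drill("@e5", ["@e20"])           # drill into another region
store.drill("@e3", ["@e12"])           # re-drill @e3 after an action
```

After the re-drill, `@e10` and `@e11` are gone, while `@e20` and the skeleton refs remain usable.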
JSON Output Contract
Every command returns a JSON envelope on stdout:
Success: { "version": "2.0", "ok": true, "command": "snapshot", "data": { ... } }
Error: { "version": "2.0", "ok": false, "command": "click", "error": { "code": "STALE_REF", "message": "...", "suggestion": "..." } }
Exit codes: 0 success, 1 structured error, 2 argument error.
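
A calling agent will typically decode this envelope before deciding what to do next. The sketch below assumes only the fields shown in the two examples above; the `Result` dataclass and `parse_envelope` are hypothetical helper names, not part of agent-desktop.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Result:
    ok: bool
    command: str
    data: Any = None
    error_code: Optional[str] = None
    message: Optional[str] = None
    suggestion: Optional[str] = None


def parse_envelope(stdout: str) -> Result:
    """Decode one JSON envelope as printed on stdout."""
    env = json.loads(stdout)
    command = env.get("command", "")
    if env.get("ok"):
        return Result(ok=True, command=command, data=env.get("data"))
    err = env.get("error") or {}
    return Result(
        ok=False,
        command=command,
        error_code=err.get("code"),
        message=err.get("message"),
        suggestion=err.get("suggestion"),
    )


success = parse_envelope(
    '{"version": "2.0", "ok": true, "command": "snapshot", "data": {}}'
)
failure = parse_envelope(
    '{"version": "2.0", "ok": false, "command": "click", '
    '"error": {"code": "STALE_REF", "message": "...", "suggestion": "..."}}'
)
```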
Error Codes
| Code | Meaning | Recovery |
|------|---------|----------|
| PERM_DENIED | Accessibility or Screen Recording permission not granted | Grant the named permission in System Settings |
| ELEMENT_NOT_FOUND | Ref cannot be resolved against the live UI | Re-run snapshot, use fresh ref |
| APP_NOT_FOUND | App not running | Launch it first |
| ACTION_FAILED | AX action rejected | Try an explicit alternative command |
| ACTION_NOT_SUPPORTED | Element can't do this | Use different command |
| STALE_REF | Ref from old snapshot | Re-run snapshot |
| SNAPSHOT_NOT_FOUND | Snapshot ID is missing or expired | Run snapshot again and use the returned ID |
| POLICY_DENIED | A physical/headed path was blocked | Use an explicit mouse/focus/keyboard command if physical interaction is intended |
| WINDOW_NOT_FOUND | No matching window | Check app name, use list-windows |
| PLATFORM_NOT_SUPPORTED | Adapter method not implemented on this platform | Use a supported platform adapter |
| TIMEOUT | Wait condition not met | Increase --timeout |
| INVALID_ARGS | Bad arguments | Check command syntax |
| NOTIFICATION_NOT_FOUND | Notification index no longer exists | Re-run list-notifications |
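
One way for an agent to act on the table above is a lookup from error code to a coarse recovery step. This is a sketch: the error codes come from the table, but the step labels and the `next_step` helper are hypothetical names chosen for illustration.

```python
# Error codes are taken from the table above; the step labels on the
# right are hypothetical and would map to the agent's own handlers.
RECOVERY = {
    "PERM_DENIED": "grant_permission",
    "ELEMENT_NOT_FOUND": "resnapshot",
    "APP_NOT_FOUND": "launch_app",
    "ACTION_FAILED": "try_alternative_command",
    "ACTION_NOT_SUPPORTED": "try_alternative_command",
    "STALE_REF": "resnapshot",
    "SNAPSHOT_NOT_FOUND": "resnapshot",
    "POLICY_DENIED": "use_explicit_physical_command",
    "WINDOW_NOT_FOUND": "list_windows",
    "PLATFORM_NOT_SUPPORTED": "abort",
    "TIMEOUT": "increase_timeout",
    "INVALID_ARGS": "fix_arguments",
    "NOTIFICATION_NOT_FOUND": "list_notifications",
}


def next_step(error_code: str) -> str:
    """Map an error code to a recovery step; unknown codes abort."""
    return RECOVERY.get(error_code, "abort")


def needs_fresh_refs(error_code: str) -> bool:
    """True when the recovery is to snapshot again and use new refs."""
    return next_step(error_code) == "resnapshot"
```

Grouping the three ref-related codes under one step keeps the retry path short: snapshot again, pick the fresh ref, and repeat the action.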
Command Quick Reference (54 commands)
Observation
agent-desktop snapshot --skeleton --app "App" -i --compact # Skeleton overview (preferred)
agent-desktop snapshot --root @e3 -i --compact # Drill into region
agent-desktop snapshot --app "App" -i # Full tree (simple apps)
agent-desktop snapshot --app "App" --surface menu -i # Surface snapshot
agent-desktop screenshot --app "App" out.png # PNG screenshot
agent-desktop find --app "App" --role button # Search elements
agent-desktop get @e1 --snapshot <snapshot_id> --property text # Read element property
agent-desktop is @e1 --snapshot <snapshot_id> --property enabled # Check element state
agent-desktop list-surfaces --app "App" # Available surfaces
Interaction
agent-desktop click @e5 --snapshot <snapshot_id> # AX-first click, no cursor move by default
agent-desktop double-click @e3 # AXOpen; physical double-click uses mouse-click --count 2
agent-desktop triple-click @e2 # Physical triple-click uses mouse-click --count 3
agent-desktop right-click @e5 # Right-click; menu returned when verified
agent-desktop type @e2 --snapshot <snapshot_id> "hello" # Headless AX text insertion when supported
agent-desktop set-value @e2 "new value" # Set value directly
agent-desktop clear @e2 # Clear element value
agent-desktop focus @e2 # Set keyboard focus
agent-desktop select @e4 "Option B" # Select dropdown/list option
agent-desktop toggle @e6 # Toggle checkbox/switch
agent-desktop check @e6 # Idempotent check
agent-desktop uncheck @e6 # Idempotent uncheck
agent-desktop expand @e7 # Expand disclosure
agent-desktop collapse @e7 # Collapse disclosure
agent-desktop scroll @e1 --direction down # Scroll element
agent-desktop scroll-to @e8 # Scroll into view
Keyboard & Mouse
agent-desktop press cmd+c # Key combo
agent-desktop press return --app "App" # Targeted key press
agent-desktop key-down shift # Hold key
agent-desktop key-up shift # Release key
agent-desktop hover @e5 # Explicit cursor movement
agent-desktop hover --xy 500,300 # Cursor to coordinates
agent-desktop drag --from @e1 --to @e5 # Drag between elements
agent-desktop mouse-click --xy 500,300 # Click at coordinates
agent-desktop mouse-move --xy 100,200 # Move cursor
agent-desktop mouse-down --xy 100,200 # Press mouse button
agent-desktop mouse-up --xy 300,400 # Release mouse button
App & Window
agent-desktop launch "System Settings" # Launch and wait
agent-desktop close-app "TextEdit" # Quit gracefully
agent-desktop close-app "TextEdit" --force # Force kill
agent-desktop list-windows --app "Finder" # List windows
agent-desktop list-apps # List running GUI apps
agent-desktop focus-window --app "Finder" # Bring to front
agent-desktop resize-window --app "App" --width 800 --height 600
agent-desktop move-window --app "App" --x 0 --y 0
agent-desktop minimize --app "App"
agent-desktop maximize --app "App"
agent-desktop restore --app "App"
Notifications
agent-desktop list-notifications # List all notifications
agent-desktop list-notifications --app "Slack" # Filter by app
agent-desktop list-notifications --text "deploy" --limit 5 # Filter by text
agent-desktop dismiss-notification 1 # Dismiss by index
agent-desktop dismiss-all-notifications # Dismiss all
agent-desktop dismiss-all-notifications --app "Slack" # Dismiss all from app
agent-desktop notification-action 1 "Reply" --expected-app Slack # Click action (with NC reorder guard)
Clipboard
agent-desktop clipboard-get # Read clipboard
agent-desktop clipboard-set "text" # Write to clipboard
agent-desktop clipboard-clear # Clear clipboard
Wait
agent-desktop wait 1000 # Pause 1 second
agent-desktop wait --element @e5 --snapshot <snapshot_id> --timeout 5000 # Wait for element
agent-desktop wait --window "Title" # Wait for window
agent-desktop wait --text "Done" --app "App" # Wait for text
agent-desktop wait --menu --app "App" # Wait for menu surface
agent-desktop wait --menu-closed --app "App" # Wait for menu dismissal
agent-desktop wait --notification --app "App" # Wait for new notification
System
agent-desktop status # Health check
agent-desktop permissions # Check permission
agent-desktop permissions --request # Trigger permission dialog
agent-desktop version --json # Version info
agent-desktop batch '[...]' --stop-on-error # Batch uses the same typed command path as CLI
agent-desktop skills # List bundled skill docs
agent-desktop skills get desktop --full # Load this skill + all references
Key Principles for Agents
- Skeleton first, drill second. Start with `--skeleton -i --compact` for dense apps. Drill into regions with `--root @ref`. Full snapshot only for simple apps.
- Use `-i --compact` flags. Filters to interactive elements and collapses empty wrappers, minimizing tokens.
- Refs are snapshot-scoped. Keep `snapshot_id` for deterministic multi-step use; re-drill the affected region after any UI-changing action. Scoped invalidation keeps other refs intact.
- Prefer refs over coordinates. `click @e5` > `mouse-click --xy 500,300`.
- Use `wait` for async UI. After launch/dialog triggers, wait for expected state.
- Check permissions first. Run `permissions` on first use; screenshots also need Screen Recording.
- Handle errors. Parse `error.code` and follow `error.suggestion`.
- Use `find` for targeted searches. Faster than any snapshot when you know role/name.
- Use surfaces for overlays. `snapshot --surface menu` for menus, `--surface sheet` for dialogs. Never `--skeleton` for surfaces; they're already focused.
- Batch for performance. Multiple commands in one invocation.
- Headless by default. Ref actions use semantic AX paths and block silent focus stealing, cursor movement, keyboard synthesis, and pasteboard insertion. Use explicit `focus`, `press`, `hover`, `drag`, or `mouse-*` commands only when physical/headed interaction is intended.