Category: Development & Engineering · No API Key required

Clawd Cursor 0.6.0

AI desktop agent: control any app on Windows/macOS through your OpenClaw agent. Send a natural-language task to the Clawd Cursor API and it carries out every step automatically: ...

Author: sieyerhubclawhub

Clawd Cursor

If a human can do it on a screen, you can too. No API? No integration? No problem. Every app already has a UI — Clawd Cursor gives you eyes and hands to use all of them.

No task is impossible. If it's on the screen, it's within reach.

IMPORTANT — READ THIS BEFORE ANYTHING ELSE

You have a desktop AI agent that operates the GUI on the user's machine. This means:

  1. You CAN open apps, click buttons, type text, fill forms
  2. You CAN see what's on screen (via Clawd Cursor's screenshot pipeline)
  3. You CAN interact with applications — browser, desktop apps, settings
  4. You MUST respect safety tiers — 🟢 Auto actions run freely, 🔴 Confirm actions require user approval
  5. You MUST ask before accessing sensitive apps (email, banking, private messages, password managers)
  6. You SHOULD prefer direct tools (file read, web_fetch, CLI) over desktop automation when available

Think of Clawd Cursor as your hands and eyes on the user's computer — used with their permission.


When to Use This (AGENTS: READ THIS)

You have a desktop UI agent available. Use it when the user asks you to interact with their desktop, or when no other tool can accomplish the task.

Use Clawd Cursor for:

  • User-requested desktop tasks — "open Gmail and send an email", "check my calendar"
  • Read a webpage — when web_fetch or browser tools aren't available
  • Interact with desktop apps — click buttons, fill forms, read results
  • Browser tasks — search, navigate, fill forms (when browser tool unavailable)
  • Visual verification — did the page load? what does the UI show?
  • Cross-app workflows — copy from one app, paste in another
  • Settings changes — when the user explicitly asks

⚠️ Sensitive App Policy

Always ask the user before accessing:

  • Email clients (Gmail, Outlook)
  • Banking or financial apps
  • Private messaging (WhatsApp, Signal, Telegram)
  • Password managers
  • Admin panels or cloud consoles

Don't use Clawd Cursor when:

  • You can do it with a direct API call or CLI command (faster)
  • The task is purely computational (math, text generation, code writing)
  • You can already read/write the file directly
  • The browser tool or web_fetch can handle it

Direct Browser Access (Fast Path)

For quick page reads without a full task, connect to Chrome via Playwright CDP:

const { chromium } = require('playwright');

(async () => {
  // Attach to the Chrome instance already listening on the CDP port
  const browser = await chromium.connectOverCDP('http://127.0.0.1:9222');
  const pages = browser.contexts()[0].pages();
  const text = await pages[0].innerText('body');
})();

Use this when you just need page content — faster than sending a task.

| Scenario | Use | Why |
|----------|-----|-----|
| Read page content/text | CDP Direct | Instant, free |
| Fill a web form | REST API | Clawd handles multi-step planning |
| Check if a page loaded | CDP Direct | Just read the title/URL |
| Click through a complex UI flow | REST API | Clawd handles planning |
| Get a list of elements on page | CDP Direct | Fast DOM query |
| Interact with a desktop app | REST API | CDP is browser-only |


REST API Reference

Base URL: http://127.0.0.1:3847

Note: On Windows PowerShell, use curl.exe (with .exe) or Invoke-RestMethod. Bare curl is aliased to Invoke-WebRequest which behaves differently.

Pre-flight Check

Before your first task, verify Clawd Cursor is running:

curl.exe -s http://127.0.0.1:3847/health

Expected: {"status":"ok","version":"0.6.0"}

If connection refused — start it yourself (don't ask the user):

# Find the skill directory and start the server
Start-Process -FilePath "node" -ArgumentList "dist/index.js","start" -WorkingDirectory "<clawd-cursor-directory>" -WindowStyle Hidden
Start-Sleep 3
# Verify it's running
curl.exe -s http://127.0.0.1:3847/health

The skill directory is the directory that contains SKILL.md (this file). Substitute its path for <clawd-cursor-directory> as the working directory.
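The same pre-flight check can be done from Node. The following is a minimal sketch, assuming Node 18+ (which provides a global fetch); `checkHealth` and its injectable `fetchImpl` parameter are hypothetical names chosen for illustration, not part of Clawd Cursor.

```javascript
// Hypothetical helper: resolves to true only when GET /health reports status "ok".
// fetchImpl is injectable so the logic can be exercised without a live server.
async function checkHealth(fetchImpl = fetch, baseUrl = 'http://127.0.0.1:3847') {
  try {
    const res = await fetchImpl(`${baseUrl}/health`);
    if (!res.ok) return false;
    const body = await res.json();
    return body.status === 'ok';
  } catch {
    // A refused connection (server not started) surfaces as a rejected fetch.
    return false;
  }
}
```

A false result is the cue to start the server as described above and check again.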

Sending a Task (Async — Returns Immediately)

POST /task accepts the task and returns immediately. The task runs in the background. You must poll /status to know when it's done.

curl.exe -s -X POST http://127.0.0.1:3847/task -H "Content-Type: application/json" -d "{\"task\": \"YOUR_TASK_HERE\"}"

PowerShell:

Invoke-RestMethod -Uri http://127.0.0.1:3847/task -Method POST -ContentType "application/json" -Body '{"task": "YOUR_TASK_HERE"}'
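When the task text itself contains quotes, hand-escaping the JSON in a shell gets error-prone. One way around that, sketched here under the assumption that the request is sent from Node 18+, is to let JSON.stringify build the body; `buildTaskRequest` is a hypothetical helper name.

```javascript
// Hypothetical helper: builds the fetch arguments for POST /task.
// JSON.stringify handles escaping, so quotes in the task text need no manual work.
function buildTaskRequest(task, baseUrl = 'http://127.0.0.1:3847') {
  if (typeof task !== 'string' || task.trim() === '') {
    throw new TypeError('task must be a non-empty string');
  }
  return {
    url: `${baseUrl}/task`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ task }),
    },
  };
}

// Usage inside an async function, with the server running:
//   const { url, init } = buildTaskRequest('Open Chrome and go to github.com');
//   const accepted = await (await fetch(url, init)).json();
```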

Polling Pattern (Follow This)

1. POST /task → get accepted
2. Wait 2 seconds
3. GET /status
4. If status is "idle" → done
5. If status is "waiting_confirm" → ASK THE USER, then POST /confirm based on their answer
6. If still running → wait 2 more seconds, go to step 3
7. If 60+ seconds → POST /abort and retry with clearer instructions
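The numbered steps above can be sketched as a small loop. This is an illustration, not Clawd Cursor's own client: `runPolling` and the injected callbacks (`getStatus`, `askUser`, `confirm`, `abort`, `sleep`) are hypothetical names the caller would wire to GET /status, a user prompt, POST /confirm, POST /abort, and a timer.

```javascript
// Hypothetical polling loop following the numbered steps above.
// Only time spent sleeping counts toward the timeout, so waiting on the
// user's answer does not trigger an abort.
async function runPolling({ getStatus, askUser, confirm, abort, sleep,
                            intervalMs = 2000, timeoutMs = 60000 }) {
  let elapsed = 0;
  while (elapsed < timeoutMs) {
    await sleep(intervalMs);                              // steps 2 and 6
    elapsed += intervalMs;
    const state = await getStatus();                      // step 3
    if (state.status === 'idle') return 'done';           // step 4
    if (state.status === 'waiting_confirm') {             // step 5
      const approved = await askUser(state.currentStep);  // never self-approve
      await confirm(approved === true);
    }
  }
  await abort();                                          // step 7
  return 'aborted';
}
```

Injecting the callbacks keeps the loop independent of any HTTP client and makes the timeout path easy to exercise with a no-op `sleep`.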

Checking Status

curl.exe -s http://127.0.0.1:3847/status

Confirming Safety-Gated Actions

Some actions (sending messages, deleting) require approval. 🔴 NEVER self-approve these. Always ask the user for confirmation before POST /confirm. These exist to protect the user — do not bypass them.

curl.exe -s -X POST http://127.0.0.1:3847/confirm -H "Content-Type: application/json" -d "{\"approved\": true}"
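One way to keep the approval tied to what the user actually said is to derive the request body from their reply and treat anything ambiguous as a refusal. The sketch below assumes the reply arrives as free text; `confirmBodyFromReply` and its list of accepted words are illustrative assumptions.

```javascript
// Hypothetical helper: turns the user's typed reply into the body for POST /confirm.
// Only an explicit yes approves; anything unclear is treated as a refusal.
function confirmBodyFromReply(reply) {
  const normalized = String(reply ?? '').trim().toLowerCase();
  const approved = ['y', 'yes', 'approve', 'approved'].includes(normalized);
  return JSON.stringify({ approved });
}
```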

Aborting a Task

curl.exe -s -X POST http://127.0.0.1:3847/abort

Reading Logs (Debugging)

curl.exe -s http://127.0.0.1:3847/logs

Returns the last 200 log entries. Check for error or warn entries when a task fails.

Response States

| State | Response | What to do |
|-------|----------|------------|
| Accepted | {"accepted": true, "task": "..."} | Start polling |
| Running | {"status": "acting", "currentTask": "...", "stepsCompleted": 2} | Keep polling |
| Waiting confirm | {"status": "waiting_confirm", "currentStep": "..."} | Ask the user, then POST /confirm with their answer |
| Done | {"status": "idle"} | Task complete |
| Busy | {"error": "Agent is busy", "state": {...}} | Wait or POST /abort first |
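The response states above map naturally onto a small dispatcher. This sketch assumes the responses have exactly the shapes listed; `nextAction` and its return labels are hypothetical, and treating any other non-empty status as "still running" is an assumption rather than documented behavior.

```javascript
// Hypothetical dispatcher: maps a parsed JSON response to the caller's next move.
function nextAction(res) {
  if (res.error === 'Agent is busy') return 'wait_or_abort';
  if (res.accepted === true) return 'start_polling';
  switch (res.status) {
    case 'idle':
      return 'done';
    case 'waiting_confirm':
      return 'ask_user'; // the user decides; the agent never self-approves
    default:
      // Assumption: any other non-empty status (e.g. "acting") means still running.
      return res.status ? 'keep_polling' : 'unknown';
  }
}
```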


CDP Direct Reference

Chrome must be running with --remote-debugging-port=9222.

Quick check:

curl.exe -s http://127.0.0.1:9222/json/version

If this returns JSON, Chrome is ready.

Connecting via Playwright:

const { chromium } = require('playwright');

(async () => {
  // Attach to the Chrome instance already listening on the CDP port
  const browser = await chromium.connectOverCDP('http://127.0.0.1:9222');
  const context = browser.contexts()[0];
  const page = context.pages()[0];

  // Read page content
  const title = await page.title();
  const url = page.url();
  const text = await page.textContent('body');

  // Click by role
  await page.getByRole('button', { name: 'Submit' }).click();

  // Fill a field
  await page.getByLabel('Email').fill('user@example.com');

  // Read specific elements
  const buttons = await page.$$eval('button', els => els.map(e => e.textContent));
})();
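The snippet above takes the first page in the first context, which may not be the tab you want when several are open. A possible refinement, sketched under the assumption that the target tab can be recognized by its URL, is to search the page list; `findPageByUrl` is a hypothetical helper.

```javascript
// Hypothetical helper: returns the first page whose URL contains `fragment`,
// or undefined when no open tab matches. `pages` is what context.pages() returns.
function findPageByUrl(pages, fragment) {
  return pages.find((page) => page.url().includes(fragment));
}

// Usage inside an async function, after connectOverCDP:
//   const page = findPageByUrl(browser.contexts()[0].pages(), 'github.com');
//   if (page) console.log(await page.title());
```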

Task Writing Guidelines

  1. Be specific — include app names, URLs, exact text to type, button names
  2. One task at a time — wait for completion before sending the next
  3. Describe the goal, not the clicks — say "Send an email to john@example.com about the meeting" not "click compose, click to field..."
  4. Check status if a task seems to hang
  5. Don't include credentials in task text — tasks are logged
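Because tasks are logged, it can help to screen task text before sending it. The check below is only a rough heuristic: the keyword list and the `looksLikeCredential` name are illustrative assumptions, and it will miss secrets that carry no label.

```javascript
// Heuristic only: flags task text that appears to embed a labeled secret,
// such as "password: hunter2" or "api_key=abc123".
const SECRET_HINT = /\b(password|passwd|api[_-]?key|secret|token)\b\s*(:|=|is\b)\s*\S+/i;

function looksLikeCredential(taskText) {
  return SECRET_HINT.test(taskText);
}
```

If the check fires, ask the user to enter the secret themselves instead of placing it in the task.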

Task Examples

| Goal | Task to send |
|------|-------------|
| Simple navigation | Open Chrome and go to github.com |
| Read screen content | What text is currently displayed in Notepad? |
| Cross-app workflow | Copy the email address from the Chrome tab and paste it into the To field in Outlook |
| Form filling | In the open Chrome tab, fill the contact form: name "John Doe", email "john@example.com" |
| App interaction | Open Spotify and play the Discover Weekly playlist |
| Settings change | Open Windows Settings and turn on Dark Mode |
| Data extraction | Read the stock price shown in the Bloomberg tab in Chrome |
| Complex browser | Open YouTube, search for "Adele Hello", and play the first video result |
| Verification | Check if the deployment succeeded: look at the Vercel dashboard in Chrome |
| Send email | Open Gmail, compose email to john@example.com, subject: Meeting Tomorrow, body: Confirming 2pm. Best regards. |
| Take screenshot | Take a screenshot |

Error Recovery

| Problem | Solution |
|---------|----------|
| Connection refused on :3847 | Start Clawd Cursor: cd clawd-cursor && npm start |
| Connection refused on :9222 | Start Chrome with CDP: Start-Process chrome -ArgumentList "--remote-debugging-port=9222" |
| Agent returns "busy" | Poll /status and wait for idle, or POST /abort |
| Task fails with no details | Check /logs for error entries |
| Task completes but wrong result | Rephrase with more specifics: exact app name, button text, field labels |
| Same task fails repeatedly | Break into smaller tasks (one action per task) |
| Safety confirmation pending | Ask the user, then POST /confirm with {"approved": true} or {"approved": false} to match their answer |
| Task hangs > 60 seconds | POST /abort, then retry with simpler phrasing |


How It Works — 4-Layer Pipeline

| Layer | What | Speed | Cost |
|-------|------|-------|------|
| 0: Browser Layer | URL detection → direct navigation | Instant | Free |
| 1: Action Router | Regex + UI Automation | Instant | Free |
| 1.5: Smart Interaction | 1 LLM plan → CDP/UIDriver executes | ~2-5s | 1 LLM call |
| 2: Accessibility Reasoner | UI tree → text LLM decides | ~1s | Cheap |
| 3: Computer Use | Screenshot → vision LLM | ~5-8s | Expensive |

More than 80% of tasks are handled by Layers 0-1 (free, instant). The vision model is used only as a last resort.
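The cheapest-first ordering can be pictured as a fallthrough chain. The sketch below is purely illustrative and is not Clawd Cursor's implementation: `runLayers` and the `{ name, handle }` layer shape are hypothetical, and it assumes a layer signals "cannot handle this" by resolving to null.

```javascript
// Illustrative only: tries cheaper layers first and falls through to costlier ones.
// Each layer is a hypothetical { name, handle } pair; handle resolves to a result,
// or to null when that layer cannot complete the task.
async function runLayers(task, layers) {
  for (const layer of layers) {
    const result = await layer.handle(task);
    if (result !== null) return { layer: layer.name, result };
  }
  return { layer: null, result: null };
}
```

Ordering the array from free to expensive is what keeps the vision model at the end of the line.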

Safety Tiers

| Tier | Actions | Behavior |
|------|---------|----------|
| 🟢 Auto | Navigation, reading, opening apps | Runs immediately |
| 🟡 Preview | Typing, form filling | Logs before executing |
| 🔴 Confirm | Sending messages, deleting | Pauses; ask the user before POST /confirm. Never self-approve. |

Security & Privacy

Network Isolation

  • API binds to 127.0.0.1 only — not network accessible. Verify: netstat -an | findstr 3847 should show 127.0.0.1:3847
  • Screenshots stay in memory, never saved to disk (unless --debug)
  • No telemetry, no analytics, no phone-home calls
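The netstat check in the first bullet can be automated by parsing the command's text output. This sketch assumes the Windows `netstat -an` column layout (protocol, local address, foreign address, state); `isLoopbackOnly` is a hypothetical helper name.

```javascript
// Hypothetical check over the text printed by `netstat -an` on Windows.
// Returns true only if the port is listening and every listener is on 127.0.0.1.
function isLoopbackOnly(netstatOutput, port = 3847) {
  const listeners = netstatOutput
    .split(/\r?\n/)
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols[0] === 'TCP' && cols[3] === 'LISTENING')
    .map((cols) => cols[1])
    .filter((local) => local.endsWith(`:${port}`));
  return listeners.length > 0 && listeners.every((l) => l === `127.0.0.1:${port}`);
}
```

A listener on 0.0.0.0 or [::] for the same port makes the function return false, which would mean the API is reachable from the network.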

Data Flow

  • With Ollama (local): 100% offline — zero external network calls. No data leaves the machine.
  • With cloud providers: screenshots/text are sent to the user's chosen provider API only. No data goes to skill authors, ClawHub, or third parties.
  • OpenClaw users: credentials auto-discovered from local config files — no keys stored in skill directory.
  • The user controls data flow by choosing their provider. Ollama = fully private.

Agent Autonomy Controls

  • 🟢 Auto actions (navigation, reading, opening apps) run without prompting
  • 🟡 Preview actions (typing, form filling) are logged before executing
  • 🔴 Confirm actions (sending messages, deleting, purchases) always pause for user approval
  • Agents must ask the user before accessing sensitive apps (email, banking, messaging, passwords)
  • Agents must never self-approve 🔴 Confirm actions

Setup (User Reference)

Setup is handled by the user. If Clawd Cursor isn't running, start it yourself using the exec tool:

Start-Process -FilePath "node" -ArgumentList "dist/index.js","start" -WorkingDirectory "<skill-directory>" -WindowStyle Hidden

Only ask the user if you cannot start it (e.g., node not installed, build missing). In that case, the user installs and builds it as follows:

git clone https://github.com/AmrDab/clawd-cursor.git
cd clawd-cursor
npm install && npm run build
npx clawd-cursor doctor    # auto-detects and configures everything
npm start                  # starts on port 3847

macOS: Grant Accessibility permission to terminal: System Settings → Privacy & Security → Accessibility

| Provider | Setup | Cost |
|----------|-------|------|
| Ollama (free) | ollama pull <model> | $0 (fully offline) |
| Any cloud provider | Set AI_API_KEY=your-key | Varies by provider |
| OpenClaw users | Automatic, no setup needed | Uses configured provider |


Performance Optimization

These optimizations reduce task execution latency and LLM API costs. Reference patch files are in perf/references/patches/.

Applied Optimizations

| # | Name | Impact |
|---|------|--------|
| 1 | Screenshot hash cache | 90% fewer LLM calls on static screens |
| 2 | Parallel screenshot+a11y | 30-40% per-step latency cut |
| 3 | A11y context cache (2s TTL) | Eliminates redundant PS spawns |
| 4 | Screenshot compression | 52% smaller payload (58KB vs 120KB) |
| 5 | Async debug writes | 94% less event loop blocking |
| 6 | Streaming LLM responses | 1-3s faster per LLM call |
| 7 | Trimmed system prompts | ~60% fewer prompt tokens |
| 8 | A11y tree filtering | Interactive elements only, 3000 char cap |
| 9 | Combined PS script | 1 spawn instead of 3 |
| 10 | Taskbar cache (30s TTL) | Skip expensive taskbar query |
| 11 | Delay reduction | 50-150ms vs 200-1500ms |
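Two of the ideas in this list, the screenshot hash cache (1) and the TTL caches (3 and 10), are simple enough to sketch. The code below illustrates the general techniques only and is not taken from the patches in perf/references/patches/; `createScreenshotGate` and `createTtlCache` are hypothetical names, and SHA-256 is an assumed choice of hash.

```javascript
const crypto = require('crypto');

// Sketch of idea 1: report whether the screenshot bytes differ from the previous
// capture, so the caller can skip the LLM call when the screen is static.
function createScreenshotGate() {
  let lastHash = null;
  return function hasChanged(imageBuffer) {
    const hash = crypto.createHash('sha256').update(imageBuffer).digest('hex');
    const changed = hash !== lastHash;
    lastHash = hash;
    return changed;
  };
}

// Sketch of ideas 3 and 10: reuse a loaded value until its TTL expires.
// `now` is injectable so expiry can be tested without real waiting.
function createTtlCache(ttlMs, now = Date.now) {
  let value;
  let storedAt = -Infinity;
  return async function get(load) {
    if (now() - storedAt >= ttlMs) {
      value = await load();
      storedAt = now();
    }
    return value;
  };
}
```

With a 2000 ms TTL, repeated calls within two seconds return the stored value without invoking `load` again.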

Benchmarks (2560x1440)

| Metric | v0.3 (VNC) | v0.4 (Native) | v0.4.1+ (Optimized) |
|--------|------------|---------------|----------------------|
| Screenshot capture | ~850ms | ~50ms | ~57ms |
| Screenshot size | ~200KB | ~120KB | ~58KB |
| A11y context (uncached) | N/A | ~600ms | ~462ms |
| A11y context (cached) | N/A | 0ms | 0ms (2s TTL) |
| Delays (per step) | N/A | 200-1500ms | 50-600ms |
| System prompt tokens | N/A | ~800 | ~300 |
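The percentage figures quoted for the optimizations can be cross-checked against the benchmark numbers with a one-line calculation. `reductionPct` is a hypothetical helper; the inputs are the v0.4 and v0.4.1+ values from the table above.

```javascript
// Percentage reduction from `before` to `after`, rounded to one decimal place.
function reductionPct(before, after) {
  return Math.round(((before - after) / before) * 1000) / 10;
}

console.log(reductionPct(120, 58));   // screenshot size in KB, prints 51.7
console.log(reductionPct(800, 300));  // system prompt tokens, prints 62.5
```

Both results are in line with the "52% smaller payload" and "~60% fewer prompt tokens" entries.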

Perf Tools

  • perf/apply-optimizations.ps1 — apply all patches
  • perf/perf-test.ts — benchmark harness (npx ts-node perf/perf-test.ts)