Web Scraper Skill (Apify + Firecrawl)

This skill helps Openclaw scrape and extract data from websites using two powerful APIs:

Firecrawl — best for scraping individual pages, crawling entire sites, and getting LLM-ready content (markdown)
Apify — best for specialized scrapers (social media, Google Maps, e-commerce, etc.) via pre-built Actors

Quick Decision Guide: Apify vs Firecrawl

| Use Case | Recommended Tool |
|---|---|
| Scrape a single page into markdown/JSON | Firecrawl /scrape |
| Crawl an entire website (follow links) | Firecrawl /crawl |
| Map all URLs on a site | Firecrawl /map |
| Search web + scrape results | Firecrawl /search |
| Scrape Instagram / TikTok / Twitter | Apify (social actors) |
| Scrape Google Maps / reviews | Apify (compass/crawler-google-places) |
| Scrape Amazon products | Apify (apify/amazon-scraper) |
| Scrape Google Search results | Apify (apify/google-search-scraper) |
| Custom actor / any Apify Store actor | Apify |

Authentication

Both APIs require an API key, normally passed in the Authorization header (Apify also accepts it as a token query parameter). Always ask the user for their key if it has not been provided.

Firecrawl: Authorization: Bearer fc-YOUR_API_KEY
Apify: Authorization: Bearer YOUR_APIFY_TOKEN (or ?token=YOUR_TOKEN in URL)
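A minimal sketch of building both Authorization headers from environment variables instead of hardcoded keys. The variable names FIRECRAWL_API_KEY and APIFY_TOKEN and the helper names are assumptions made for this example; neither service requires them.

```python
import os


def firecrawl_headers(env=os.environ):
    """Build Firecrawl request headers from an environment mapping."""
    # FIRECRAWL_API_KEY is a hypothetical variable name chosen for this sketch.
    key = env.get("FIRECRAWL_API_KEY")
    if not key:
        raise RuntimeError("Firecrawl API key missing; ask the user for it")
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}


def apify_headers(env=os.environ):
    """Build Apify request headers from an environment mapping."""
    # APIFY_TOKEN is likewise a hypothetical variable name.
    token = env.get("APIFY_TOKEN")
    if not token:
        raise RuntimeError("Apify token missing; ask the user for it")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Passing the environment as an argument keeps the helpers easy to exercise with a plain dict, without touching the real process environment.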

Firecrawl API Reference

Base URL: https://api.firecrawl.dev/v2

1. Scrape a Single Page

POST /v2/scrape
Authorization: Bearer fc-YOUR_API_KEY
Content-Type: application/json

{
  "url": "https://example.com",
  "formats": ["markdown"],          // Options: markdown, html, rawHtml, links, screenshot, json
  "onlyMainContent": true,          // Strips nav/footer/ads
  "waitFor": 0,                     // ms to wait before scraping (for JS-heavy pages)
  "timeout": 30000,                 // ms
  "blockAds": true,
  "proxy": "auto"                   // "auto", "basic", or "stealth"
}

Response: { "success": true, "data": { "markdown": "...", "metadata": {...} } }
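The scrape call above could be issued from Python roughly as follows, using only the standard library. The function names and the 60-second client timeout are assumptions of this sketch; the endpoint, headers, and body fields come from the request shown above.

```python
import json
import urllib.request

FIRECRAWL_BASE = "https://api.firecrawl.dev/v2"


def build_scrape_request(api_key, url, formats=("markdown",), only_main_content=True):
    """Return a urllib Request for POST /v2/scrape (not yet sent)."""
    payload = {
        "url": url,
        "formats": list(formats),
        "onlyMainContent": only_main_content,
    }
    return urllib.request.Request(
        f"{FIRECRAWL_BASE}/scrape",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def scrape_markdown(api_key, url, timeout=60):
    """Send the request and return the page markdown, or raise on failure."""
    req = build_scrape_request(api_key, url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    if not body.get("success"):
        raise RuntimeError(f"Firecrawl scrape failed: {body}")
    return body["data"]["markdown"]
```

Separating request construction from sending makes it possible to inspect the URL, headers, and body without spending credits.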

2. Crawl an Entire Website

Crawling is asynchronous: the request starts a job, and you then poll for results.

POST /v2/crawl
{
  "url": "https://docs.example.com",
  "limit": 50,                      // Max pages
  "maxDiscoveryDepth": 3,           // v2 name; v1 called this maxDepth
  "allowExternalLinks": false,
  "scrapeOptions": {
    "formats": ["markdown"],
    "onlyMainContent": true
  }
}

Response: { "success": true, "id": "crawl-job-id" }

Poll status:

GET /v2/crawl/{crawl-job-id}

Response: { "status": "completed", "total": 50, "data": [...] }
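The start-then-poll pattern can be sketched as a small loop. Here fetch_status is a hypothetical caller-supplied function that performs GET /v2/crawl/{crawl-job-id} and returns the parsed JSON. The "failed" status name, the 2-second interval, and the poll budget are assumptions, and the sketch ignores any pagination of very large results.

```python
import time


def wait_for_crawl(fetch_status, interval=2.0, max_polls=150, sleep=time.sleep):
    """Poll a crawl job until it completes and return its list of pages.

    fetch_status is any zero-argument callable returning the parsed JSON
    of the crawl status endpoint.
    """
    for _ in range(max_polls):
        body = fetch_status()
        status = body.get("status")
        if status == "completed":
            return body.get("data", [])
        if status == "failed":  # assumed name of the failure status
            raise RuntimeError(f"crawl failed: {body}")
        sleep(interval)  # still in progress; wait before asking again
    raise TimeoutError("crawl did not complete within the polling budget")
```

The sleep function is injectable so the loop can be exercised without real delays.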

3. Map a Website's URLs

POST /v2/map
{ "url": "https://example.com" }

Response: { "success": true, "links": [{ "url": "...", "title": "..." }] }

4. Search + Scrape in One Call

POST /v2/search
{
  "query": "best web scraping tools 2025",
  "limit": 5,
  "scrapeOptions": { "formats": ["markdown"] }
}

Response: { "data": [{ "url": "...", "title": "...", "markdown": "..." }] }

5. Batch Scrape Multiple URLs

POST /v2/batch/scrape
{
  "urls": ["https://a.com", "https://b.com"],
  "formats": ["markdown"]
}

Returns a job ID; poll with GET /v2/batch/scrape/{id} until the job completes.

Apify API Reference

Base URL: https://api.apify.com/v2
Auth: Pass token as query param ?token=YOUR_TOKEN or in Authorization header.

Core Workflow

Apify runs "Actors" (pre-built scrapers). The flow is:

Start a run → get a runId and defaultDatasetId
Poll status until SUCCEEDED
Fetch results from the dataset

1. Run an Actor (Async)

POST /v2/acts/{actorId}/runs?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor-specific input... }

Response:

{
  "data": {
    "id": "RUN_ID",
    "status": "RUNNING",
    "defaultDatasetId": "DATASET_ID"
  }
}

Common Actor IDs (in URL paths, write the slash as a tilde, e.g. apify~web-scraper):

apify/web-scraper — generic JS scraper
apify/google-search-scraper — Google SERPs
compass/crawler-google-places — Google Maps
apify/instagram-scraper — Instagram
clockworks/free-tiktok-scraper — TikTok
apify/amazon-scraper — Amazon products

2. Poll Run Status

GET /v2/acts/{actorId}/runs/{runId}?token=YOUR_TOKEN

Poll until status reaches a terminal value: SUCCEEDED, FAILED, ABORTED, or TIMED-OUT. Recommended interval: 5 seconds.

3. Fetch Results

GET /v2/datasets/{datasetId}/items?token=YOUR_TOKEN&format=json

Optional params: format (json/csv/xlsx/xml), limit, offset
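The three steps (start, poll, fetch) can be tied together as in the sketch below. request_json is a hypothetical caller-supplied function that performs an authenticated HTTP call and returns the parsed JSON body; it is not part of any Apify client. The ABORTED and TIMED-OUT statuses and the poll budget are assumptions beyond what is listed above.

```python
import time
from urllib.parse import urlencode

APIFY_BASE = "https://api.apify.com/v2"
# Statuses after which the run will not change any more (partly assumed).
TERMINAL = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def run_actor_and_fetch(request_json, actor_id, actor_input, interval=5.0,
                        max_polls=120, sleep=time.sleep, item_limit=None):
    """Start a run, poll until it finishes, then return the dataset items."""
    path_id = actor_id.replace("/", "~")  # URL form of 'username/actor-name'
    run_url = f"{APIFY_BASE}/acts/{path_id}/runs"
    run = request_json("POST", run_url, actor_input)["data"]
    for _ in range(max_polls):
        if run["status"] in TERMINAL:
            break
        sleep(interval)
        run = request_json("GET", f"{run_url}/{run['id']}", None)["data"]
    # A run still RUNNING after max_polls is also reported as an error here.
    if run["status"] != "SUCCEEDED":
        raise RuntimeError(f"run {run['id']} ended with status {run['status']}")
    params = {"format": "json"}
    if item_limit is not None:
        params["limit"] = item_limit
    items_url = f"{APIFY_BASE}/datasets/{run['defaultDatasetId']}/items"
    return request_json("GET", f"{items_url}?{urlencode(params)}", None)
```

Keeping the HTTP layer behind request_json leaves the choice of header or query-parameter authentication to the caller.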

4. Run Synchronously (≤5 minutes)

For short runs, use the sync endpoint — it waits and returns dataset items directly:

POST /v2/acts/{actorId}/run-sync-get-dataset-items?token=YOUR_TOKEN
Content-Type: application/json

{ ...actor input... }
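A sketch of the synchronous call using the standard library and header authentication. The function names are illustrative, and the 310-second client timeout is an assumption chosen to sit just above the 5-minute server-side limit.

```python
import json
import urllib.request

APIFY_BASE = "https://api.apify.com/v2"


def build_sync_run_request(token, actor_id, actor_input):
    """Return a urllib Request for run-sync-get-dataset-items (not yet sent)."""
    path_id = actor_id.replace("/", "~")  # URL form of 'username/actor-name'
    return urllib.request.Request(
        f"{APIFY_BASE}/acts/{path_id}/run-sync-get-dataset-items",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_sync(token, actor_id, actor_input, timeout=310):
    """Send the request and return the dataset items as parsed JSON."""
    req = build_sync_run_request(token, actor_id, actor_input)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)
```

If the run is likely to exceed the limit, the asynchronous start-and-poll flow is the safer choice.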

Common Actor Inputs

Google Search Scraper:

{ "queries": "web scraping tools", "maxPagesPerQuery": 1, "resultsPerPage": 10 }

Google Maps Scraper:

{ "searchStringsArray": ["restaurants in Mumbai"], "maxCrawledPlaces": 20 }

Web Scraper (generic):

{
  "startUrls": [{ "url": "https://example.com" }],
  "pageFunction": "async function pageFunction(context) { const $ = context.jQuery; return { title: $('title').text() }; }",
  "maxPagesPerCrawl": 10
}
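The inputs above can be generated by small builder functions, which helps keep limits low while testing. The helper names are hypothetical, and joining several queries with newlines in the queries field is an assumption about the Google Search Scraper's input format.

```python
def google_search_input(queries, pages_per_query=1, results_per_page=10):
    """Build Google Search Scraper input from a list of query strings."""
    # Assumption: multiple queries are sent as one newline-separated string.
    cleaned = [q.strip() for q in queries if q.strip()]
    return {
        "queries": "\n".join(cleaned),
        "maxPagesPerQuery": pages_per_query,
        "resultsPerPage": results_per_page,
    }


def google_maps_input(search_strings, max_places=20):
    """Build Google Maps Scraper input from a list of search strings."""
    return {
        "searchStringsArray": list(search_strings),
        "maxCrawledPlaces": max_places,
    }
```

For example, google_maps_input(["restaurants in Mumbai"]) reproduces the Google Maps input shown above.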

Output Handling

Firecrawl returns data directly in the response (or via polling for crawl/batch).
Apify stores results in a dataset; retrieve with GET /v2/datasets/{id}/items.
Both support JSON output. Firecrawl also provides clean markdown ideal for LLMs.
Apify also supports CSV, XLSX, XML output formats.

Code Templates

See references/code-templates.md for ready-to-run Python and JavaScript code for both APIs.

Error Handling

Firecrawl 402 → out of credits; user needs to upgrade plan
Firecrawl 429 → rate limited; add delays between requests
Apify FAILED run → check run logs via GET /v2/acts/{id}/runs/{runId}/log
Always wrap API calls in try/catch and check for success: false in Firecrawl responses
Firecrawl crawls respect robots.txt by default
For JS-heavy pages, increase waitFor (Firecrawl) or use Playwright/Puppeteer actors (Apify)
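One way to handle the 402 and 429 cases is a retry wrapper with exponential backoff, sketched below. send is a hypothetical zero-argument callable returning (status_code, parsed_body); the attempt count and delays are illustrative values, not limits published by either service.

```python
import time


class OutOfCredits(Exception):
    """Raised on HTTP 402; retrying cannot help."""


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on HTTP 429 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = send()
        if status == 402:
            raise OutOfCredits("Firecrawl credits exhausted; plan upgrade needed")
        if status == 429:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            continue
        if status >= 400:
            raise RuntimeError(f"HTTP {status}: {body}")
        if isinstance(body, dict) and body.get("success") is False:
            raise RuntimeError(f"request failed: {body}")
        return body
    raise RuntimeError("still rate limited after retries")
```

A 402 is raised immediately because waiting does not restore credits, whereas a 429 is usually transient.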

Best Practices

Start small — test with 1 URL or a small limit before scaling
Use onlyMainContent: true in Firecrawl to remove nav/footer noise
Choose async for large jobs — don't use sync endpoints for crawls with 50+ pages
Store API keys securely — never hardcode them; use environment variables
Check rate limits — Firecrawl: varies by plan; Apify: 250k requests/min global
Prefer Firecrawl for LLM pipelines — markdown output is clean and ready for RAG/AI
Prefer Apify for social/structured data — specialized actors handle anti-bot better