Category: Data & Analytics. No API key required.

Web Scraper

A web scraping skill with JavaScript rendering support that extracts website data using CSS selectors, XPath, or AI.

Extract data from websites with support for dynamic content.

When to Use

Scrape data from a website
Extract structured data from HTML
Handle JavaScript-rendered pages
Crawl multiple pages

Features

Static pages: Fast HTML parsing
Dynamic pages: Playwright/Puppeteer rendering
Selectors: CSS, XPath, regex
AI extraction: Auto-detect data patterns

Usage

Simple scrape

python3 scripts/scrape.py \
  --url "https://example.com/products" \
  --selector ".product-name" \
  --output ./products.json
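
To illustrate what a class selector such as ".product-name" does on a static page, here is a minimal sketch using only Python's standard library. It is not how scripts/scrape.py is implemented (the source does not show that); ClassTextExtractor and extract_by_class are hypothetical names, and only single class names are handled, not full CSS selectors.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element that carries a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.items = []     # finished text snippets, in document order
        self._depth = 0     # > 0 while inside a matching element
        self._buffer = []   # text fragments of the current match

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1          # nested tag inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:          # the matching element just closed
            self.items.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_by_class(html, class_name):
    """Hypothetical helper: return the text of all elements with class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    return parser.items


sample = (
    '<ul>'
    '<li class="product-name">Product 1</li>'
    '<li class="product-name featured">Product <b>2</b></li>'
    '<li class="other">ignored</li>'
    '</ul>'
)
names = extract_by_class(sample, "product-name")
```

A real scraper would more likely rely on a selector engine that understands the full CSS and XPath syntax; the sketch only shows the matching idea.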

With JavaScript rendering

python3 scripts/scrape.py \
  --url "https://spa-example.com/data" \
  --render \
  --wait 2000 \
  --selector ".data-item"

Extract multiple fields

python3 scripts/scrape.py \
  --url "https://example.com/listings" \
  --fields '{
    "title": "h1.title",
    "price": ".price",
    "description": ".desc"
  }'
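
Hand-quoting JSON inside a shell command is easy to get wrong. One possible way around it, assuming the script is launched from Python, is to serialize the field map with json.dumps and pass an argument list. The helper name build_scrape_command is hypothetical; only the script path and the --url and --fields flags come from the example above.

```python
import json


def build_scrape_command(url, fields):
    """Return an argv list for scripts/scrape.py with a JSON-encoded field map.

    build_scrape_command is a hypothetical helper, not part of the skill.
    """
    return [
        "python3", "scripts/scrape.py",
        "--url", url,
        "--fields", json.dumps(fields),  # one argv entry, no shell quoting
    ]


command = build_scrape_command(
    "https://example.com/listings",
    {"title": "h1.title", "price": ".price", "description": ".desc"},
)
```

The resulting list can be handed to subprocess.run(command, check=True), which bypasses the shell, so the quotes and braces in the JSON need no escaping.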

Crawl multiple pages

python3 scripts/scrape.py \
  --url "https://example.com/page/1" \
  --crawl 'a[href*="/page/"]' \
  --max-pages 10 \
  --selector ".item"
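
The interplay of --crawl and --max-pages can be pictured as a bounded breadth-first traversal. The sketch below is an assumption about the general technique, not the script's actual logic. The names crawl and fetch_links are hypothetical, and fetch_links stands in for whatever returns the href values matched by the crawl selector on a page.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin


def crawl(start_url, fetch_links, max_pages=10):
    """Visit at most max_pages distinct URLs, breadth-first.

    fetch_links(url) is a caller-supplied (hypothetical) function that returns
    the href values found on that page, e.g. those matching a[href*="/page/"].
    """
    seen = {start_url}            # URLs already queued or visited
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop #fragments before deduplicating.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" so the example runs without network access.
site = {
    "https://example.com/page/1": ["/page/2", "/page/3", "/page/1#top"],
    "https://example.com/page/2": ["/page/3", "/page/4"],
    "https://example.com/page/3": ["/page/4"],
}
pages = crawl("https://example.com/page/1", lambda u: site.get(u, []), max_pages=3)
```

Tracking seen URLs keeps pagination links that point back to earlier pages from being fetched twice, and the page cap bounds the run even on sites with endless pagination.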

AI-powered extraction

python3 scripts/scrape.py \
  --url "https://example.com/article" \
  --ai-extract "Extract the title, author, and publication date"

Output

{
  "success": true,
  "url": "https://example.com/products",
  "items": [
    {"name": "Product 1", "price": "$99"},
    {"name": "Product 2", "price": "$149"}
  ],
  "scraped_at": "2024-01-15T10:30:00Z"
}
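
A short sketch of how a caller might consume this output. The function name load_items is hypothetical, and treating a false "success" value as an error is an assumption, since the source shows only the successful case.

```python
import json
from datetime import datetime


def load_items(raw):
    """Parse the scraper's JSON output and return (items, scraped_at).

    load_items is a hypothetical helper; raising on "success": false is an
    assumption about how failures should be handled.
    """
    result = json.loads(raw)
    if not result.get("success"):
        raise RuntimeError(f"scrape failed for {result.get('url')}")
    # fromisoformat() only accepts a trailing "Z" on Python 3.11+, so normalize it.
    scraped_at = datetime.fromisoformat(result["scraped_at"].replace("Z", "+00:00"))
    return result["items"], scraped_at


raw_output = """
{
  "success": true,
  "url": "https://example.com/products",
  "items": [
    {"name": "Product 1", "price": "$99"},
    {"name": "Product 2", "price": "$149"}
  ],
  "scraped_at": "2024-01-15T10:30:00Z"
}
"""
items, scraped_at = load_items(raw_output)
```

Note that prices arrive as strings such as "$99", so any numeric comparison needs an explicit conversion step.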

Rate Limiting

Default delay: 1 second between requests
Respects robots.txt
Customizable user agent
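
The three behaviors above can be sketched with Python's standard library. This is an illustration under stated assumptions, not the script's implementation: PoliteGate is a hypothetical class, "example-scraper" is a placeholder user agent, and the robots.txt content is passed in as text so the example runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


class PoliteGate:
    """Hypothetical helper: robots.txt check plus a minimum delay between requests."""

    def __init__(self, robots_txt, user_agent="example-scraper", delay=1.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.delay = delay
        self._clock = clock          # injectable so the delay can be tested
        self._sleep = sleep
        self._last = None            # time of the previous request, if any
        self._robots = RobotFileParser()
        self._robots.parse(robots_txt.splitlines())

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch url."""
        return self._robots.can_fetch(self.user_agent, url)

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        if self._last is not None:
            remaining = self.delay - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


gate = PoliteGate("User-agent: *\nDisallow: /private/\n", delay=1.0)
can_scrape = gate.allowed("https://example.com/products")
blocked = not gate.allowed("https://example.com/private/report")
```

In a fetch loop, each request would be preceded by a call to allowed() for the target URL and then wait(), with the same user agent string sent in the request headers.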