返回 Skill 列表
extension
分类: 其它需要 API Key

Web Scraper - Firecrawl

使用 Firecrawl API 进行网页抓取和数据提取,支持爬取网站、提取结构化数据、网页转 Markdown 等功能。

person作者: antonia-szhubclawhub

Firecrawl Skill

Powerful web scraping powered by Firecrawl - turn websites into LLM-ready markdown.

Overview

Firecrawl provides APIs for:

  • Scrape - Single page extraction to markdown
  • Crawl - Entire site crawling with depth control
  • Map - URL discovery from a starting point
  • Batch - Multiple URL processing
  • Extract - Structured data extraction with schemas

Prerequisites

  1. Firecrawl API Key - Get free tier at https://firecrawl.dev
  2. Install Python dependencies: requests

Configuration

Set environment variable:

export FIRECRAWL_API_KEY="fc-your-api-key"

Usage

Single Page Scraping

# Basic scrape
firecrawl scrape https://example.com

# With specific options
firecrawl scrape https://example.com --formats markdown,html --only-main-content

# Wait for JS rendering
firecrawl scrape https://spa-app.com --wait-for 2000

Site Crawling

# Crawl entire site (up to limit)
firecrawl crawl https://docs.example.com --limit 50

# With depth control
firecrawl crawl https://blog.example.com --max-depth 2 --limit 100

# Include/exclude patterns
firecrawl crawl https://site.com --include "/blog/*" --exclude "/admin/*"

# Custom formats
firecrawl crawl https://docs.example.com --formats markdown,links

URL Mapping

# Discover all URLs from a site
firecrawl map https://example.com

# With search term
firecrawl map https://docs.python.org --search "tutorial"

Batch Processing

# Scrape multiple URLs
firecrawl batch urls.txt --output ./scraped/

# From JSON list
firecrawl batch urls.json --formats markdown --concurrency 5

Structured Extraction

# Extract specific data using CSS selectors
firecrawl extract https://example.com/products \
  --schema '{"name": ".product-title", "price": ".price", "description": ".desc"}'

# Extract to JSON
firecrawl extract https://news.example.com/article --schema article-schema.json

Output Formats

Markdown

Clean, LLM-ready markdown with:

  • Headings preserved
  • Links converted to markdown format
  • Images with alt text
  • Tables formatted as markdown tables

HTML

Raw or cleaned HTML

Links

Extracted link lists for further crawling

Screenshot

Page screenshot (if requested)

Use Cases

Knowledge Base Building

# Crawl documentation site
firecrawl crawl https://docs.framework.com --limit 200 -o ./kb/

# Merge into single file for RAG
cat ./kb/*.md > knowledge-base.md

Research & Analysis

# Scrape competitor pricing
firecrawl batch competitors.txt --extract pricing-schema.json

# Monitor blog updates
firecrawl map https://blog.company.com --since 2024-01-01

Content Migration

# Export old CMS content
firecrawl crawl https://old-site.com --formats markdown,html -o ./export/

Scripts

All functionality via scripts/firecrawl.py:

  • Handles API authentication
  • Automatic rate limiting
  • Retry logic for failures
  • Progress tracking for large crawls

Integration

Works well with:

  • markdown-sync-pro - Sync scraped content to Notion/GitHub
  • arxiv-paper - Combine with academic paper downloads
  • maybe-finance - Scrape financial data for analysis