Category: Data & Analytics · No API key required

Lightpanda Scraper

Fast headless-browser web scraping with Lightpanda (0.5-second page loads, 90x faster than Chromium), suited to OSINT reconnaissance, link extraction, and content scraping.

Author: hostilespider · clawhub

Lightpanda Scraper — Fast Headless Browser for OSINT

Fast web scraping with Lightpanda, a headless browser written in Zig. Pages load in roughly 0.5s, versus roughly 45s for Chromium/Playwright. Well suited to OSINT recon, link extraction, and content scraping.

Prerequisites

Install the Lightpanda binary:

mkdir -p ~/.local/bin
curl -L https://github.com/lightpanda-io/browser/releases/download/nightly/lightpanda-x86_64-linux -o ~/.local/bin/lightpanda
chmod +x ~/.local/bin/lightpanda
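
Before running the scraper, it can help to confirm that the binary is actually reachable and executable. The following is a minimal sketch, assuming the install location from the commands above; the helper name `find_lightpanda` is hypothetical and not part of this skill.

```python
import os
import shutil
from pathlib import Path


def find_lightpanda(install_dir="~/.local/bin"):
    """Return the path to an executable `lightpanda` binary, or None."""
    # Prefer a binary that is already on PATH.
    on_path = shutil.which("lightpanda")
    if on_path:
        return on_path
    # Fall back to the directory used in the install commands.
    candidate = Path(install_dir).expanduser() / "lightpanda"
    if candidate.is_file() and os.access(candidate, os.X_OK):
        return str(candidate)
    return None


print(find_lightpanda() or "lightpanda not found; re-check the install step")
```

If the fallback path is the one that matches, `~/.local/bin` is probably missing from `PATH`.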

Quick Start

# Dump page as markdown
python3 {baseDir}/scripts/lp-scrape.py https://target.com

# Extract all links
python3 {baseDir}/scripts/lp-scrape.py https://target.com --links

# Get raw HTML
python3 {baseDir}/scripts/lp-scrape.py https://target.com --html

Options

  • --links — Extract and categorize all links from the page
  • --html — Dump raw HTML instead of markdown
  • --frames — Include iframe content
  • --js "code" — Evaluate JavaScript on the page
  • --output FILE — Save output to file
  • --wait MODE — Wait condition: networkidle (default), load, domcontentloaded
  • --strip TYPES — Comma-separated resource types to strip: js, css, images
  • --proxy URL — Use proxy (e.g., socks5://127.0.0.1:9050 for Tor)
  • --timeout SECS — Request timeout (default: 30)
  • --serve --port PORT — Start CDP server mode
  • --mcp — Start as MCP server (stdio)
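
When calling the scraper from another program, assembling the argument list in one place avoids typos in flag names. This is a sketch only: `build_command` and the `SCRIPT` path are hypothetical (the `{baseDir}` placeholder is assumed to be resolved already), and it covers just a subset of the flags listed above, using the spellings given there.

```python
import shlex

SCRIPT = "scripts/lp-scrape.py"  # assumption: {baseDir} already resolved
WAIT_MODES = ("networkidle", "load", "domcontentloaded")


def build_command(url, *, links=False, html=False, wait=None, strip=None,
                  proxy=None, timeout=None, output=None):
    """Assemble an argv list for lp-scrape.py from the documented flags."""
    cmd = ["python3", SCRIPT, url]
    if links:
        cmd.append("--links")
    if html:
        cmd.append("--html")
    if wait is not None:
        if wait not in WAIT_MODES:
            raise ValueError(f"unknown wait mode: {wait!r}")
        cmd += ["--wait", wait]
    if strip:
        # --strip takes a comma-separated list, e.g. js,css,images
        cmd += ["--strip", ",".join(strip)]
    if proxy:
        cmd += ["--proxy", proxy]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if output:
        cmd += ["--output", output]
    return cmd


cmd = build_command("https://target.com", links=True,
                    strip=["js", "css"], timeout=10)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

Passing the list to `subprocess.run` rather than a joined string keeps URLs with special characters away from the shell.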

Use Cases

OSINT Recon

# Quick page dump for analysis
python3 {baseDir}/scripts/lp-scrape.py https://target.com > recon.md

# Extract all endpoints from a site
python3 {baseDir}/scripts/lp-scrape.py https://target.com --links | grep -i api

# Crawl with Tor
python3 {baseDir}/scripts/lp-scrape.py https://target.com --proxy socks5://127.0.0.1:9050
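
The `grep -i api` step above can also be done in Python once the output has been captured. This sketch assumes the `--links` output has one link per line, which the source does not specify; `filter_lines` and the sample text are hypothetical.

```python
def filter_lines(text, needle="api"):
    """Case-insensitive substring filter, mirroring `grep -i`."""
    needle = needle.lower()
    return [line for line in text.splitlines() if needle in line.lower()]


# Hypothetical sample standing in for captured --links output.
sample = (
    "https://target.com/about\n"
    "https://target.com/API/v1/users\n"
    "https://api.target.com/health\n"
)

for line in filter_lines(sample):
    print(line)
```

Like `grep`, this matches any substring, so a path such as `/capital` would also match; a stricter pattern may be needed for real recon.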

Bug Bounty Recon

# Fast subdomain content grab
for sub in api admin dev staging; do
  python3 {baseDir}/scripts/lp-scrape.py "https://${sub}.target.com" --links 2>/dev/null
done
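
The shell loop above discards errors, so it is not obvious which subdomains failed. A Python version can record that. This is a sketch under assumptions: `SCRIPT` is a hypothetical resolved path, and the script is assumed to exit non-zero on failure, which the source does not confirm.

```python
import subprocess

SCRIPT = "scripts/lp-scrape.py"  # assumption: {baseDir} already resolved


def candidate_urls(domain, subs):
    """Build one HTTPS URL per subdomain label."""
    return [f"https://{sub}.{domain}" for sub in subs]


def scrape_links(url, timeout=60):
    """Run the scraper with --links; return stdout, or None on failure."""
    try:
        result = subprocess.run(
            ["python3", SCRIPT, url, "--links"],
            capture_output=True, text=True, timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    # Assumption: a non-zero exit code signals a failed scrape.
    return result.stdout if result.returncode == 0 else None


def sweep(domain, subs):
    """Map each candidate URL to its link dump (None where it failed)."""
    return {url: scrape_links(url) for url in candidate_urls(domain, subs)}
```

For example, `sweep("target.com", ["api", "admin", "dev", "staging"])` returns a dictionary whose `None` values mark the hosts that did not respond.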

Content Extraction

# Save clean markdown
python3 {baseDir}/scripts/lp-scrape.py https://article.com --output article.md

# JavaScript evaluation
python3 {baseDir}/scripts/lp-scrape.py https://app.com --js "document.querySelectorAll('a').length"

CDP Server Mode

# Start server for programmatic access
python3 {baseDir}/scripts/lp-scrape.py --serve --port 9222
# Then connect with any CDP client
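
CDP clients generally need a WebSocket endpoint to attach to. The sketch below builds that endpoint string and checks that something is accepting TCP connections on the port. The bind address `127.0.0.1` and both helper names are assumptions; the source only specifies the port.

```python
import socket


def cdp_endpoint(host="127.0.0.1", port=9222):
    """WebSocket URL to hand to a CDP client (host is an assumption)."""
    return f"ws://{host}:{port}"


def is_listening(host="127.0.0.1", port=9222, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_listening():
    print("CDP server reachable at", cdp_endpoint())
else:
    print("nothing listening on port 9222; start the server first")
```

A successful TCP connection only shows that the port is open, not that the process behind it speaks CDP.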

Speed Comparison

| Tool | Page Load | Memory | Binary Size |
|------|-----------|--------|-------------|
| Lightpanda | ~0.5s | ~50MB | ~100MB |
| Chromium/Playwright | ~45s | ~500MB | ~300MB |
| curl/wget | ~0.3s | ~5MB | N/A |

Lightpanda gives you Playwright-like page rendering at near-curl speeds. The catch: no complex JS interactions (use Playwright for those).
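
The 90x figure quoted for this skill follows directly from the page-load numbers in the comparison. A small sketch of the arithmetic, using only the approximate values given there and assuming pages are fetched one at a time:

```python
# Approximate page-load times in seconds, taken from the comparison above.
PAGE_LOAD_S = {
    "Lightpanda": 0.5,
    "Chromium/Playwright": 45.0,
    "curl/wget": 0.3,
}


def speedup(baseline, tool):
    """How many times faster `tool` is than `baseline`."""
    return PAGE_LOAD_S[baseline] / PAGE_LOAD_S[tool]


def total_seconds(pages, tool):
    """Sequential crawl time for `pages` pages (no parallelism assumed)."""
    return pages * PAGE_LOAD_S[tool]


print(f"{speedup('Chromium/Playwright', 'Lightpanda'):.0f}x")  # 90x
print(total_seconds(1000, "Lightpanda") / 60, "minutes")
print(total_seconds(1000, "Chromium/Playwright") / 3600, "hours")
```

At these rates a 1,000-page sequential crawl takes minutes with Lightpanda and about half a day with Chromium/Playwright; real numbers will vary with the target site and network.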

Notes

  • Lightpanda is in active development; some complex SPAs may not render perfectly
  • For authenticated sessions or complex JS interactions, use Playwright instead
  • The binary is ~100MB of Zig-compiled native code and runs on Linux x86_64
  • Supports HTTP/SOCKS5 proxies for Tor or VPN routing