Node Crawler (crawler package)
crawler is a Node.js web spider library: internal queue + configurable
connection pool + per-domain rate limiting + automatic retries + proxy
rotation + charset detection + server-side Cheerio (jQuery-style) HTML
parsing. Built on got, supports HTTP/2.
Prerequisites
- Node.js >= 22
- Pure ESM: import Crawler from "crawler"
- If the codebase must use CommonJS, install crawler@beta
npm install crawler
When to use
This skill is for production-grade, large-scale crawling. Reach for it when the task is substantial:
- Scraping many pages (dozens to millions) with structured data extraction
- Batch-downloading files (images, PDFs, archives) with retry and resume (resume logic is developer-implemented via userParams and file existence checks)
- Long-running spiders that need rate limiting, retries, and connection pooling
- Multi-step workflows — pagination, link following, cascading crawls
- Proxy rotation, charset detection, HTTP/2 — infrastructure a real production crawler depends on
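The resume logic mentioned above is left to the developer. A minimal sketch, assuming a hypothetical helper named pendingDownloads and an item shape of { url, filename } (neither is part of the crawler package): skip every file that already exists and carry the target filename in userParams.

```javascript
// Hypothetical helper (not part of crawler): keep only downloads whose
// target file is not on disk yet, and carry the filename in userParams.
// `exists` is injected so the check can be swapped; fs.existsSync fits.
function pendingDownloads(items, exists) {
  return items
    .filter((item) => !exists(item.filename))
    .map((item) => ({
      url: item.url,
      userParams: { filename: item.filename }, // read back via res.options.userParams
    }));
}

// Wiring sketch (not executed here):
//   import fs from "node:fs";
//   for (const task of pendingDownloads(items, fs.existsSync)) c.add(task);
```

This treats any existing file as complete. A download interrupted mid-write would need an extra check (for example writing to a temporary name first), which the sketch does not attempt.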
When NOT to use
- A single page or one-off request → curl is far lighter. Spinning up a Crawler instance for 1-2 pages is overkill.
- Pages requiring JavaScript rendering → use the agent-browser skill (Playwright/Puppeteer) instead
- Simple API data fetching → fetch/got with JSON parsing
API
Core decision: Queue vs Send
| Approach | When |
|----------|------|
| crawler.add() + 'drain' event | Most cases. Goes through the queue, pool, rate limiter, retries, proxy rotation |
| crawler.send() | One-off requests. Bypasses queue, rate limiter, preRequest, 'request' event |
Basic usage: Queue mode
import Crawler from "crawler";
const c = new Crawler({
  maxConnections: 10,
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      const $ = res.$; // Cheerio instance (enabled by default)
      console.log($("title").text());
    }
    done(); // REQUIRED: releases the connection slot, or the crawler deadlocks
  },
});
c.on("drain", () => console.log("All done"));
c.add("https://example.com");
c.add(["https://a.com", "https://b.com"]);
c.add({
  url: "https://c.com",
  jQuery: false,
  callback: (e, res, done) => { /* custom callback */ done(); },
});
The two most critical rules:
- Call done() from every branch of every queue callback, including the if (error) branch
- The crawler is finished when the 'drain' event fires, not when add() returns. add() merely enqueues tasks
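One way to make the first rule hard to break is to wrap callbacks so that done() runs in a finally block. This is a sketch under assumptions: withDone and its onError parameter are hypothetical names, not crawler API, and the wrapper only covers synchronous handlers.

```javascript
// Hypothetical wrapper (not part of crawler): runs `handler`, then calls
// done() exactly once, whether the handler returns normally or throws.
// Only synchronous handlers are covered; async work needs its own handling.
function withDone(handler, onError = console.error) {
  return (error, res, done) => {
    try {
      handler(error, res);
    } catch (err) {
      onError(err); // report, but never skip done()
    } finally {
      done(); // always release the connection slot
    }
  };
}

// Wiring sketch (not executed here):
//   const c = new Crawler({ callback: withDone((error, res) => { /* ... */ }) });
```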
Rate limiting and retries
const c = new Crawler({
  rateLimit: 1000,     // at least 1000 ms between requests (forces maxConnections to 1)
  retries: 2,          // default: 2
  retryInterval: 3000, // ms to wait before retrying
  timeout: 20000,      // request timeout in ms
  callback: (e, res, done) => { /* ... */ done(); },
});
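For capacity planning, the settings above translate into rough time bounds. The arithmetic below is a back-of-the-envelope sketch, not crawler behavior: it assumes every attempt runs into the full timeout, that retryInterval is waited once before each retry, and that a single rate limiter serves all requests. Both function names are hypothetical.

```javascript
// Worst case for one URL: (retries + 1) attempts that each time out,
// plus one retryInterval wait before each retry.
function worstCaseMsPerUrl({ retries, retryInterval, timeout }) {
  return (retries + 1) * timeout + retries * retryInterval;
}

// Lower bound for draining `count` requests through one limiter:
// the first request starts immediately, each later one waits rateLimit ms.
function minQueueMs(count, rateLimit) {
  return Math.max(0, count - 1) * rateLimit;
}

// With the values from the example above:
//   worstCaseMsPerUrl({ retries: 2, retryInterval: 3000, timeout: 20000 }) === 66000
//   minQueueMs(100, 1000) === 99000
```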
Data extraction with Cheerio
Cheerio is enabled by default. Use jQuery selectors to extract data:
callback: (e, res, done) => {
  if (e) {
    console.error(e);
    return done(); // nothing to parse on a failed request
  }
  const $ = res.$;
  const titles = $("h2.title").map((i, el) => $(el).text().trim()).get();
  const links = $("a").map((i, el) => $(el).attr("href")).get();
  done();
}
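The href values collected above are often relative, so they usually need to be resolved before being passed back to add(). A minimal sketch using the standard URL class; absoluteLinks is a hypothetical helper, and reading the page URL from res.options.url is an assumption to verify against the options reference.

```javascript
// Hypothetical helper (not part of crawler): resolve hrefs against the
// page URL, keep http(s) links only, and drop duplicates.
function absoluteLinks(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    let url;
    try {
      url = new URL(href, baseUrl);
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    url.hash = ""; // fragment-only differences point at the same document
    seen.add(url.href);
  }
  return [...seen];
}

// Wiring sketch inside a callback (not executed here):
//   for (const link of absoluteLinks(links, res.options.url)) c.add(link);
```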
Binary file download
import fs from "fs";
const c = new Crawler({
  encoding: null, // keep body as Buffer
  jQuery: false,  // skip Cheerio parsing
  callback: (err, res, done) => {
    if (!err) fs.writeFileSync(res.options.userParams.filename, res.body);
    done();
  },
});
c.add({ url: "https://host/file.png", userParams: { filename: "file.png" } });
Proxy rotation
Use the 'schedule' event for dynamic assignment (preferred over using the
proxies array):
c.on("schedule", options => { options.proxy = "http://proxy:port"; });
c.on("request", options => { options.searchParams = { t: Date.now() }; });
Different proxies can have different rate limiters:
c.add({ url: "...", rateLimiterId: 1, proxy: "http://p1:port" });
c.add({ url: "...", rateLimiterId: 2, proxy: "http://p2:port" });
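A round-robin picker is one simple way to combine the two ideas above: rotate proxies and give each proxy its own rate limiter. This is a sketch; proxyPicker is a hypothetical helper, the proxy URLs are placeholders, and numbering limiters from 1 simply mirrors the example above.

```javascript
// Hypothetical helper (not part of crawler): cycles through `proxies` and
// pairs each proxy with its own rateLimiterId (1, 2, ...).
function proxyPicker(proxies) {
  if (proxies.length === 0) throw new Error("need at least one proxy");
  let next = 0;
  return () => {
    const picked = { proxy: proxies[next], rateLimiterId: next + 1 };
    next = (next + 1) % proxies.length;
    return picked;
  };
}

// Wiring sketches (not executed here; proxy URLs are placeholders):
//   const pick = proxyPicker(["http://p1:8080", "http://p2:8080"]);
//   c.add({ url: "https://example.com", ...pick() });             // at add() time
//   c.on("schedule", (options) => { options.proxy = pick().proxy; }); // or per task
```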
HTTP/2
c.add({ url: "https://...", http2: true, callback: (e, res, done) => { done(); } });
When using Charles or self-signed certs, add rejectUnauthorized: false.
Passing context data
Use userParams to attach data; read it back in the callback via
res.options.userParams. Do not attach custom fields directly on the
options object.
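As an illustration, pagination is a common case where each task needs to remember its own context. The sketch below builds one task per listing page; pageTasks is a hypothetical helper and the "page" query parameter is an assumed name that depends on the target site.

```javascript
// Hypothetical helper (not part of crawler): one task per listing page,
// with the page number carried in userParams.
function pageTasks(baseUrl, firstPage, lastPage) {
  const tasks = [];
  for (let page = firstPage; page <= lastPage; page += 1) {
    const url = new URL(baseUrl);
    url.searchParams.set("page", String(page)); // assumed parameter name
    tasks.push({ url: url.href, userParams: { page } });
  }
  return tasks;
}

// Wiring sketch (not executed here):
//   for (const task of pageTasks("https://example.com/list", 1, 3)) c.add(task);
//   // in the callback: const { page } = res.options.userParams;
```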
Gotchas
- ❌ Forgetting done() in the if (error) branch → crawler deadlocks
- ❌ Writing console.log("done") right after add() → listen for 'drain' instead
- ❌ Setting maxConnections > 1 when rateLimit > 0 → it gets overridden to 1
- ❌ Expecting send() to trigger preRequest or 'request' → send() bypasses all queue mechanics
- ❌ POST form data via body → v2 requires form
- ❌ Binary download without encoding: null → corrupt output
Options reference
The complete options table is in references/options.md.
Code examples
Full, runnable examples for every scenario are in references/examples.md:
- Basic queue crawling
- Rate limiting
- Cheerio data extraction
- Binary download
- Direct requests
- HTTP/2
- Proxy rotation
- preRequest hooks
- Full spider (pagination + extraction + following links)