X.com 全量推文采集 / X.com Comprehensive Tweet Scraper
中文: 通过 Playwright 浏览器自动化三种方案采集 X 用户的公开推文历史。 English: Scrape public tweet history from X.com users via three Playwright-based strategies.
Tags: #twitter-scraper #x-tweet-collector #playwright-automation #web-scraping #data-collection #social-media-analytics #x-api-scraping #browser-automation #tweet-archive #python-scraper
前置条件 / Prerequisites
环境 / Environment
- Python 3.10+,
pip install playwright,python -m playwright install chromium - 三种方案均使用 Playwright(
from playwright.sync_api import sync_playwright),不依赖requests - Windows:
$env:PYTHONIOENCODING="utf-8"
Cookie / Authentication
# 1. Login to https://x.com in Chrome/Edge
# 2. F12 → Application → Cookies → https://x.com
# 3. Right-click → Export as JSON
# 4. Save as <username>_data/x_cookies.json
方案对比 / Strategy Comparison
| 方案 | 脚本 | 覆盖量 | 限流 | 推荐场景 |
| Strategy | Script | Coverage | Rate Limit | Best For |
|----------|--------|----------|-----------|----------|
| 自动搜索 Auto-Search | scrape_x_auto.py | ~2000+ | 低 Low | 全量采集 Full collection |
| GraphQL API | scrape_x_v9.py | ~900 | 高 High | 补充最新 Fetch latest |
| DOM 滚动 DOM Scroll | scrape_x_scroll.py | ~1400 | 最低 Lowest | Profile 页面滚动 Profile scrolling |
方案一:自动搜索(推荐) / Strategy 1: Auto-Search (Recommended)
逻辑流程 / Logic Flow
Load seed file (known IDs)
→ Generate 14-day search chunks (2025-07-01 ~ today)
→ For each chunk:
→ Open X.com search: from:user since:date until:date
→ Detect login status (check "Sign in")
→ Scroll to bottom → wait 3s → extract DOM articles
→ If blank: scroll up 2000px → retry
→ Dedup by ID → save new tweets
→ If y-position unchanged for 25 rounds → chunk complete
→ If <30 new tweets in chunk → split into 7-day sub-chunks
→ Merge into seed file
关键参数 / Key Config
CHUNK_DAYS = 14 # 搜索块大小 days per chunk
SCROLL_DELAY = 3 # 滚动间隔秒数 scroll interval
WEEK_CHUNK_THRESHOLD = 30 # 稀疏块触发拆周阈值 sub-chunk threshold
BLOCK_TIMEOUT = 180 # 每块超时秒秒 per-chunk timeout
SAME_POS_LIMIT = 25 # 滚动不变判定到底 scroll stuck limit
运行 / Run
$env:PYTHONIOENCODING="utf-8"
python -u scripts/scrape_x_auto.py --username target_user --seed existing_data.json
方案二:GraphQL API / Strategy 2: GraphQL API
逻辑流程 / Logic Flow
Playwright launch → inject cookies → open x.com/home
→ Verify login (check "Sign in")
→ Call UserByScreenName via browser fetch() to get user_id
→ Call UserTweets via browser fetch() to get tweets
URL: https://x.com/i/api/graphql/{QUERY_ID}/UserTweets?variables={...}
NOTE: Must append /UserTweets after QUERY_ID
→ Extract from response: instructions → TimelineAddEntries → tweet_results → legacy
→ Paginate via cursor-bottom
→ 15-min delay between pages (PAGE_DELAY=900)
→ On 429: wait 30 min (RATE_WAIT=1800), max 60 retries (MAX_RATE_LIMITS=60)
爬坑要点 / Pitfalls
- Query ID expires; extract fresh one from browser DevTools Network tab
- Variable names use camelCase:
screenNamenotscreen_name - Response structure: check both
data.user.resultanddata.user_result_by_screen_name.result
运行 / Run
$env:PYTHONIOENCODING="utf-8"
python -u scripts/scrape_x_v9.py --username target --batch --pages 50 --seed data.json --auto-merge data.json
方案三:DOM 滚动 / Strategy 3: DOM Scroll
逻辑流程 / Logic Flow
Playwright launch → inject cookies → open profile page /username
→ Verify login
→ Jump to checkpoint scroll_y (if resuming)
→ Loop:
→ window.scrollTo(0, document.body.scrollHeight)
→ wait 3s → extract article[data-testid="tweet"]
→ If blank: scroll up 2000px → retry (3x)
→ Dedup by ID → save new tweets
→ If scroll_y unchanged for MAX_EMPTY=50 rounds → bottom reached
→ Merge into seed file → cleanup checkpoint
运行 / Run
$env:PYTHONIOENCODING="utf-8"
python -u scripts/scrape_x_scroll.py --username target --seed existing_data.json
踩坑记录 / Known Issues
| # | 问题 Issue | 修复 Fix | 适用脚本 |
|---|-----------|----------|---------|
| 1 | Login check false-positive on "Login" string | Only check "Sign in" in content | All |
| 2 | Search page "Something went wrong" | Split into small chunks (14 days) | auto |
| 3 | Scroll blank after big jump | Scroll up 2000px to trigger lazy load | auto, scroll |
| 4 | Fast-skip exits early when all tweets known | Don't increment empty counter when DOM has content | scroll |
| 5 | Time field name mismatch | timestamp (scroll/auto) vs created_at (API) | merge |
| 6 | GraphQL API ~900 tweet hard limit | Use search/scroll for full history | v9 |
| 7 | Query ID + missing /UserTweets suffix | Append operation name: {QID}/UserTweets | v9 |
数据合并 / Data Merge
with open(seed_file) as f:
raw = json.load(f)
target = raw.get('tweets', raw) if isinstance(raw, dict) else raw
d = {t['id']: t for t in target if 'id' in t}
for t in new_tweets:
d[t['id']] = t
merged = list(d.values())
# Write back
文件结构 / File Structure
x-scraper-comprehensive/
├── SKILL.md # 主配置 Main config
├── README.md # 双语说明 Bilingual readme
├── README_EN.md # 英文版 English version
├── scripts/
│ ├── scrape_x_auto.py # 自动搜索 Auto-search
│ ├── scrape_x_v9.py # GraphQL API
│ └── scrape_x_scroll.py # DOM 滚动 DOM scroll
└── references/
└── workflow.md # 完整工作流 Full workflow
微信扫一扫