返回 Skill 列表
extension
分类: 其它无需 API Key

Douyin Content Tracker Skill

This skill should be used when the user wants to scrape Douyin (TikTok China) creator content, download audio, and transcribe it with Whisper. Covers first-t...

person作者: gpttanghubclawhub

Douyin Content Tracker

Scrapes Douyin creator videos via MediaCrawler, downloads audio with ffmpeg, and transcribes speech with Whisper.

Finding the Skill Base Directory

All commands must run from this skill's directory. To locate it, run:

python -c "import pathlib; print([p for p in pathlib.Path.home().rglob('douyin-content-tracker-skill/SKILL.md')])"

Or check common locations:

  • ~/.claude/skills/douyin-content-tracker-skill/
  • The path shown when the skill was installed

Set it as a variable for convenience:

SKILL_DIR="~/.claude/skills/douyin-content-tracker-skill"   # adjust to actual path
cd "$SKILL_DIR"

First-Time Setup

Run these steps once on a new machine.

1. Install Python dependencies

cd $SKILL_DIR
pip install -r scripts/requirements.txt
python -m playwright install chromium

2. Install MediaCrawler

# Windows
git clone https://github.com/NanmiCoder/MediaCrawler D:/MediaCrawler
cd D:/MediaCrawler && pip install -r requirements.txt

# macOS/Linux
git clone https://github.com/NanmiCoder/MediaCrawler ~/MediaCrawler
cd ~/MediaCrawler && pip install -r requirements.txt

3. Configure .env

cd $SKILL_DIR
cp .env.template .env

Edit .env — required field:

MEDIACRAWLER_DIR=D:/MediaCrawler    # adjust to actual MediaCrawler path (use ~/MediaCrawler on macOS/Linux)

Optional overrides:

# Where to store data/audio/subtitles/models (default: ~/DouyinContentTracker or %USERPROFILE%\DouyinContentTracker)
OUTPUT_BASE_DIR=/Users/me/DouyinContentTracker

# Whisper model size (default: medium)
WHISPER_MODEL=small

4. Add target accounts

Edit accounts.txt (or set TRACKER_ACCOUNTS_FILE / pass --accounts-file when running):

博主名称 | https://www.douyin.com/user/MS4wLjABAAAA...

5. First login (generates cookie)

cd $SKILL_DIR
python scripts/scrape_profile.py

A browser opens — scan the Douyin QR code to log in. Cookie is saved to .douyin_cookies.json.


Daily Usage

cd $SKILL_DIR

# Track latest 3 videos per account (default). main.py mirrors track_latest.py
python scripts/track_latest.py
# or
python scripts/main.py

# Track latest N videos
python scripts/track_latest.py --limit 5

# Use a custom account list (also works via env TRACKER_ACCOUNTS_FILE)
python scripts/track_latest.py --accounts-file /path/to/accounts.txt

# Skip audio download and transcription (data only)
python scripts/track_latest.py --no-audio

Cookie Refresh

When scraping returns 0 videos or warns "Cookie 已 N 天未更新":

cd $SKILL_DIR
python scripts/scrape_profile.py    # opens browser, scan QR

Pipeline Flow

accounts.txt (or the list pointed by --accounts-file / TRACKER_ACCOUNTS_FILE)
    ↓
scripts/scrape_profile.py   → MediaCrawler (CDP) → OUTPUT_BASE_DIR/data/*.csv
    ↓
scripts/clean_data.py       → normalized OUTPUT_BASE_DIR/data/cleaned_*.csv
    ↓
scripts/download_video.py   → Playwright + ffmpeg → OUTPUT_BASE_DIR/audio/{blogger}/*.m4a
    ↓
scripts/extract_subtitle.py → Whisper → OUTPUT_BASE_DIR/subtitles/{blogger}/{video_id}.md

Output Locations

All generated files live under OUTPUT_BASE_DIR (defaults to ~/DouyinContentTracker on macOS/Linux, %USERPROFILE%\DouyinContentTracker on Windows).

| Subdir | Contents | |--------|----------| | data/cleaned_*.csv | Scraped + normalized video metadata | | audio/{blogger}/{video_id}.m4a | Extracted audio | | subtitles/{blogger}/{video_id}.md | Whisper transcript (title as first line) | | subtitles/{blogger}.md | All transcripts for one blogger merged |


Execution Logging Guide

When running the pipeline, report progress to the user after each step completes. Do not wait until the entire pipeline finishes.

Step-by-step reporting template:

After each Bash tool call returns, immediately tell the user:

| Step | What to report | |------|---------------| | 采集(scrape) | 博主名称、采集到的视频条数,若失败注明原因 | | 清洗(clean) | 清洗后有效条数 | | 音频下载(download) | 成功下载的音频数 / 总数,跳过的条数 | | 语音识别(whisper) | 生成的字幕文件数,输出路径 | | 完成 | 汇总:共处理博主数、视频数、生成字幕数,以及输出目录路径 |

If a step fails, stop the pipeline, report the error output verbatim, and suggest the matching fix from references/troubleshooting.md before asking the user whether to continue.

Example output style:

[步骤 1/4 采集] 博主「某某」— 采集完成,共 10 条视频
[步骤 2/4 清洗] 有效数据 10 条 → data/cleaned_profile_xxx.csv
[步骤 3/4 音频] 下载完成 8/10(2 条无音频流,已跳过)
[步骤 4/4 字幕] 生成 8 个字幕文件 → subtitles/某某/
[完成] 1 位博主 · 10 条视频 · 8 个字幕,输出目录:~/DouyinContentTracker

References

Load these files into context when debugging or extending the pipeline:

  • references/pipeline.md — per-script technical breakdown, data schemas, key function signatures
  • references/troubleshooting.md — fixes for cookie, MediaCrawler, ffmpeg, Whisper, and data errors