返回 Skill 列表
extension
分类: 营销与增长无需 API Key

robots-txt

当用户想要配置、审核或优化robots.txt时。也适用于用户提到“robots.txt”、“爬虫规则”、“阻止爬虫”、“AI爬虫”、“GPTBot”、“允许/禁止”、“禁止路径”、“爬取指令”、“用户代理”、“阻止Googlebot”、“修复robots.txt”、“robots.txt阻止”或“搜索引擎爬取”时。

person作者: jakexiaohubgithub

SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

When invoking: On first use, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On subsequent use or when the user asks to skip, go directly to the main output.

Scope (Technical SEO)

  • Robots.txt: Configure Disallow/Allow, Sitemap, Clean-param; audit for accidental blocks
  • Crawler access: Path-level crawl control; AI crawler allow/block strategy
  • Differentiation: robots.txt = crawl control (who accesses what paths); noindex = index control (what gets indexed). See indexing for page-level exclusions.

Initial Assessment

Check for project context first: If .claude/project-context.md or .cursor/project-context.md exists, read it for site URL and indexing goals.

Identify:

  1. Site URL: Base domain (e.g., https://example.com)
  2. Indexing scope: Full site, partial, or specific paths to exclude
  3. AI crawler strategy: Allow search/indexing vs. block training data crawlers

Best Practices

Purpose and Limitations

| Point | Note | |-------|------| | Purpose | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) | | Advisory | Rules are advisory; malicious crawlers may ignore | | Public | robots.txt is publicly readable; use noindex or auth for sensitive content. See indexing |

Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? | |------|----------|-------------------| | robots.txt | Crawl (path-level) | No—blocked URLs may still appear in SERP | | noindex (meta / X-Robots-Tag) | Index (page-level) | Yes. See indexing | | nofollow | Link equity only | No—does not control indexing |

When to Use robots.txt vs noindex

| Use | Tool | Example | |-----|------|---------| | Path-level (whole directory) | robots.txt | Disallow: /admin/, Disallow: /api/, Disallow: /staging/ | | Page-level (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See indexing for full list | | Critical | Do NOT block in robots.txt | Pages that use noindex—crawlers must access the page to read the directive |

Paths to block in robots.txt: /admin/, /api/, /staging/, temp files. Paths to use noindex (allow crawl): /login/, /signup/, /thank-you/, etc.—see indexing.

Location and Format

| Item | Requirement | |------|-------------| | Path | Site root: https://example.com/robots.txt | | Encoding | UTF-8 plain text | | Standard | RFC 9309 (Robots Exclusion Protocol) |

Core Directives

| Directive | Purpose | Example | |-----------|---------|---------| | User-agent: | Target crawler | User-agent: Googlebot, User-agent: * | | Disallow: | Block path prefix | Disallow: /admin/ | | Allow: | Allow path (can override Disallow) | Allow: /public/ | | Sitemap: | Declare sitemap absolute URL | Sitemap: https://example.com/sitemap.xml | | Clean-param: | Strip query params (Yandex) | See below |

Critical: Do Not Block

| Do not block | Reason | |--------------|--------| | CSS, JS, images | Google needs them to render pages; blocking breaks indexing | | /_next/ (Next.js) | Breaks CSS/JS loading; static assets in GSC "Crawled - not indexed" is expected. See indexing | | Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

Only block: paths that don't need crawling: /admin/, /api/, /staging/, temp files.

AI Crawler Strategy

robots.txt is effective for all measured AI crawlers (Vercel/MERJ study, 2024). Set rules per user-agent; check each vendor's docs for current tokens.

| User-agent | Purpose | Typical | |------------|---------|---------| | OAI-SearchBot | ChatGPT search | Allow | | GPTBot | OpenAI training | Disallow | | Claude-SearchBot | Claude search | Allow | | ClaudeBot | Anthropic training | Disallow | | PerplexityBot | Perplexity search | Allow | | Google-Extended | Gemini training | Disallow | | CCBot | Common Crawl (LLM training) | Disallow | | Bytespider | ByteDance | Disallow | | Meta-ExternalAgent | Meta | Disallow | | AppleBot | Apple (Siri, Spotlight); renders JS | Allow for indexing |

Allow vs Disallow: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See site-crawlability for AI crawler optimization (SSR, URL management).

Clean-param (Yandex)

Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid

Output Format

  • Current state (if auditing)
  • Recommended robots.txt (full file)
  • Compliance checklist
  • References: Google robots.txt

Related Skills

  • indexing: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
  • page-metadata: Meta robots (noindex, nofollow) implementation
  • xml-sitemap: Sitemap URL to reference in robots.txt
  • site-crawlability: Broader crawl and structure guidance; AI crawler optimization
  • rendering-strategies: SSR, SSG, CSR; content in initial HTML for crawlers