Category: Data & Analysis. No API key required.

URL to Markdown

Converts HTML pages at HTTP/HTTPS URLs into clean, readable Markdown files, with optional batch processing and formatting features.

Author: rwonlyhubclawhub

Url2md

Convert web pages to clean, readable Markdown.

Quick Start

Single URL

python3 scripts/url2md.py https://example.com/article

Output to a file:

python3 scripts/url2md.py https://example.com/article -o article.md

Batch Conversion

Create a file with URLs (one per line):

https://example.com/article-1
https://example.com/article-2
https://example.com/article-3

Convert all and save to a directory:

python3 scripts/url2md.py -f urls.txt -d ./markdown_files/

Features

  • No dependencies: Uses only Python standard library (urllib, html.parser)
  • Reader-style scope: Strips script/style/noscript/template, then prefers the first <article> or <main> (else <body>) so output focuses on primary content
  • Title extraction: Uses og:title / Twitter title when present, otherwise <title>, added as H1 when enabled
  • YAML Frontmatter: Extracts structured metadata (title, author, published, description, category, source) from <meta> tags and Schema.org JSON-LD for knowledge-base workflows
  • Template system: Customize output format with variables ({{title}}, {{content}}, {{author}}, {{published}}, {{date}}, etc.)
  • Link resolution: Converts relative URLs to absolute
  • Basic formatting: Headings, paragraphs, lists, links, images, fenced code (with optional language), GFM-style tables, bold/italic
  • Noise removal: Skips navigation, sidebars, footers, forms, and other chrome inside the parsed fragment
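
To make a few of these features concrete, here is a minimal sketch of how tag stripping and link resolution can be done with only `html.parser` and `urllib.parse`. It is not taken from `scripts/url2md.py`; the names `LinkTextParser`, `html_to_text`, and `SKIP_TAGS` are hypothetical, and the sketch handles only text and links, not the full set of formatting listed above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tags whose content is dropped entirely (mirrors the list in "Reader-style scope").
SKIP_TAGS = {"script", "style", "noscript", "template"}


class LinkTextParser(HTMLParser):
    """Collect visible text and render <a href> as Markdown links with absolute URLs."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.parts = []        # output fragments in document order
        self.skip_depth = 0    # > 0 while inside a skipped tag
        self.href = None       # absolute URL of the link being read, if any
        self.link_text = []    # text fragments inside the current link

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "a" and self.skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                # urljoin resolves relative paths against the page URL.
                self.href = urljoin(self.base_url, href)
                self.link_text = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "a" and self.href is not None:
            text = "".join(self.link_text).strip()
            self.parts.append(f"[{text}]({self.href})")
            self.href = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.parts.append(data)


def html_to_text(html, base_url):
    """Return whitespace-normalized text with Markdown links."""
    parser = LinkTextParser(base_url)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling `html_to_text(page_html, "https://example.com/blog/post")` on a page that links to `/docs/intro` would emit that link as `https://example.com/docs/intro`, while anything inside `<script>` or `<style>` is left out.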

Script Reference

scripts/url2md.py

Usage:

url2md.py [url] [options]

Options:

| Option | Description |
|--------|-------------|
| url | Single URL to convert |
| -o, --output | Output file (default: stdout) |
| -f, --file | File containing URLs to convert |
| -d, --dir | Output directory for batch conversion |
| --no-title | Skip adding page title as H1 |
| --full-page | Parse full <body> instead of <article>/<main> first (more chrome, wider coverage) |
| --timeout | Request timeout in seconds (default: 30) |
| --frontmatter | Add YAML frontmatter with extracted metadata |
| -t, --template | Path to a template file for customizing output |
| --filename-template | Batch mode filename pattern (e.g. {{date}}-{{title}}.md) |
| --download-images | Download remote images to a local folder (e.g. assets) |
| -v, --version | Show version |

Examples:

# Single URL to stdout
python3 scripts/url2md.py https://docs.python.org/3

# Save to file
python3 scripts/url2md.py https://docs.python.org/3 -o python-docs.md

# Batch with custom timeout
python3 scripts/url2md.py -f urls.txt -d ./output/ --timeout 60

# Skip title
python3 scripts/url2md.py https://example.com --no-title

# Whole body (no article/main focus)
python3 scripts/url2md.py https://example.com/sitemap --full-page -o sitemap.md

# YAML frontmatter (great for Obsidian / PKM)
python3 scripts/url2md.py https://example.com/article --frontmatter -o article.md

# Custom template
python3 scripts/url2md.py https://example.com/article -t article.tpl -o article.md

# Batch with smart filenames
python3 scripts/url2md.py -f urls.txt -d ./output/ --filename-template "{{date}}-{{title}}.md"

# Download images locally
python3 scripts/url2md.py https://example.com/article -o article.md --download-images assets
python3 scripts/url2md.py -f urls.txt -d ./output/ --download-images assets

Template variables: {{title}}, {{content}}, {{url}}, {{source}}, {{author}}, {{published}}, {{description}}, {{category}}, {{site_name}}, {{date}}, {{datetime}}
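
As an illustration of how `{{variable}}` placeholders like these can be filled in, the sketch below substitutes values with a regular expression. It is an assumption about the mechanism, not the script's actual code: `render_template` and `PLACEHOLDER` are hypothetical names, and rendering unknown variables as empty strings is a choice made here for the example.

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template, values):
    """Replace each {{name}} in template with values[name]."""

    def substitute(match):
        # Assumption: a variable with no value renders as an empty string.
        return str(values.get(match.group(1), ""))

    return PLACEHOLDER.sub(substitute, template)


# A possible template file body using variables from the list above.
ARTICLE_TEMPLATE = (
    "# {{title}}\n"
    "\n"
    "Source: {{url}}\n"
    "Saved: {{date}}\n"
    "\n"
    "{{content}}\n"
)
```

A template file passed with `-t` could contain text like `ARTICLE_TEMPLATE`; how the real script treats variables it cannot fill is not documented here, so check its output before relying on that behavior.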

When to Use

  • Converting documentation pages to Markdown for local reference
  • Archiving web articles as text files
  • Building a knowledge base with structured metadata (frontmatter / templates)
  • Building static content from dynamic sources
  • Extracting readable content when browser tools are unavailable
  • Batch processing a list of URLs
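
For the knowledge-base use case, it can help to see what a frontmatter block built from extracted metadata might look like. The sketch below is a hedged illustration: `build_frontmatter`, `yaml_quote`, and `FIELDS` are hypothetical names, the field order follows the metadata keys listed under Features, and the exact layout produced by `--frontmatter` may differ.

```python
# Metadata keys named in the Features section, in the order they are emitted here.
FIELDS = ("title", "author", "published", "description", "category", "source")


def yaml_quote(value):
    """Return value as a double-quoted YAML scalar with backslashes and quotes escaped."""
    text = " ".join(str(value).split())  # collapse newlines and runs of whitespace
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_frontmatter(meta):
    """Build a YAML frontmatter block, skipping fields that are missing or empty."""
    lines = ["---"]
    for key in FIELDS:
        value = meta.get(key)
        if value:
            lines.append(f"{key}: {yaml_quote(value)}")
    lines.append("---")
    return "\n".join(lines) + "\n"
```

Quoting every value keeps characters such as `:` or `#` in titles from being misread by YAML parsers in tools like Obsidian.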

Limitations

  • Converts static HTML only; does not execute JavaScript
  • Complex layouts (multi-column, heavy CSS) may lose structural fidelity
  • Login-required or paywalled pages need authentication, which none of the options listed above provide
  • Rate-limited sites may block repeated requests
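
One way to reduce the chance of being blocked by a rate-limited site is to convert URLs one at a time with a pause between requests, instead of using `-f` for the whole list. The wrapper below is a sketch under stated assumptions: `convert_politely` is a hypothetical helper, the `page-001.md` naming scheme is invented for the example, and the only parts taken from this document are the script path and the `-o` option.

```python
import subprocess
import time
from pathlib import Path


def convert_politely(urls, out_dir, delay_seconds=2.0,
                     run=subprocess.run, sleep=time.sleep):
    """Run url2md.py once per URL, sleeping between requests.

    Returns a list of (url, return_code) pairs. The run and sleep
    parameters exist so the function can be exercised without network access.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for index, url in enumerate(urls):
        if index:
            sleep(delay_seconds)  # pause before every request except the first
        # Hypothetical naming scheme: page-001.md, page-002.md, ...
        target = out_dir / f"page-{index + 1:03d}.md"
        cmd = ["python3", "scripts/url2md.py", url, "-o", str(target)]
        completed = run(cmd, check=False)
        results.append((url, completed.returncode))
    return results
```

A suitable delay depends on the site; a non-zero return code in the results marks a URL worth retrying later.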