tra-extract-text
Extract text from web pages using the trafilatura command-line tool.
Installation
pip install trafilatura
Usage
Basic text extraction (Markdown)
trafilatura -u URL --markdown
Extract plain text (no formatting; this is the default output)
trafilatura -u URL
Output to file
trafilatura -u URL --markdown > output.md
trafilatura -u URL > output.txt
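When saving several pages, it can be convenient to derive the output filename from the URL instead of typing it each time. The sketch below assumes a hypothetical helper, `url_to_filename`, which is not part of trafilatura; it relies only on standard shell tools (`printf`, `sed`) to turn a URL into a filesystem-safe name.

```shell
#!/bin/sh
# Hypothetical helper (not part of trafilatura): derive a safe filename from a URL.
# Usage: url_to_filename URL [EXTENSION]   (EXTENSION defaults to "md")
url_to_filename() {
    url=$1
    ext=${2:-md}
    # 1. Drop the scheme (e.g. "https://").
    # 2. Collapse every run of non-alphanumeric characters into one hyphen.
    # 3. Trim a leading or trailing hyphen.
    name=$(printf '%s' "$url" \
        | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' \
              -e 's|[^A-Za-z0-9]\{1,\}|-|g' \
              -e 's|^-||' -e 's|-$||')
    printf '%s.%s\n' "$name" "$ext"
}

# Example (needs trafilatura installed and network access):
#   url="https://news.example.com/article"
#   trafilatura -u "$url" --markdown > "$(url_to_filename "$url")"
```

The helper only prints a name, so redirection stays explicit at the call site and the function can be reused for any output format by passing a different extension.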
CLI Options
| Option | Description |
|--------|-------------|
| -u, --URL | Target URL to download and extract (input can also come from stdin or a URL list via -i) |
| --markdown | Output as Markdown |
| --output-format txt | Output as plain text (default) |
| --html | Output as HTML |
| --json | Output as JSON |
| --xml | Output as XML |
| -o, --output-dir | Write results to files in the given directory instead of stdout |
| --with-metadata | Include metadata (title, author, date) |
| --version | Show version information and exit |
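The format options are mutually exclusive, so a script normally selects exactly one. A minimal sketch, assuming a hypothetical helper `format_args` (not part of trafilatura) that validates a format name and prints the arguments for the `--output-format` option; the list of accepted names here is limited to the formats in the table and may differ between trafilatura releases.

```shell
#!/bin/sh
# Hypothetical helper (not part of trafilatura): validate a format name and
# print the matching command-line arguments.
format_args() {
    case $1 in
        txt|markdown|html|json|xml)
            # Pass the flag as data so printf does not parse it as an option.
            printf '%s %s\n' '--output-format' "$1" ;;
        *)
            printf 'unsupported format: %s\n' "$1" >&2
            return 1 ;;
    esac
}

# Example (needs trafilatura installed and network access):
#   trafilatura -u "https://example.com/post" $(format_args json)
```

Rejecting unknown names up front keeps a typo from reaching the command line, where it would otherwise surface as an argument-parsing error.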
Examples
Extract a Medium article to markdown:
trafilatura -u "https://medium.com/example/article" --markdown
Extract and save:
trafilatura -u "https://news.example.com/article" --markdown > article.md
Extract with metadata:
trafilatura -u "https://example.com/post" --markdown --with-metadata