google-images-crawler

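Crawl high-resolution original images from Google Images search. Use this skill when the user needs to (1) search Google and download images, (2) get original/full-size images rather than thumbnails, (3) batch-download images by keyword, or (4) extract image URLs from Google Images search results. Supports specifying the number of images, filtering by size, and downloading to local storage.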

Google Images Crawler

Crawl original (non-thumbnail) images from Google Images search results.

Key Difference: Original vs Thumbnail

Thumbnail URLs (low quality, avoid): https://encrypted-tbn0.gstatic.com/images?q=tbn:...
Original URLs (high quality, target): External domain links like https://example.com/photo.jpg

This skill extracts the original high-resolution images, not the low-quality thumbnails.
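
The distinction above can be checked by host name rather than by substring. The following is a minimal sketch, not part of the skill's scripts: the function name is_thumbnail_url and the THUMBNAIL_HOST_SUFFIXES list are hypothetical, and the list assumes thumbnails are served only from gstatic.com hosts, as in the example URL above.

```python
from urllib.parse import urlparse

# Assumption: Google-generated thumbnails are served from gstatic.com hosts.
THUMBNAIL_HOST_SUFFIXES = ("gstatic.com",)

def is_thumbnail_url(url):
    """Return True if the URL's host matches a known thumbnail host suffix."""
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in THUMBNAIL_HOST_SUFFIXES
    )
```

Matching on the parsed host avoids rejecting an original image whose path merely happens to contain the word gstatic.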

Quick Start

1. Search and Get Image URLs

python3 scripts/crawl_google_images.py "search keyword" --count 10

2. Download Images

python3 scripts/download_images.py urls.txt --output ./images

Methods for Extracting Original Images

Method 1: From href links (Recommended)

Google Images wraps the original URL in the imgurl query parameter of each result link:

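import re
from urllib.parse import unquote

# Extract from a[href*="imgurl="] links; href is the URL of one such link
match = re.search(r'imgurl=([^&]+)', href)
# The parameter value is percent-encoded, so decode it (None if no match)
original_url = unquote(match.group(1)) if match else None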

Method 2: From page scripts

Parse JSON embedded in page HTML containing image metadata.
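
As a rough illustration of this method, the sketch below scans the text of a script tag for quoted string literals that look like image URLs and decodes their JSON escapes. The layout of Google's embedded data is undocumented and changes over time, so the pattern, the function name extract_image_urls_from_script, and the extension list are all assumptions rather than a description of the actual page format.

```python
import json
import re

# Assumption: image URLs appear as double-quoted JSON string literals that
# end in a common image extension. URLs with query strings are not matched.
IMAGE_URL_LITERAL = re.compile(
    r'"(https?://[^"\\]*(?:\\.[^"\\]*)*\.(?:jpe?g|png|gif|webp))"',
    re.IGNORECASE,
)

def extract_image_urls_from_script(script_text):
    """Return unique image URLs found in script text, in order of appearance."""
    urls = []
    for literal in IMAGE_URL_LITERAL.findall(script_text):
        try:
            # Decode JSON escapes such as \u003d inside the literal
            url = json.loads(f'"{literal}"')
        except json.JSONDecodeError:
            continue  # skip literals that are not valid JSON strings
        if url not in urls:
            urls.append(url)
    return urls
```

The results would still need the thumbnail filter applied, since embedded metadata usually lists thumbnail and original URLs side by side.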

Method 3: From rg_meta divs (Legacy)

import json

# Google sometimes embeds metadata as JSON in div.rg_meta
data = json.loads(div.text_content())  # text_content is a method, so call it
original_url = data['ou']  # 'ou' holds the original URL

Core Script

Use scripts/crawl_google_images.py:

from playwright.sync_api import sync_playwright
from urllib.parse import quote_plus, unquote
import re

def crawl_google_images(keyword, count=10):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        # Navigate to Google Images; URL-encode the keyword so spaces
        # and special characters survive in the query string
        page.goto(f"https://www.google.com/search?q={quote_plus(keyword)}&tbm=isch")
        page.wait_for_timeout(3000)
        
        # Method 1: Extract from imgurl parameter
        links = page.eval_on_selector_all('a[href*="imgurl="]', 
            'els => els.map(e => e.href)')
        
        original_urls = []
        for link in links:
            match = re.search(r'imgurl=([^&]+)', link)
            if match:
                # Decode every percent-escape, not only %3A and %2F
                url = unquote(match.group(1))
                # Skip Google-hosted URLs, which are thumbnails
                if 'gstatic' not in url and 'google' not in url:
                    original_urls.append(url)
        
        browser.close()
        return original_urls[:count]

Download Script

import requests

def download_image(url, output_path):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }
    r = requests.get(url, headers=headers, timeout=30)
    if r.status_code == 200:
        with open(output_path, "wb") as f:
            f.write(r.content)
        return len(r.content)
    return 0
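
download_image expects an output_path for each URL. One way to derive it is sketched below; the helper name output_path_for, the image_NNN naming scheme, and the fallback to .jpg are illustrative assumptions, not the behavior of scripts/download_images.py.

```python
import os
from urllib.parse import urlparse

# Assumption: only these extensions are kept; anything else falls back to .jpg
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def output_path_for(url, index, output_dir="./images"):
    """Build a local file path from the URL's extension and a running index."""
    # Use only the URL path, so query strings do not leak into the extension
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        ext = ".jpg"
    return os.path.join(output_dir, f"image_{index:03d}{ext}")
```

Numbering the files instead of reusing the remote file name avoids collisions when several sites serve a file called photo.jpg.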

Common Issues

Connection reset: Some sites block scrapers. Retry the request, varying headers such as User-Agent between attempts.
Low-resolution thumbnails: Always filter out URLs containing gstatic or google.
Rate limiting: Add delays between requests.
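
The retry and delay advice above could be combined into a small wrapper like the one below. This is a sketch under stated assumptions: fetch_with_retry is a hypothetical helper, fetch is any callable that takes a URL (for example a function wrapping requests.get), and the exponential delay schedule is one common choice rather than a requirement.

```python
import time

def fetch_with_retry(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponentially growing delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # Connection resets are OSError subclasses, as are the
            # exceptions raised by requests
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
```

The sleep parameter exists so the delay can be replaced in tests; in normal use the default time.sleep applies.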

References

references/advanced_filtering.md - Size, type, and color filtering options
references/api_alternative.md - Using Google Custom Search API as alternative