PDF Field Extractor

AI-powered PDF structured data extraction — convert PDF key fields into Excel/JSON.

End-to-End Flow

User uploads PDF → Document type identification → AI field extraction → Structured output (Excel/JSON)

from scripts.pdf_extractor import extract_pdf_text
from scripts.field_extractor import extract_fields
from scripts.output_generator import generate_excel, generate_json

# Step 1: Extract PDF text (PyMuPDF + pdfplumber)
text, tables, images = extract_pdf_text("invoice.pdf")

# Step 2: AI field extraction (user provides own API Key, OpenAI-compatible)
fields = extract_fields(
    text=text,
    doc_type="invoice",
    api_key="sk-xxx",
    api_base="https://api.openai.com/v1",
    model="gpt-4o",
)

Supported Document Types

| Type | Description | |------|-------------| | Invoice | VAT invoice, receipt invoice, electronic invoice | | Contract | Contracts, agreements | | Receipt | Receipts, tickets | | Bank Statement | Bank reconciliation statements | | License | Business license | | ID Card | ID card, passport | | Express | Waybill, shipping label | | Generic | User-defined custom extraction |

Detection Modes

| Mode | Description | |------|-------------| | Auto | AI automatically identifies document type | | Manual | User specifies document type |

Tiered Features

| Feature | FREE | PRO | |---------|:----:|:---:| | Monthly pages | 10 | Unlimited | | Document types | Invoice only | All types | | Output formats | Text | Excel + JSON + Text | | OCR languages | English | English + Chinese + 9 more | | Batch processing | 1 page | Unlimited | | Custom fields | — | Yes | | Price | Free | $0.01/call |

Technical Implementation

PDF parsing: PyMuPDF (fitz) + pdfplumber for text and table extraction
OCR: EasyOCR / Tesseract for scanned documents (multi-language support)
AI extraction: OpenAI-compatible API, model-agnostic (GPT-4o, DeepSeek, GLM, etc.)
Output: Excel (.xlsx) with formatted sheets, JSON with structured hierarchy

Output Format

Excel Output

Sheet per document type
Header row with field names
Data rows with extracted values
Color-coded by confidence

JSON Output

{
  "doc_type": "invoice",
  "fields": {
    "invoice_number": "...",
    "date": "...",
    "amount": "...",
    "buyer": "...",
    "seller": "..."
  },
  "confidence": 0.95
}

Security Notes

AI API calls: Uses requests.post to OpenAI-compatible endpoints with user-provided API key (not stored)
Data storage: Uses /tmp/pdf-extractor/ for temporary processing files (no home directory write)
OCR: Local processing via EasyOCR/Tesseract (no external data transmission)
Billing data: FEISHU_USER_ID transmitted to skillpay.me/api/v1/billing for per-call charging

Billing

Billing via skillpay.me/api/v1/billing/charge
User data transmitted to SkillPay for billing identification
$0.01 USD per extraction call (PRO tier)

Required Environment Variables

| Variable | Description | |----------|-------------| | FEISHU_USER_ID | User open_id for billing | | SKILL_BILLING_API_KEY | SkillPay Builder API Key | | SKILL_BILLING_SKILL_ID | SkillPay Skill ID (default: pdf-extractor) |

Common Errors

| Error | Cause | Solution | |-------|-------|----------| | NO_TEXT_EXTRACTED | Scanned PDF without OCR | Enable OCR or use digital PDF | | UNSUPPORTED_DOC_TYPE | Document type not recognized | Specify type manually | | API_ERROR | AI API key invalid or quota exceeded | Check API key |