PDF Field Extractor
AI-powered PDF structured data extraction — convert PDF key fields into Excel/JSON.
End-to-End Flow
User uploads PDF → Document type identification → AI field extraction → Structured output (Excel/JSON)
from scripts.pdf_extractor import extract_pdf_text
from scripts.field_extractor import extract_fields
from scripts.output_generator import generate_excel, generate_json
# Step 1: Extract PDF text (PyMuPDF + pdfplumber)
text, tables, images = extract_pdf_text("invoice.pdf")
# Step 2: AI field extraction (user provides own API Key, OpenAI-compatible)
fields = extract_fields(
text=text,
doc_type="invoice",
api_key="sk-xxx",
api_base="https://api.openai.com/v1",
model="gpt-4o",
)
Supported Document Types
| Type | Description | |------|-------------| | Invoice | VAT invoice, receipt invoice, electronic invoice | | Contract | Contracts, agreements | | Receipt | Receipts, tickets | | Bank Statement | Bank reconciliation statements | | License | Business license | | ID Card | ID card, passport | | Express | Waybill, shipping label | | Generic | User-defined custom extraction |
Detection Modes
| Mode | Description | |------|-------------| | Auto | AI automatically identifies document type | | Manual | User specifies document type |
Tiered Features
| Feature | FREE | PRO | |---------|:----:|:---:| | Monthly pages | 10 | Unlimited | | Document types | Invoice only | All types | | Output formats | Text | Excel + JSON + Text | | OCR languages | English | English + Chinese + 9 more | | Batch processing | 1 page | Unlimited | | Custom fields | — | Yes | | Price | Free | $0.01/call |
Technical Implementation
- PDF parsing: PyMuPDF (fitz) + pdfplumber for text and table extraction
- OCR: EasyOCR / Tesseract for scanned documents (multi-language support)
- AI extraction: OpenAI-compatible API, model-agnostic (GPT-4o, DeepSeek, GLM, etc.)
- Output: Excel (.xlsx) with formatted sheets, JSON with structured hierarchy
Output Format
Excel Output
- Sheet per document type
- Header row with field names
- Data rows with extracted values
- Color-coded by confidence
JSON Output
{
"doc_type": "invoice",
"fields": {
"invoice_number": "...",
"date": "...",
"amount": "...",
"buyer": "...",
"seller": "..."
},
"confidence": 0.95
}
Security Notes
- AI API calls: Uses
requests.postto OpenAI-compatible endpoints with user-provided API key (not stored) - Data storage: Uses
/tmp/pdf-extractor/for temporary processing files (no home directory write) - OCR: Local processing via EasyOCR/Tesseract (no external data transmission)
- Billing data:
FEISHU_USER_IDtransmitted toskillpay.me/api/v1/billingfor per-call charging
Billing
- Billing via
skillpay.me/api/v1/billing/charge - User data transmitted to SkillPay for billing identification
- $0.01 USD per extraction call (PRO tier)
Required Environment Variables
| Variable | Description |
|----------|-------------|
| FEISHU_USER_ID | User open_id for billing |
| SKILL_BILLING_API_KEY | SkillPay Builder API Key |
| SKILL_BILLING_SKILL_ID | SkillPay Skill ID (default: pdf-extractor) |
Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| NO_TEXT_EXTRACTED | Scanned PDF without OCR | Enable OCR or use digital PDF |
| UNSUPPORTED_DOC_TYPE | Document type not recognized | Specify type manually |
| API_ERROR | AI API key invalid or quota exceeded | Check API key |
微信扫一扫