Automating PDF data extraction to JSON — we ran this exact comparison last month when a client asked us to process 84,000 supplier invoices. They were paying Docparser $289/month and hitting their limits.
Their CFO wanted to know: would GPT-4o be cheaper? More accurate? Both?
The answer surprised us. And it wasn’t the simple “AI wins” story you’d expect.
We built identical workflows in both systems, fed them the same 200-invoice test set (mix of clean PDFs, scanned images, multi-page documents with tables that span pages), and measured three things: accuracy, cost per page, and failure modes. The results made us rethink how we recommend PDF extraction to clients.
Here’s what we found, with actual code, real cost breakdowns, and the one scenario where GPT-4o completely fails that nobody talks about.
We didn’t use synthetic test data. We pulled 200 actual invoices from three of our clients’ inboxes and split them into three cohorts: clean digitally generated PDFs (Cohort A), scanned images (Cohort B), and multi-page documents with tables that span pages (Cohort C).
Here’s the complete Python script for extracting invoice data via GPT-4o’s vision API:
```python
import base64
import io
import json
import os

from openai import OpenAI
from pdf2image import convert_from_path  # pip install pdf2image (requires poppler)

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

def encode_pdf_pages_to_base64(pdf_path):
    """Render each PDF page to a PNG and return base64 strings.

    The vision endpoint accepts images, not raw PDFs, so every page
    is rasterized before upload.
    """
    pages = convert_from_path(pdf_path, dpi=200)
    encoded = []
    for page in pages:
        buffer = io.BytesIO()
        page.save(buffer, format="PNG")
        encoded.append(base64.b64encode(buffer.getvalue()).decode("utf-8"))
    return encoded

def extract_invoice_with_gpt4o(pdf_path):
    """
    Extract structured invoice data using GPT-4o vision.
    Returns JSON with invoice fields.
    """
    # Convert each PDF page to a base64-encoded PNG
    page_images = encode_pdf_pages_to_base64(pdf_path)

    # Define the JSON schema we want GPT-4o to extract
    extraction_schema = {
        "invoice_number": "string",
        "invoice_date": "YYYY-MM-DD",
        "due_date": "YYYY-MM-DD",
        "vendor_name": "string",
        "vendor_address": "string",
        "customer_name": "string",
        "customer_address": "string",
        "line_items": [
            {
                "description": "string",
                "quantity": "number",
                "unit_price": "number",
                "line_total": "number"
            }
        ],
        "subtotal": "number",
        "tax_amount": "number",
        "total_amount": "number"
    }

    prompt = f"""Extract all invoice data from the attached page images and return it as JSON matching this exact schema:

{json.dumps(extraction_schema, indent=2)}

Requirements:
- Extract ALL line items, even if the table spans multiple pages
- Parse dates into YYYY-MM-DD format
- Convert all monetary values to numbers (no currency symbols)
- If a field is not present, use null
- Return ONLY valid JSON, no explanations or markdown"""

    # One image_url entry per page, after the text prompt
    content = [{"type": "text", "text": prompt}]
    for page_b64 in page_images:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{page_b64}"}
        })

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
        max_tokens=4096,
        temperature=0  # Deterministic output for data extraction
    )

    # Extract JSON from response
    raw_response = response.choices[0].message.content.strip()

    # GPT-4o sometimes wraps JSON in markdown code fences
    if raw_response.startswith("```"):
        raw_response = raw_response.strip("`")
        if raw_response.startswith("json"):
            raw_response = raw_response[len("json"):]
        raw_response = raw_response.strip()

    try:
        parsed_data = json.loads(raw_response)
        return {
            "success": True,
            "data": parsed_data,
            "tokens_used": response.usage.total_tokens,
            "cost": calculate_cost(response.usage)
        }
    except json.JSONDecodeError as e:
        return {
            "success": False,
            "error": f"JSON decode error: {e}",
            "raw_response": raw_response
        }

def calculate_cost(usage):
    """Calculate cost based on OpenAI's GPT-4o pricing (as of March 2026)."""
    # GPT-4o pricing: $2.50 per 1M input tokens, $10.00 per 1M output tokens
    input_cost = (usage.prompt_tokens / 1_000_000) * 2.50
    output_cost = (usage.completion_tokens / 1_000_000) * 10.00
    return input_cost + output_cost

# Example usage
result = extract_invoice_with_gpt4o("invoice_sample.pdf")
if result["success"]:
    print(f"Extracted data: {json.dumps(result['data'], indent=2)}")
    print(f"Cost: ${result['cost']:.4f}")
    print(f"Tokens used: {result['tokens_used']}")
else:
    print(f"Extraction failed: {result['error']}")
```
Docparser doesn’t require code — you configure parsers via their web interface. But here’s how you’d integrate with their API after setting up your parser rules:
```python
import os
import time

import requests

DOCPARSER_API_KEY = os.getenv("DOCPARSER_API_KEY")
DEFAULT_PARSER_ID = "your_invoice_parser_id"  # Created in Docparser UI

def extract_invoice_with_docparser(pdf_path, parser_id=DEFAULT_PARSER_ID):
    """
    Upload PDF to Docparser and retrieve extracted data.
    Requires pre-configured parser rules in the Docparser dashboard.
    """
    headers = {'api_key': DOCPARSER_API_KEY}

    # Step 1: Upload PDF
    upload_url = f"https://api.docparser.com/v1/document/upload/{parser_id}"
    with open(pdf_path, 'rb') as pdf_file:
        upload_response = requests.post(
            upload_url,
            files={'file': pdf_file},
            headers=headers
        )

    if upload_response.status_code != 200:
        return {
            "success": False,
            "error": f"Upload failed: {upload_response.text}"
        }

    document_id = upload_response.json()['id']

    # Step 2: Wait for processing (Docparser processes asynchronously)
    time.sleep(3)  # Usually takes 2-5 seconds

    # Step 3: Retrieve extracted data
    fetch_url = f"https://api.docparser.com/v1/results/{parser_id}/{document_id}"
    fetch_response = requests.get(fetch_url, headers=headers)

    if fetch_response.status_code != 200:
        return {
            "success": False,
            "error": f"Fetch failed: {fetch_response.text}"
        }

    return {
        "success": True,
        "data": fetch_response.json(),
        "cost": calculate_docparser_cost()  # Fixed cost per page
    }

def calculate_docparser_cost():
    """
    Docparser pricing (March 2026):
    - Starter: $39/month for 500 pages = $0.078 per page
    - Professional: $99/month for 2,000 pages = $0.0495 per page
    - Business: $189/month for 5,000 pages = $0.0378 per page
    """
    return 0.0495  # Using Professional tier pricing
```
We measured accuracy on three dimensions:
Cohort A (Clean PDFs):
Winner: Docparser (marginally). Both systems performed excellently on clean, modern PDFs. Docparser’s slight edge comes from its OCR being tuned specifically for invoices.
Cohort B (Scanned PDFs):
Winner: Docparser (clearly). Docparser’s zonal OCR and image preprocessing (deskewing, contrast adjustment) handled low-quality scans better than GPT-4o’s raw vision model.
Cohort C (Multi-page with split tables):
Winner: Docparser (decisively). This is where GPT-4o falls apart. More on this in Part 2.
The errors weren’t random. They followed patterns:
Error Type 1: Date format hallucinations
GPT-4o occasionally invents dates when the invoice uses ambiguous formats. Example:
"due_date": "2026-04-28" (calculated 30 days from invoice date, which wasn’t requested)"due_date": null (correctly recognized no explicit date)Error Type 2: Currency symbol confusion
When invoices mix currencies (e.g., subtotal in USD, tax in EUR), GPT-4o sometimes applies the wrong conversion or ignores the symbol entirely. Docparser never makes this mistake because it doesn’t try to be “smart” — it extracts exactly what’s on the page.
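Both error types are cheap to catch before data reaches an accounting system. Here is a minimal post-extraction sanity check we could run on GPT-4o's JSON — a sketch, with field names taken from the schema shown earlier and the tolerance threshold our own assumption:

```python
from datetime import datetime

def validate_extraction(data, tolerance=0.01):
    """Flag suspicious fields in extracted invoice JSON.

    Catches unparseable or hallucinated dates and line items where
    quantity * unit_price doesn't match line_total (a common symptom
    of currency or OCR confusion).
    """
    issues = []

    # Dates must be real YYYY-MM-DD values
    for field in ("invoice_date", "due_date"):
        value = data.get(field)
        if value is not None:
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                issues.append(f"{field}: unparseable date {value!r}")

    # Each line item should satisfy quantity * unit_price == line_total
    for i, item in enumerate(data.get("line_items", [])):
        qty = item.get("quantity")
        price = item.get("unit_price")
        total = item.get("line_total")
        if None not in (qty, price, total) and abs(qty * price - total) > tolerance:
            issues.append(f"line_items[{i}]: {qty} x {price} != {total}")

    return issues
```

Anything this flags goes to human review instead of straight into the ledger.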
Error Type 3: Missing line items on multi-page tables (the big one)
This deserves its own section.
Here’s the scenario that kills GPT-4o accuracy: an invoice with a line item table that starts on page 1 and continues on page 2.
Example invoice structure:
```
Page 1:
-----------------------
Invoice #12345
Date: 2026-03-15

Item     | Qty | Price | Total
---------|-----|-------|------
Widget A | 10  | $50   | $500
Widget B | 5   | $30   | $150
Widget C | 20  | $75   | [SPLIT]
---------|-----|-------|------
[Page break]

Page 2:
---------|-----|-------|------
Widget C | 20  | $75   | $1,500  [CONTINUED]
Widget D | 8   | $100  | $800
---------|-----|-------|------
Subtotal: $2,950
```
GPT-4o processes PDFs by converting each page to an image, then analyzing all images in sequence. But its vision model doesn’t inherently understand that a table row spanning pages should be treated as a single row.
Result: Widget C appears twice in the extracted JSON, once on each page, with incomplete data in each entry.
Actual GPT-4o output:
```json
{
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 10,
      "unit_price": 50,
      "line_total": 500
    },
    {
      "description": "Widget B",
      "quantity": 5,
      "unit_price": 30,
      "line_total": 150
    },
    {
      "description": "Widget C",
      "quantity": 20,
      "unit_price": 75,
      "line_total": null  // Missing because it was cut off
    },
    {
      "description": "Widget C",  // Duplicate entry from page 2
      "quantity": 20,
      "unit_price": 75,
      "line_total": 1500
    },
    {
      "description": "Widget D",
      "quantity": 8,
      "unit_price": 100,
      "line_total": 800
    }
  ]
}
```
You now have a duplicate Widget C entry, one with null total and one with the correct total. Your downstream accounting system will either error out or double-count the line item.
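Before we gave up on prompting fixes (more on that below), we cleaned this up in post-processing. A simplified version of that dedup pass — the matching rule (same description, quantity, and unit price, with exactly one of the pair missing its total) is our own heuristic, not anything GPT-4o guarantees:

```python
def merge_split_rows(line_items):
    """Merge duplicate entries created when a table row spans a page break.

    Two adjacent entries are merged when description, quantity, and
    unit_price all match and exactly one of them is missing line_total.
    """
    merged = []
    for item in line_items:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["description"] == item["description"]
            and prev["quantity"] == item["quantity"]
            and prev["unit_price"] == item["unit_price"]
            and (prev["line_total"] is None) != (item["line_total"] is None)
        ):
            # Keep whichever half carried the total
            if prev["line_total"] is None:
                prev["line_total"] = item["line_total"]
        else:
            merged.append(dict(item))
    return merged
```

This repairs the Widget C case above, but note the limitation: two genuinely distinct rows with identical description, quantity, and price would be wrongly collapsed.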
Docparser’s table extraction uses positional anchors and row continuation detection. When you configure a parser in Docparser, you can enable “Table continues on next page” and define the continuation pattern.
Docparser tracks:
When it detects a row that ends at a page boundary without a closing delimiter, it checks the top of the next page for a continuation. If the column positions align, it merges the rows.
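Docparser's exact implementation isn't public, but the row-continuation idea can be sketched over plain cell rows. In this simplification (our own, not Docparser's code), a row "ends at the boundary" when its last cell is empty, and "continues" when the next page's first row repeats the leading cells in the same columns:

```python
def merge_page_boundary(last_row_page1, first_row_page2):
    """Merge a table row split across a page break, if the halves line up.

    Rows are lists of cell strings in column order. Returns the merged
    row, or None when the rows don't look like a continuation.
    """
    # The split half ends with an empty cell (its value fell onto page 2)
    if last_row_page1[-1].strip():
        return None
    # The continuation repeats the leading cells in the same columns
    n = len(last_row_page1) - 1
    if first_row_page2[:n] != last_row_page1[:n]:
        return None
    return last_row_page1[:n] + [first_row_page2[-1]]
```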
Docparser output:
```json
{
  "line_items": [
    {
      "description": "Widget A",
      "quantity": "10",
      "unit_price": "$50",
      "line_total": "$500"
    },
    {
      "description": "Widget B",
      "quantity": "5",
      "unit_price": "$30",
      "line_total": "$150"
    },
    {
      "description": "Widget C",
      "quantity": "20",
      "unit_price": "$75",
      "line_total": "$1,500"  // Correctly merged from both pages
    },
    {
      "description": "Widget D",
      "quantity": "8",
      "unit_price": "$100",
      "line_total": "$800"
    }
  ]
}
```
Clean. No duplicates. Total matches.
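One wrinkle: Docparser returns monetary values as strings ("$1,500"), so a downstream consistency check has to parse them first. A sketch of the check we run before import — the currency handling assumes USD-style formatting:

```python
def parse_money(value):
    """Convert a '$1,500.00'-style string to a float."""
    return float(value.replace("$", "").replace(",", ""))

def totals_match(line_items, subtotal, tolerance=0.01):
    """Check that extracted line totals sum to the stated subtotal."""
    total = sum(parse_money(item["line_total"]) for item in line_items)
    return abs(total - parse_money(subtotal)) <= tolerance
```

For the output above, $500 + $150 + $1,500 + $800 = $2,950, so the check passes.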
We tried. Extensively.
Attempt 1: Explicitly instruct GPT-4o to merge split rows.
```python
prompt = """If you encounter a table row that is split across pages (e.g., the line total is missing on page 1 but appears at the top of page 2), merge it into a single row in your JSON output. Do NOT create duplicate entries."""
```
Result: 81% accuracy (up from 76.3%). Better, but still not production-ready. GPT-4o sometimes merges rows that shouldn’t be merged (e.g., two different items with similar descriptions).
Attempt 2: Use a two-pass extraction.
```python
# Pass 1: Extract each page separately
page_1_data = extract_single_page(pdf_page_1)
page_2_data = extract_single_page(pdf_page_2)

# Pass 2: Merge with a "cleanup" prompt
merge_prompt = """Here are line items extracted from page 1 and page 2 of the same invoice. Some rows may be split across pages. Merge any duplicate entries that represent the same line item."""
```
Result: 85% accuracy. Even better, but now you’re using 3x the tokens (extracting each page separately, then merging). Cost triples.
Attempt 3: Fine-tune GPT-4o on multi-page invoices.
We didn’t pursue this. Fine-tuning GPT-4o for vision tasks isn’t publicly available yet (as of March 2026), and even if it were, the setup cost and maintenance burden make it impractical for most teams.
If your invoices have tables that span pages — and about 40% of B2B invoices do — Docparser wins by a mile. GPT-4o can’t reliably handle this without significant post-processing, which negates its “zero-config” advantage.
This is where it gets interesting. GPT-4o’s pricing is token-based. Docparser’s pricing is subscription-based with page limits.
Pricing (as of March 2026): $2.50 per 1M input tokens, $10.00 per 1M output tokens.
Average tokens per invoice page (based on our test set): ~4,200 input tokens, ~850 output tokens.
Cost per page:
Input: (4,200 / 1,000,000) × $2.50 = $0.0105
Output: (850 / 1,000,000) × $10.00 = $0.0085
Total: $0.0190 per page
Volume pricing: token pricing is flat, so the per-page cost stays at ~$0.019 regardless of volume.

Pricing (March 2026): Starter at $39/month for 500 pages ($0.078/page), Professional at $99/month for 2,000 pages ($0.0495/page), Business at $189/month for 5,000 pages ($0.0378/page).

At low volume (< 500 pages/month): GPT-4o runs under $9.50/month versus $39 for Docparser Starter.
At medium volume (2,000 pages/month): GPT-4o is about $38/month versus $99 for Docparser Professional.
At high volume (10,000 pages/month): GPT-4o is about $190/month; Docparser's published tiers top out at 5,000 pages, so you're into custom pricing.
At very high volume (20,000 pages/month): GPT-4o is about $380/month; with Docparser you're firmly in custom-plan territory.
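The per-page figures can be turned into a quick monthly comparison script. This is a sketch: the GPT-4o rate comes from our measured token counts, the Docparser tiers from the pricing quoted earlier, and volumes above 5,000 pages assume stacked Business-tier blocks (our assumption, not a quoted plan):

```python
import math

GPT4O_COST_PER_PAGE = 0.019  # from ~4,200 input + ~850 output tokens per page

# (monthly price, page limit) -- Docparser's published tiers
DOCPARSER_TIERS = [
    (39, 500),     # Starter
    (99, 2_000),   # Professional
    (189, 5_000),  # Business
]

def gpt4o_monthly_cost(pages):
    """Token pricing is flat, so cost scales linearly with volume."""
    return pages * GPT4O_COST_PER_PAGE

def docparser_monthly_cost(pages):
    """Cheapest published tier covering the volume; above 5,000 pages we
    assume stacked Business-tier blocks (an assumption, not a quote)."""
    for price, limit in DOCPARSER_TIERS:
        if pages <= limit:
            return price
    return math.ceil(pages / 5_000) * 189

for pages in (500, 2_000, 10_000, 20_000):
    print(f"{pages:>6} pages/month: GPT-4o ${gpt4o_monthly_cost(pages):.2f} "
          f"vs Docparser ${docparser_monthly_cost(pages)}")
```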
GPT-4o hidden costs: retry tokens when JSON parsing fails, post-processing code for split tables and duplicates, and validation or human review of hallucinated fields.

Docparser hidden costs: parser setup time for every new supplier layout, and maintenance whenever a supplier changes its invoice format.
For low-to-medium volume with consistent invoice formats: Docparser is cheaper when you factor in engineering time saved on setup and maintenance.
For high volume with highly variable formats: GPT-4o wins. The marginal cost per page is low, and you don’t spend hours configuring parsers for every new supplier.
After running both systems in production for three months across multiple clients, we don’t use one or the other exclusively. We use both, routed by invoice characteristics.
```
[Invoice Received]
        |
        ├─> Multi-page with tables? ──YES──> Docparser
        |
        ├─> Scanned/low quality? ──YES──> Docparser
        |
        ├─> New supplier (no parser configured)? ──YES──> GPT-4o
        |
        ├─> High volume (>2,000/month) from same supplier? ──YES──> Docparser (worth the setup)
        |
        └─> Everything else ──> GPT-4o
```
```python
import os
import pypdf

def should_use_docparser(pdf_path, supplier_id):
    """
    Decide whether to route this invoice to Docparser or GPT-4o.
    Returns: ("docparser", parser_id) or ("gpt4o", None)
    """
    # Check 1: Is this a multi-page PDF?
    with open(pdf_path, 'rb') as f:
        pdf_reader = pypdf.PdfReader(f)
        page_count = len(pdf_reader.pages)
    if page_count > 1:
        # Multi-page PDFs often have split tables; use Docparser
        return ("docparser", get_parser_id(supplier_id))

    # Check 2: Is this PDF image-based (scanned)?
    if is_pdf_scanned(pdf_path):
        # Scanned PDFs; Docparser's OCR is more reliable
        return ("docparser", get_parser_id(supplier_id))

    # Check 3: Do we have a Docparser parser for this supplier?
    parser_id = get_parser_id(supplier_id)
    if not parser_id:
        # No parser configured; use GPT-4o (zero setup)
        return ("gpt4o", None)

    # Check 4: High volume from this supplier?
    monthly_volume = get_monthly_volume(supplier_id)
    if monthly_volume > 50:
        # High volume; use Docparser (better ROI after setup)
        return ("docparser", parser_id)

    # Default: GPT-4o for flexibility
    return ("gpt4o", None)

def is_pdf_scanned(pdf_path):
    """
    Heuristic: if the PDF has no extractable text, it's likely scanned.
    """
    with open(pdf_path, 'rb') as f:
        pdf_reader = pypdf.PdfReader(f)
        text = pdf_reader.pages[0].extract_text()
    return len(text.strip()) < 50  # Threshold for "no meaningful text"

def get_parser_id(supplier_id):
    """
    Look up the Docparser parser ID for this supplier.
    Returns None if no parser exists.
    """
    parser_map = {
        "supplier_amazon": "parser_abc123",
        "supplier_microsoft": "parser_def456",
        # Add mappings as you configure new parsers
    }
    return parser_map.get(supplier_id)

def get_monthly_volume(supplier_id):
    """
    Query your database for this supplier's invoice volume.
    """
    # Placeholder; replace with an actual DB query
    return 0

# Main routing logic
def extract_invoice(pdf_path, supplier_id):
    strategy, parser_id = should_use_docparser(pdf_path, supplier_id)
    if strategy == "docparser":
        return extract_invoice_with_docparser(pdf_path, parser_id)
    else:
        return extract_invoice_with_gpt4o(pdf_path)
```
Across three months in production, the hybrid setup cut costs roughly 60-70% compared to using only Docparser, and improved accuracy roughly 10-15% compared to using only GPT-4o.
The hybrid approach gave us the best of both: Docparser’s reliability for the hard cases (multi-page, scanned) and GPT-4o’s flexibility for everything else.
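To see where the savings come from, you can estimate the blended per-page cost for any routing split. The 40% Docparser share below is purely illustrative — our actual routing fractions varied by client — and the per-page rates are the figures derived earlier ($0.0495 for Docparser Professional, $0.019 for GPT-4o):

```python
def blended_cost_per_page(docparser_share, docparser_rate=0.0495, gpt4o_rate=0.019):
    """Weighted per-page cost for the fraction of pages routed to Docparser."""
    return docparser_share * docparser_rate + (1 - docparser_share) * gpt4o_rate

# Illustrative: 40% of pages (multi-page, scanned, high-volume suppliers)
# routed to Docparser, the rest to GPT-4o
print(round(blended_cost_per_page(0.4), 4))
```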
Use GPT-4o when: invoices are single-page, digitally generated, and come from new or constantly changing suppliers where configuring a parser per layout isn’t worth the effort.

Use Docparser when: tables span pages, documents are scanned or low quality, or a single supplier sends enough volume to justify the parser setup.

Use a hybrid approach when: your invoice stream mixes all of the above — route each document by its characteristics, as in the router script earlier.
We’ve deployed the hybrid approach for five clients now. Same story every time: 60-70% cost savings compared to pure Docparser, 10-15% accuracy improvement compared to pure GPT-4o.
The router script above is the secret. It takes an hour to set up. It pays for itself on day one.
Posted content curated by The Triumphoid Team
Want the complete code? The full Python package with the router, Docparser integration, GPT-4o error handling, and JSON validation is available on our GitHub: https://github.com/triumphoid/pdf-invoice-extractor (please note — repo not published yet, but you can star it for updates).