# Document Processing Pipeline

Build a workflow that processes uploaded documents through OCR, LLM-based entity extraction, and classification, then stores the structured results.

## What You'll Build
```mermaid
graph TD
    subgraph "Ingestion"
        A[Document Upload<br>PDF, Image, Scan] --> B[OCR<br>Extract Text]
    end
    subgraph "Extraction"
        B --> C[LLM: Extract Entities<br>Names, Dates, Amounts]
        B --> D[LLM: Classify Document<br>Invoice, Contract, Letter]
    end
    subgraph "Enrichment"
        C --> E[LLM: Validate & Structure<br>JSON Output]
        D --> E
    end
    subgraph "Storage"
        E --> F[Output Dataset<br>Structured Records]
        E --> G[Knowledge Graph<br>Entity Relationships]
    end
```

Use cases:
- Invoice and receipt processing
- Contract analysis and extraction
- Medical records processing
- Legal document discovery
- Insurance claims processing
- HR document digitization
## Prerequisites
| Component | Description | Example |
|---|---|---|
| OCR Model | Text extraction from images/PDFs | Tesseract, PaddleOCR, Azure Document Intelligence |
| LLM Model | For NER and classification | Ollama (Llama, Mistral), GPT-4, Claude |
| Output Dataset | Structured results storage | Dataset with entity columns |
| Graph (optional) | Entity relationship storage | For linking entities across documents |
## Step 1: Create the Workflow

Create the workflow and an initial version to attach nodes to; the Complete Workflow Code section below shows the exact `create_workflow` and `create_workflow_version` calls.
## Step 2: Add OCR Node

Extract text from uploaded documents (PDFs, images, scans).
```python
# OCR node - extracts text from documents
ocr_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Text (OCR)",
    entity_type="model",
    entity_id=ocr_model.id,
    config={
        "input_template": "{{input}}",
        "timeout": 120,
        "config": {
            "languages": ["en"],  # Add more languages as needed
            "dpi": 300,  # Higher DPI for better accuracy
            "output_format": "text_with_positions"  # Include word positions
        }
    }
)
```

Expected output:
```json
{
  "text": "INVOICE\n\nInvoice Number: INV-2024-0042\nDate: January 15, 2024\n\nBill To:\nAcme Corporation\n123 Business Street\nNew York, NY 10001\n\nDescription Qty Price Total\nConsulting Services 10 $150 $1,500\nSoftware License 1 $500 $500\n\nSubtotal: $2,000\nTax (8%): $160\nTotal Due: $2,160",
  "pages": 1,
  "confidence": 0.94,
  "words": [
    {"text": "INVOICE", "x": 0.4, "y": 0.05, "confidence": 0.99},
    ...
  ]
}
```

## Step 3: Add Document Classification Node
Classify the document type to determine extraction rules.
````python
# Classification node
classify_node = client.create_workflow_node(
    version_id=version.id,
    name="Classify Document",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Classify this document into exactly ONE category based on its content.

**Document text:**
{{Extract Text (OCR)}}

**Categories:**
- invoice: Bills, invoices, payment requests
- receipt: Purchase receipts, payment confirmations
- contract: Legal agreements, terms of service, NDAs
- letter: Correspondence, formal letters
- report: Business reports, analysis documents
- form: Application forms, questionnaires
- id_document: IDs, passports, licenses
- financial: Bank statements, tax documents
- medical: Medical records, prescriptions
- other: Documents that don't fit other categories

**Return JSON only:**
```json
{
  "document_type": "category_name",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation"
}
```
""",
        "config": {
            "temperature": 0.1,
            "max_tokens": 200
        }
    }
)
````
```python
# Connect OCR → Classification
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=classify_node.id,
    edge_type="data"
)
```
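Even with a "Return JSON only" instruction, LLMs sometimes wrap their answer in prose or markdown fences. A small post-processing helper can make the downstream nodes robust to that; this is a hypothetical stdlib-only sketch, not part of the workflow client API:

```python
import json
import re

def parse_llm_json(reply: str) -> dict:
    """Pull the first JSON object out of an LLM reply, tolerating
    surrounding prose or markdown fences. Hypothetical helper --
    not part of the workflow client API."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

reply = 'Sure! Here is the classification:\n{"document_type": "invoice", "confidence": 0.95}'
result = parse_llm_json(reply)
print(result["document_type"])  # invoice
```

The greedy regex spans from the first `{` to the last `}`, which is enough for single-object replies like the classification output.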
## Step 4: Add Entity Extraction Schema Dataset
Define what entities to extract for each document type.
```python
# Create extraction schema dataset
schema_dataset = client.create_dataset(
name="Document Extraction Schemas",
description="Entity extraction rules per document type"
)
schema_version = client.create_dataset_version(
dataset_id=schema_dataset.id,
name="v1"
)
# Define extraction schemas per document type
schemas = [
{
"document_type": "invoice",
"entities": [
{"name": "invoice_number", "type": "string", "description": "Unique invoice identifier"},
{"name": "invoice_date", "type": "date", "description": "Date of invoice"},
{"name": "due_date", "type": "date", "description": "Payment due date"},
{"name": "vendor_name", "type": "string", "description": "Company issuing the invoice"},
{"name": "vendor_address", "type": "string", "description": "Vendor's address"},
{"name": "customer_name", "type": "string", "description": "Bill-to company or person"},
{"name": "customer_address", "type": "string", "description": "Customer's address"},
{"name": "line_items", "type": "array", "description": "List of {description, quantity, unit_price, total}"},
{"name": "subtotal", "type": "currency", "description": "Sum before tax"},
{"name": "tax_amount", "type": "currency", "description": "Tax amount"},
{"name": "tax_rate", "type": "percentage", "description": "Tax percentage"},
{"name": "total_amount", "type": "currency", "description": "Final amount due"},
{"name": "payment_terms", "type": "string", "description": "Payment conditions"},
{"name": "bank_details", "type": "string", "description": "Payment account info"}
]
},
{
"document_type": "contract",
"entities": [
{"name": "contract_title", "type": "string", "description": "Name/title of the contract"},
{"name": "contract_date", "type": "date", "description": "Date contract was signed"},
{"name": "effective_date", "type": "date", "description": "When contract takes effect"},
{"name": "expiration_date", "type": "date", "description": "When contract expires"},
{"name": "party_1", "type": "object", "description": "{name, role, address, representative}"},
{"name": "party_2", "type": "object", "description": "{name, role, address, representative}"},
{"name": "contract_value", "type": "currency", "description": "Total contract value"},
{"name": "payment_schedule", "type": "string", "description": "Payment terms and schedule"},
{"name": "key_terms", "type": "array", "description": "Important contractual obligations"},
{"name": "termination_clause", "type": "string", "description": "Conditions for termination"},
{"name": "governing_law", "type": "string", "description": "Jurisdiction/applicable law"},
{"name": "signatures", "type": "array", "description": "List of {name, title, date}"}
]
},
{
"document_type": "receipt",
"entities": [
{"name": "merchant_name", "type": "string", "description": "Store/business name"},
{"name": "merchant_address", "type": "string", "description": "Store location"},
{"name": "transaction_date", "type": "date", "description": "Date of purchase"},
{"name": "transaction_time", "type": "time", "description": "Time of purchase"},
{"name": "items", "type": "array", "description": "List of {name, quantity, price}"},
{"name": "subtotal", "type": "currency", "description": "Sum before tax"},
{"name": "tax", "type": "currency", "description": "Tax amount"},
{"name": "total", "type": "currency", "description": "Final amount paid"},
{"name": "payment_method", "type": "string", "description": "Cash, card, etc."},
{"name": "card_last_four", "type": "string", "description": "Last 4 digits if card payment"}
]
},
{
"document_type": "letter",
"entities": [
{"name": "sender_name", "type": "string", "description": "Person/org sending the letter"},
{"name": "sender_address", "type": "string", "description": "Sender's address"},
{"name": "recipient_name", "type": "string", "description": "Person/org receiving"},
{"name": "recipient_address", "type": "string", "description": "Recipient's address"},
{"name": "date", "type": "date", "description": "Date of the letter"},
{"name": "subject", "type": "string", "description": "Subject line if present"},
{"name": "main_topic", "type": "string", "description": "What the letter is about"},
{"name": "action_requested", "type": "string", "description": "Any requested actions"},
{"name": "deadline", "type": "date", "description": "Any mentioned deadlines"}
]
},
{
"document_type": "medical",
"entities": [
{"name": "patient_name", "type": "string", "description": "Patient's full name"},
{"name": "patient_dob", "type": "date", "description": "Date of birth"},
{"name": "patient_id", "type": "string", "description": "Medical record number"},
{"name": "provider_name", "type": "string", "description": "Doctor/provider name"},
{"name": "facility_name", "type": "string", "description": "Hospital/clinic name"},
{"name": "visit_date", "type": "date", "description": "Date of visit/service"},
{"name": "diagnosis", "type": "array", "description": "List of diagnoses/conditions"},
{"name": "procedures", "type": "array", "description": "Procedures performed"},
{"name": "medications", "type": "array", "description": "List of {name, dosage, frequency}"},
{"name": "follow_up", "type": "string", "description": "Follow-up instructions"}
]
}
]
for schema in schemas:
    client.create_dataset_item(
        version_id=schema_version.id,
        data=schema
    )
```

## Step 5: Add Schema Context Node
Provide extraction schemas as context for the NER node.
```python
# Schema context node
schema_node = client.create_workflow_node(
    version_id=version.id,
    name="Extraction Schemas",
    entity_type="dataset",
    entity_id=schema_dataset.id,
    config={
        "context_config": {
            "dataset_version_id": schema_version.id,
            "field_mapping": {
                "document_type": "document_type",
                "entities": "entities"
            },
            "context_name": "schemas"
        }
    }
)
```

## Step 6: Add LLM Entity Extraction Node
Extract entities based on document type and schema.
````python
# Entity extraction node
extract_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Entities",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
You are a document data extraction specialist. Extract structured information from this document.

**Document Type:** {{Classify Document}}

**Document Text:**
{{Extract Text (OCR)}}

**Extraction Schema for this document type:**
{{#each schemas}}
{{#if (eq document_type ../Classify Document.document_type)}}
Extract these entities:
{{#each entities}}
- **{{name}}** ({{type}}): {{description}}
{{/each}}
{{/if}}
{{/each}}

**Instructions:**
1. Extract ONLY the entities defined in the schema above
2. Use null for any entity not found in the document
3. For arrays, return empty array [] if none found
4. For currency values, extract as numbers (e.g., 1500.00, not "$1,500")
5. For dates, use ISO format (YYYY-MM-DD)
6. Be precise - extract exactly what's in the document, don't infer

**Return valid JSON matching the schema:**
```json
{
  "document_type": "...",
  "extraction_confidence": 0.0-1.0,
  "entities": {
    // extracted entity values here
  },
  "extraction_notes": "any issues or ambiguities"
}
```
""",
        "config": {
            "temperature": 0.0,  # Zero temperature for consistent extraction
            "max_tokens": 4000
        }
    }
)
````
```python
# Connect OCR → Extraction
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=extract_node.id,
    edge_type="data"
)

# Connect Classification → Extraction
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=classify_node.id,
    end_node_id=extract_node.id,
    edge_type="data"
)

# Connect Schema → Extraction (context)
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=schema_node.id,
    end_node_id=extract_node.id,
    edge_type="context"
)
```
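For readers unfamiliar with the templating syntax: the `{{#each schemas}}`/`{{#if}}` block selects the schema whose `document_type` matches the classification result and renders its entities as a bullet list. The equivalent logic in plain Python (a sketch with an abbreviated schema list, not the template engine's actual implementation):

```python
# Abbreviated version of the Step 4 schema dataset
schemas = [
    {"document_type": "invoice", "entities": [
        {"name": "invoice_number", "type": "string", "description": "Unique invoice identifier"},
        {"name": "total_amount", "type": "currency", "description": "Final amount due"},
    ]},
    {"document_type": "receipt", "entities": [
        {"name": "merchant_name", "type": "string", "description": "Store/business name"},
    ]},
]

def render_schema_section(document_type: str, schemas: list) -> str:
    """Render the 'Extract these entities' portion of the prompt."""
    for schema in schemas:
        if schema["document_type"] == document_type:
            lines = ["Extract these entities:"]
            lines += [f"- **{e['name']}** ({e['type']}): {e['description']}"
                      for e in schema["entities"]]
            return "\n".join(lines)
    return "No schema defined for this document type."

print(render_schema_section("invoice", schemas))
```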
## Step 7: Add Validation Node
Validate extracted entities and flag issues.
````python
# Validation node
validate_node = client.create_workflow_node(
    version_id=version.id,
    name="Validate Extraction",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Validate the extracted data against the original document.

**Original Document Text:**
{{Extract Text (OCR)}}

**Extracted Data:**
{{Extract Entities}}

**Validation Checks:**
1. **Completeness**: Are all required fields extracted?
2. **Accuracy**: Do extracted values match the document?
3. **Format**: Are dates, currencies, numbers in correct format?
4. **Consistency**: Do calculated fields match (e.g., line items sum to subtotal)?
5. **Anomalies**: Any unusual values or potential errors?

**Return validation result as JSON:**
```json
{
  "is_valid": true|false,
  "completeness_score": 0.0-1.0,
  "accuracy_score": 0.0-1.0,
  "issues": [
    {
      "field": "field_name",
      "issue_type": "missing|incorrect|format_error|inconsistent",
      "description": "what's wrong",
      "suggested_fix": "correction if possible"
    }
  ],
  "corrected_entities": {
    // Only include fields that need correction
  },
  "requires_human_review": true|false,
  "review_reason": "why human review needed, if applicable"
}
```
""",
        "config": {
            "temperature": 0.1,
            "max_tokens": 2000
        }
    }
)
````
```python
# Connect OCR → Validation (for reference)
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=validate_node.id,
    edge_type="data"
)

# Connect Extraction → Validation
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=extract_node.id,
    end_node_id=validate_node.id,
    edge_type="data"
)
```
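Check 4 (consistency) involves arithmetic, which is safer to verify in code than to delegate to the LLM. A sketch for invoices, assuming the entity names from the Step 4 invoice schema:

```python
def check_invoice_consistency(entities: dict, tolerance: float = 0.01) -> list:
    """Cross-check computed totals against extracted ones; returns a list of issues."""
    issues = []
    items_total = sum(item["total"] for item in entities.get("line_items", []))
    if abs(items_total - entities["subtotal"]) > tolerance:
        issues.append(f"line items sum to {items_total}, but subtotal is {entities['subtotal']}")
    expected_total = entities["subtotal"] + entities["tax_amount"]
    if abs(expected_total - entities["total_amount"]) > tolerance:
        issues.append(f"subtotal + tax = {expected_total}, but total is {entities['total_amount']}")
    return issues

entities = {
    "line_items": [{"total": 1500.00}, {"total": 500.00}],
    "subtotal": 2000.00,
    "tax_amount": 160.00,
    "total_amount": 2160.00,
}
print(check_invoice_consistency(entities))  # []
```

Any issues found this way can be merged into the validation node's `issues` array before the record is stored.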
## Step 8: Add Output Dataset
Store processed documents with extracted entities.
```python
# Create output dataset
output_dataset = client.create_dataset(
name="Processed Documents",
description="Documents with extracted entities"
)
output_version = client.create_dataset_version(
dataset_id=output_dataset.id,
name="v1"
)
# Output node
output_node = client.create_workflow_node(
version_id=version.id,
name="Store Results",
entity_type="dataset",
entity_id=output_dataset.id,
config={
"output_dataset_id": output_dataset.id,
"output_version_id": output_version.id,
"column_mapping": {
"source_file": "{{input}}",
"ocr_text": "{{Extract Text (OCR)}}",
"document_type": "{{Classify Document}}",
"extracted_entities": "{{Extract Entities}}",
"validation_result": "{{Validate Extraction}}",
"processed_at": "{{timestamp}}"
}
}
)
# Connect Validation → Output
client.create_workflow_edge(
version_id=version.id,
begin_node_id=validate_node.id,
end_node_id=output_node.id,
edge_type="data"
)
```

## Step 9 (Optional): Add Graph Storage
Store entities and relationships in a knowledge graph.
````python
# Create graph for entity relationships
doc_graph = client.create_graph(
    name="Document Entities",
    description="Entities extracted from documents and their relationships"
)

# Define node types
client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Organization",
    properties=["name", "address", "type"]
)
client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Person",
    properties=["name", "role", "email"]
)
client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Document",
    properties=["type", "date", "reference_number", "amount"]
)

# Define edge types
client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="ISSUED_BY",
    from_type="Document",
    to_type="Organization"
)
client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="SENT_TO",
    from_type="Document",
    to_type="Organization"
)
client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="SIGNED_BY",
    from_type="Document",
    to_type="Person"
)

# Graph storage node
graph_node = client.create_workflow_node(
    version_id=version.id,
    name="Store in Graph",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Convert extracted document data into graph operations.

**Extracted Data:**
{{Extract Entities}}

**Document Type:** {{Classify Document}}

**Generate graph operations as JSON:**
```json
{
  "nodes": [
    {
      "type": "Organization|Person|Document",
      "id": "unique_identifier",
      "properties": {...}
    }
  ],
  "edges": [
    {
      "type": "ISSUED_BY|SENT_TO|SIGNED_BY",
      "from_id": "node_id",
      "to_id": "node_id"
    }
  ]
}
```

Rules:
- Create Organization nodes for vendors, customers, parties
- Create Person nodes for signatories, contacts
- Create Document node for the document itself
- Link with appropriate edge types
""",
        "config": {
            "temperature": 0.0,
            "max_tokens": 2000
        }
    }
)
````
```python
# Connect Extraction → Graph
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=extract_node.id,
    end_node_id=graph_node.id,
    edge_type="data"
)

# Connect Classification → Graph
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=classify_node.id,
    end_node_id=graph_node.id,
    edge_type="data"
)
```
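For reference, here is roughly what the graph-operations JSON should contain for the sample invoice from Step 2, written as a deterministic Python sketch (in the workflow, the LLM node produces this output; the helper below is hypothetical):

```python
def invoice_to_graph_ops(entities: dict) -> dict:
    """Map extracted invoice entities onto the node/edge types defined above."""
    doc_id = entities["invoice_number"]
    nodes = [
        {"type": "Document", "id": doc_id,
         "properties": {"type": "invoice", "date": entities["invoice_date"],
                        "reference_number": doc_id, "amount": entities["total_amount"]}},
        {"type": "Organization", "id": entities["vendor_name"],
         "properties": {"name": entities["vendor_name"],
                        "address": entities.get("vendor_address"), "type": "vendor"}},
        {"type": "Organization", "id": entities["customer_name"],
         "properties": {"name": entities["customer_name"],
                        "address": entities.get("customer_address"), "type": "customer"}},
    ]
    edges = [
        {"type": "ISSUED_BY", "from_id": doc_id, "to_id": entities["vendor_name"]},
        {"type": "SENT_TO", "from_id": doc_id, "to_id": entities["customer_name"]},
    ]
    return {"nodes": nodes, "edges": edges}

ops = invoice_to_graph_ops({
    "invoice_number": "INV-2024-0042",
    "invoice_date": "2024-01-15",
    "total_amount": 2160.00,
    "vendor_name": "Tech Solutions Inc.",
    "customer_name": "Acme Corporation",
})
print(len(ops["nodes"]), len(ops["edges"]))  # 3 2
```

Using stable identifiers (invoice number, organization name) as node IDs is what lets entities link up across documents.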
## Complete Workflow Code
```python
from seeme import Client
client = Client()
# --- Get your models ---
ocr_model = client.get_model("your-ocr-model-id")
llm_model = client.get_model("your-llm-model-id")
# --- Create workflow ---
workflow = client.create_workflow(
name="Document Processing Pipeline",
description="OCR → Classification → Entity Extraction → Validation → Storage"
)
version = client.create_workflow_version(workflow_id=workflow.id, name="v1")
# --- Node 1: OCR ---
ocr_node = client.create_workflow_node(
version_id=version.id,
name="Extract Text (OCR)",
entity_type="model",
entity_id=ocr_model.id,
config={
"input_template": "{{input}}",
"timeout": 120,
"config": {"languages": ["en"], "dpi": 300}
}
)
# --- Node 2: Classification ---
classify_node = client.create_workflow_node(
version_id=version.id,
name="Classify Document",
entity_type="model",
entity_id=llm_model.id,
config={
"input_template": """Classify this document into one category: invoice, receipt, contract, letter, report, form, medical, financial, other.
Document text:
{{Extract Text (OCR)}}
Return JSON: {"document_type": "...", "confidence": 0.0-1.0}""",
"config": {"temperature": 0.1, "max_tokens": 200}
}
)
# --- Node 3: Extraction Schema (context) ---
# (schema_dataset and schema_version were created in Step 4)
schema_node = client.create_workflow_node(
version_id=version.id,
name="Extraction Schemas",
entity_type="dataset",
entity_id=schema_dataset.id,
config={
"context_config": {
"dataset_version_id": schema_version.id,
"context_name": "schemas"
}
}
)
# --- Node 4: Entity Extraction ---
extract_node = client.create_workflow_node(
version_id=version.id,
name="Extract Entities",
entity_type="model",
entity_id=llm_model.id,
config={
"input_template": """Extract entities from this {{Classify Document.document_type}} document.
Document text:
{{Extract Text (OCR)}}
Schema:
{{#each schemas}}{{#if (eq document_type ../Classify Document.document_type)}}{{entities}}{{/if}}{{/each}}
Return JSON with extracted entities.""",
"config": {"temperature": 0.0, "max_tokens": 4000}
}
)
# --- Node 5: Validation ---
validate_node = client.create_workflow_node(
version_id=version.id,
name="Validate Extraction",
entity_type="model",
entity_id=llm_model.id,
config={
"input_template": """Validate extraction accuracy.
Original: {{Extract Text (OCR)}}
Extracted: {{Extract Entities}}
Return JSON: {is_valid, issues[], corrected_entities, requires_human_review}""",
"config": {"temperature": 0.1, "max_tokens": 2000}
}
)
# --- Node 6: Output Dataset ---
# (output_dataset was created in Step 8)
output_node = client.create_workflow_node(
version_id=version.id,
name="Store Results",
entity_type="dataset",
entity_id=output_dataset.id,
config={
"output_dataset_id": output_dataset.id,
"column_mapping": {
"source_file": "{{input}}",
"document_type": "{{Classify Document}}",
"extracted_entities": "{{Extract Entities}}",
"validation": "{{Validate Extraction}}"
}
}
)
# --- Connect nodes ---
edges = [
(ocr_node.id, classify_node.id, "data"),
(ocr_node.id, extract_node.id, "data"),
(classify_node.id, extract_node.id, "data"),
(schema_node.id, extract_node.id, "context"),
(ocr_node.id, validate_node.id, "data"),
(extract_node.id, validate_node.id, "data"),
(validate_node.id, output_node.id, "data"),
]
for begin_id, end_id, edge_type in edges:
client.create_workflow_edge(
version_id=version.id,
begin_node_id=begin_id,
end_node_id=end_id,
edge_type=edge_type
)
print(f"Workflow ready: {workflow.id}")
```

## Execute the Workflow

### Single Document
```python
# Process a single document
execution = client.execute_workflow(
    workflow_id=workflow.id,
    input_mode="single",
    item="./documents/invoice_001.pdf"
)

# Monitor
import time
while execution.status in ["pending", "running"]:
    time.sleep(5)
    execution = client.get_workflow_execution(
        workflow_id=workflow.id,
        execution_id=execution.id
    )
    print(f"Status: {execution.status}")

# Results
import json
print("Document Type:", execution.results["Classify Document"])
print("Extracted Entities:", json.dumps(
    execution.results["Extract Entities"],
    indent=2
))
print("Validation:", execution.results["Validate Extraction"])
```

### Batch Processing
```python
import glob
import time

# Process all documents in a folder
documents = glob.glob("./documents/**/*.pdf", recursive=True)
documents += glob.glob("./documents/**/*.jpg", recursive=True)
documents += glob.glob("./documents/**/*.png", recursive=True)
print(f"Processing {len(documents)} documents...")

execution = client.execute_workflow(
    workflow_id=workflow.id,
    input_mode="batch",
    batch_config={
        "file_paths": documents,
        "parallelism": 10,
        "continue_on_error": True
    }
)

# Monitor batch
while execution.status in ["pending", "running"]:
    time.sleep(30)
    execution = client.get_workflow_execution(
        workflow_id=workflow.id,
        execution_id=execution.id
    )
    print(f"Progress: {execution.completed}/{execution.total}")

# Summary
print(f"\nCompleted: {execution.completed}")
print(f"Failed: {execution.failed}")
print(f"Requires review: {sum(1 for r in execution.results if r.get('requires_human_review'))}")
```

## Using Post-Processors for Automatic Processing
Set up automatic processing when documents are uploaded:
```python
# Create post-processor to run workflow on uploads
# (inbox_dataset is assumed to be an existing dataset that receives uploads)
processor = client.create_post_processor(
    dataset_id=inbox_dataset.id,
    name="Auto-Process Documents",
    processor_type="workflow",
    workflow_id=workflow.id,
    enabled=True
)
# Now any document uploaded to inbox_dataset will be processed automatically
```

## Example Output
For an invoice document, the output might look like:
```json
{
  "document_type": {
    "document_type": "invoice",
    "confidence": 0.95,
    "reasoning": "Contains invoice number, line items, and total amount due"
  },
  "extracted_entities": {
    "document_type": "invoice",
    "extraction_confidence": 0.92,
    "entities": {
      "invoice_number": "INV-2024-0042",
      "invoice_date": "2024-01-15",
      "due_date": "2024-02-15",
      "vendor_name": "Tech Solutions Inc.",
      "vendor_address": "456 Tech Park, San Francisco, CA 94102",
      "customer_name": "Acme Corporation",
      "customer_address": "123 Business Street, New York, NY 10001",
      "line_items": [
        {"description": "Consulting Services", "quantity": 10, "unit_price": 150.00, "total": 1500.00},
        {"description": "Software License", "quantity": 1, "unit_price": 500.00, "total": 500.00}
      ],
      "subtotal": 2000.00,
      "tax_rate": 0.08,
      "tax_amount": 160.00,
      "total_amount": 2160.00,
      "payment_terms": "Net 30",
      "bank_details": null
    }
  },
  "validation": {
    "is_valid": true,
    "completeness_score": 0.93,
    "accuracy_score": 0.98,
    "issues": [
      {
        "field": "bank_details",
        "issue_type": "missing",
        "description": "No bank/payment details found in document"
      }
    ],
    "requires_human_review": false
  }
}
```

## Workflow Diagram
```mermaid
graph TD
    A[Document Upload] --> B[OCR: Extract Text]
    B --> C[LLM: Classify Document]
    B --> D[LLM: Extract Entities]
    C --> D
    E[Extraction Schemas] -.->|context| D
    B --> F[LLM: Validate]
    D --> F
    F --> G[Output Dataset]
    D --> H[Knowledge Graph]
    C --> H
    style A fill:#e0f2fe
    style B fill:#fef3c7
    style C fill:#dbeafe
    style D fill:#d1fae5
    style E fill:#f3f4f6
    style F fill:#fce7f3
    style G fill:#e0f2fe
    style H fill:#ede9fe
```

## Customization
### Add Language Detection
```python
lang_detect_node = client.create_workflow_node(
    version_id=version.id,
    name="Detect Language",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Detect the language of this text. Return JSON: {"language": "en|de|fr|es|...", "confidence": 0.0-1.0}

Text:
{{Extract Text (OCR)}}"""
    }
)
```

### Add Translation Step
```python
translate_node = client.create_workflow_node(
    version_id=version.id,
    name="Translate to English",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Translate this document to English. Preserve formatting.

Original ({{Detect Language.language}}):
{{Extract Text (OCR)}}

English translation:"""
    }
)
```

### Custom Document Types
Add new document types to the schema dataset:
```python
custom_schema = {
    "document_type": "purchase_order",
    "entities": [
        {"name": "po_number", "type": "string", "description": "Purchase order number"},
        {"name": "order_date", "type": "date", "description": "Date PO was created"},
        {"name": "delivery_date", "type": "date", "description": "Expected delivery date"},
        {"name": "buyer", "type": "object", "description": "{name, contact, address}"},
        {"name": "supplier", "type": "object", "description": "{name, contact, address}"},
        {"name": "items", "type": "array", "description": "List of ordered items"},
        {"name": "total_value", "type": "currency", "description": "Total PO value"},
        {"name": "shipping_terms", "type": "string", "description": "Shipping/delivery terms"},
        {"name": "payment_terms", "type": "string", "description": "Payment conditions"}
    ]
}

client.create_dataset_item(
    version_id=schema_version.id,
    data=custom_schema
)
```

## Best Practices
- Use high DPI (300+) for OCR on scanned documents
- Zero temperature for entity extraction ensures consistency
- Validate everything - LLMs can hallucinate fields
- Schema-driven extraction makes adding new document types easy
- Batch similar documents for better throughput
- Flag for human review when confidence is low or validation fails
- Store original text alongside extractions for auditing
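The human-review practice above can be enforced as a deterministic gate over the three JSON outputs rather than relying on the validation prompt alone. A sketch (field names follow the prompts in Steps 3, 6, and 7; the 0.8 threshold is an assumption to tune):

```python
def needs_human_review(classification: dict, extraction: dict, validation: dict,
                       min_confidence: float = 0.8) -> bool:
    """Route a document to a review queue when any stage looks unreliable."""
    if validation.get("requires_human_review"):
        return True
    if not validation.get("is_valid", False):
        return True
    if classification.get("confidence", 0.0) < min_confidence:
        return True
    if extraction.get("extraction_confidence", 0.0) < min_confidence:
        return True
    return False

ok = needs_human_review(
    {"document_type": "invoice", "confidence": 0.95},
    {"extraction_confidence": 0.92},
    {"is_valid": True, "requires_human_review": False},
)
print(ok)  # False
```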
## Troubleshooting
| Issue | Solution |
|---|---|
| Poor OCR quality | Increase DPI, try different OCR model, preprocess images |
| Wrong classification | Add more document type examples, improve prompts |
| Missing entities | Check schema completeness, lower confidence requirements |
| Incorrect extractions | Review OCR output, improve extraction prompts |
| Validation false positives | Tune validation prompt, accept minor format differences |