# Document Processing Pipeline

Build a workflow that processes uploaded documents through OCR, LLM-based entity extraction, and classification, then stores the structured results.

## What You’ll Build

```mermaid
graph TD
    subgraph "Ingestion"
        A[Document Upload<br/>PDF, Image, Scan] --> B[OCR<br/>Extract Text]
    end
    subgraph "Extraction"
        B --> C[LLM: Extract Entities<br/>Names, Dates, Amounts]
        B --> D[LLM: Classify Document<br/>Invoice, Contract, Letter]
    end
    subgraph "Enrichment"
        C --> E[LLM: Validate & Structure<br/>JSON Output]
        D --> E
    end
    subgraph "Storage"
        E --> F[Output Dataset<br/>Structured Records]
        E --> G[Knowledge Graph<br/>Entity Relationships]
    end
```

Use cases:

  • Invoice and receipt processing
  • Contract analysis and extraction
  • Medical records processing
  • Legal document discovery
  • Insurance claims processing
  • HR document digitization

## Prerequisites

| Component | Description | Example |
|---|---|---|
| OCR Model | Text extraction from images/PDFs | Tesseract, PaddleOCR, Azure Document Intelligence |
| LLM Model | For NER and classification | Ollama (Llama, Mistral), GPT-4, Claude |
| Output Dataset | Structured results storage | Dataset with entity columns |
| Graph (optional) | Entity relationship storage | For linking entities across documents |

## Step 1: Create the Workflow

Create the workflow and an initial version to attach nodes to; the `create_workflow` and `create_workflow_version` calls are shown in full in the Complete Workflow Code section.

## Step 2: Add OCR Node

Extract text from uploaded documents (PDFs, images, scans).

```python
# OCR node - extracts text from documents
ocr_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Text (OCR)",
    entity_type="model",
    entity_id=ocr_model.id,
    config={
        "input_template": "{{input}}",
        "timeout": 120,
        "config": {
            "languages": ["en"],  # Add more languages as needed
            "dpi": 300,           # Higher DPI for better accuracy
            "output_format": "text_with_positions"  # Include word positions
        }
    }
)
```

Expected output:

```json
{
  "text": "INVOICE\n\nInvoice Number: INV-2024-0042\nDate: January 15, 2024\n\nBill To:\nAcme Corporation\n123 Business Street\nNew York, NY 10001\n\nDescription          Qty    Price    Total\nConsulting Services   10    $150    $1,500\nSoftware License       1    $500      $500\n\nSubtotal: $2,000\nTax (8%): $160\nTotal Due: $2,160",
  "pages": 1,
  "confidence": 0.94,
  "words": [
    {"text": "INVOICE", "x": 0.4, "y": 0.05, "confidence": 0.99},
    ...
  ]
}
```
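The per-word confidences are worth inspecting before the text reaches the LLM. A quick pure-Python check (field names taken from the sample payload above; the 0.5 threshold and the sample values are illustrative) flags words the OCR model was unsure about:

```python
# Flag OCR words whose confidence falls below a review threshold.
ocr_result = {
    "text": "INVOICE\n\nInvoice Number: INV-2024-0042",
    "pages": 1,
    "confidence": 0.94,
    "words": [
        {"text": "INVOICE", "x": 0.4, "y": 0.05, "confidence": 0.99},
        {"text": "INV-2024-0042", "x": 0.6, "y": 0.10, "confidence": 0.41},
    ],
}

def low_confidence_words(result: dict, threshold: float = 0.5) -> list[str]:
    """Return the text of words OCR was unsure about."""
    return [w["text"] for w in result["words"] if w["confidence"] < threshold]

print(low_confidence_words(ocr_result))  # ['INV-2024-0042']
```

Words flagged here (often reference numbers and amounts, exactly the fields you care about) are good candidates for re-scanning at higher DPI or routing to human review.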

## Step 3: Add Document Classification Node

Classify the document type to determine extraction rules.

```python
# Classification node
classify_node = client.create_workflow_node(
    version_id=version.id,
    name="Classify Document",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Classify this document into exactly ONE category based on its content.

**Document text:**
{{Extract Text (OCR)}}

**Categories:**
- invoice: Bills, invoices, payment requests
- receipt: Purchase receipts, payment confirmations
- contract: Legal agreements, terms of service, NDAs
- letter: Correspondence, formal letters
- report: Business reports, analysis documents
- form: Application forms, questionnaires
- id_document: IDs, passports, licenses
- financial: Bank statements, tax documents
- medical: Medical records, prescriptions
- other: Documents that don't fit other categories

**Return JSON only:**
{
  "document_type": "category_name",
  "confidence": 0.0-1.0,
  "reasoning": "brief explanation"
}
""",
        "config": {"temperature": 0.1, "max_tokens": 200}
    }
)
```

```python
# Connect OCR → Classification
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=classify_node.id,
    edge_type="data"
)
```
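Even with a "Return JSON only" instruction, LLMs occasionally wrap the JSON in prose or invent an out-of-vocabulary label. A small illustrative parser (the category list mirrors the prompt above; the fallback to `"other"` is an assumption, not part of the workflow API) keeps downstream nodes safe:

```python
import json

# Allowed labels, mirroring the classification prompt.
CATEGORIES = {
    "invoice", "receipt", "contract", "letter", "report",
    "form", "id_document", "financial", "medical", "other",
}

def parse_classification(reply: str) -> dict:
    """Extract the JSON object from a model reply and validate the label."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end == -1:
        return {"document_type": "other", "confidence": 0.0}
    data = json.loads(reply[start:end + 1])
    if data.get("document_type") not in CATEGORIES:
        data["document_type"] = "other"  # fall back rather than crash
    return data

reply = 'Sure! {"document_type": "invoice", "confidence": 0.95}'
print(parse_classification(reply)["document_type"])  # invoice
```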


## Step 4: Add Entity Extraction Schema Dataset

Define what entities to extract for each document type.

```python
# Create extraction schema dataset
schema_dataset = client.create_dataset(
    name="Document Extraction Schemas",
    description="Entity extraction rules per document type"
)

schema_version = client.create_dataset_version(
    dataset_id=schema_dataset.id,
    name="v1"
)

# Define extraction schemas per document type
schemas = [
    {
        "document_type": "invoice",
        "entities": [
            {"name": "invoice_number", "type": "string", "description": "Unique invoice identifier"},
            {"name": "invoice_date", "type": "date", "description": "Date of invoice"},
            {"name": "due_date", "type": "date", "description": "Payment due date"},
            {"name": "vendor_name", "type": "string", "description": "Company issuing the invoice"},
            {"name": "vendor_address", "type": "string", "description": "Vendor's address"},
            {"name": "customer_name", "type": "string", "description": "Bill-to company or person"},
            {"name": "customer_address", "type": "string", "description": "Customer's address"},
            {"name": "line_items", "type": "array", "description": "List of {description, quantity, unit_price, total}"},
            {"name": "subtotal", "type": "currency", "description": "Sum before tax"},
            {"name": "tax_amount", "type": "currency", "description": "Tax amount"},
            {"name": "tax_rate", "type": "percentage", "description": "Tax percentage"},
            {"name": "total_amount", "type": "currency", "description": "Final amount due"},
            {"name": "payment_terms", "type": "string", "description": "Payment conditions"},
            {"name": "bank_details", "type": "string", "description": "Payment account info"}
        ]
    },
    {
        "document_type": "contract",
        "entities": [
            {"name": "contract_title", "type": "string", "description": "Name/title of the contract"},
            {"name": "contract_date", "type": "date", "description": "Date contract was signed"},
            {"name": "effective_date", "type": "date", "description": "When contract takes effect"},
            {"name": "expiration_date", "type": "date", "description": "When contract expires"},
            {"name": "party_1", "type": "object", "description": "{name, role, address, representative}"},
            {"name": "party_2", "type": "object", "description": "{name, role, address, representative}"},
            {"name": "contract_value", "type": "currency", "description": "Total contract value"},
            {"name": "payment_schedule", "type": "string", "description": "Payment terms and schedule"},
            {"name": "key_terms", "type": "array", "description": "Important contractual obligations"},
            {"name": "termination_clause", "type": "string", "description": "Conditions for termination"},
            {"name": "governing_law", "type": "string", "description": "Jurisdiction/applicable law"},
            {"name": "signatures", "type": "array", "description": "List of {name, title, date}"}
        ]
    },
    {
        "document_type": "receipt",
        "entities": [
            {"name": "merchant_name", "type": "string", "description": "Store/business name"},
            {"name": "merchant_address", "type": "string", "description": "Store location"},
            {"name": "transaction_date", "type": "date", "description": "Date of purchase"},
            {"name": "transaction_time", "type": "time", "description": "Time of purchase"},
            {"name": "items", "type": "array", "description": "List of {name, quantity, price}"},
            {"name": "subtotal", "type": "currency", "description": "Sum before tax"},
            {"name": "tax", "type": "currency", "description": "Tax amount"},
            {"name": "total", "type": "currency", "description": "Final amount paid"},
            {"name": "payment_method", "type": "string", "description": "Cash, card, etc."},
            {"name": "card_last_four", "type": "string", "description": "Last 4 digits if card payment"}
        ]
    },
    {
        "document_type": "letter",
        "entities": [
            {"name": "sender_name", "type": "string", "description": "Person/org sending the letter"},
            {"name": "sender_address", "type": "string", "description": "Sender's address"},
            {"name": "recipient_name", "type": "string", "description": "Person/org receiving"},
            {"name": "recipient_address", "type": "string", "description": "Recipient's address"},
            {"name": "date", "type": "date", "description": "Date of the letter"},
            {"name": "subject", "type": "string", "description": "Subject line if present"},
            {"name": "main_topic", "type": "string", "description": "What the letter is about"},
            {"name": "action_requested", "type": "string", "description": "Any requested actions"},
            {"name": "deadline", "type": "date", "description": "Any mentioned deadlines"}
        ]
    },
    {
        "document_type": "medical",
        "entities": [
            {"name": "patient_name", "type": "string", "description": "Patient's full name"},
            {"name": "patient_dob", "type": "date", "description": "Date of birth"},
            {"name": "patient_id", "type": "string", "description": "Medical record number"},
            {"name": "provider_name", "type": "string", "description": "Doctor/provider name"},
            {"name": "facility_name", "type": "string", "description": "Hospital/clinic name"},
            {"name": "visit_date", "type": "date", "description": "Date of visit/service"},
            {"name": "diagnosis", "type": "array", "description": "List of diagnoses/conditions"},
            {"name": "procedures", "type": "array", "description": "Procedures performed"},
            {"name": "medications", "type": "array", "description": "List of {name, dosage, frequency}"},
            {"name": "follow_up", "type": "string", "description": "Follow-up instructions"}
        ]
    }
]

for schema in schemas:
    client.create_dataset_item(
        version_id=schema_version.id,
        data=schema
    )
```

## Step 5: Add Schema Context Node

Provide extraction schemas as context for the NER node.

```python
# Schema context node
schema_node = client.create_workflow_node(
    version_id=version.id,
    name="Extraction Schemas",
    entity_type="dataset",
    entity_id=schema_dataset.id,
    config={
        "context_config": {
            "dataset_version_id": schema_version.id,
            "field_mapping": {
                "document_type": "document_type",
                "entities": "entities"
            },
            "context_name": "schemas"
        }
    }
)
```
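Inside the workflow engine, the `{{#each schemas}}` template in the extraction node selects the schema matching the classified type. The equivalent lookup in plain Python (a sketch using abbreviated schema records from Step 4; the fallback text is illustrative) looks like:

```python
# Abbreviated schema records, in the shape stored in Step 4's dataset.
schemas = [
    {"document_type": "invoice", "entities": [
        {"name": "invoice_number", "type": "string", "description": "Unique invoice identifier"},
        {"name": "total_amount", "type": "currency", "description": "Final amount due"},
    ]},
    {"document_type": "receipt", "entities": [
        {"name": "merchant_name", "type": "string", "description": "Store/business name"},
    ]},
]

def render_schema_prompt(document_type: str) -> str:
    """Render the extraction instructions for a classified document type."""
    for schema in schemas:
        if schema["document_type"] == document_type:
            lines = [f"- **{e['name']}** ({e['type']}): {e['description']}"
                     for e in schema["entities"]]
            return "Extract these entities:\n" + "\n".join(lines)
    return "Extract these entities:\n- none defined for this type"

print(render_schema_prompt("invoice"))
```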

## Step 6: Add LLM Entity Extraction Node

Extract entities based on document type and schema.

```python
# Entity extraction node
extract_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Entities",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
You are a document data extraction specialist. Extract structured information from this document.

**Document Type:** {{Classify Document}}

**Document Text:**
{{Extract Text (OCR)}}

**Extraction Schema for this document type:**
{{#each schemas}}
{{#if (eq document_type ../Classify Document.document_type)}}
Extract these entities:
{{#each entities}}
- **{{name}}** ({{type}}): {{description}}
{{/each}}
{{/if}}
{{/each}}

**Instructions:**
1. Extract ONLY the entities defined in the schema above
2. Use null for any entity not found in the document
3. For arrays, return empty array [] if none found
4. For currency values, extract as numbers (e.g., 1500.00, not "$1,500")
5. For dates, use ISO format (YYYY-MM-DD)
6. Be precise - extract exactly what's in the document, don't infer

**Return valid JSON matching the schema:**
{
  "document_type": "...",
  "extraction_confidence": 0.0-1.0,
  "entities": {
    // extracted entity values here
  },
  "extraction_notes": "any issues or ambiguities"
}
""",
        "config": {
            "temperature": 0.0,  # Zero temperature for consistent extraction
            "max_tokens": 4000
        }
    }
)
```

```python
# Connect OCR → Extraction
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=extract_node.id,
    edge_type="data"
)

# Connect Classification → Extraction
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=classify_node.id,
    end_node_id=extract_node.id,
    edge_type="data"
)

# Connect Schema → Extraction (context)
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=schema_node.id,
    end_node_id=extract_node.id,
    edge_type="context"
)
```


## Step 7: Add Validation Node

Validate extracted entities and flag issues.

```python
# Validation node
validate_node = client.create_workflow_node(
    version_id=version.id,
    name="Validate Extraction",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Validate the extracted data against the original document.

**Original Document Text:**
{{Extract Text (OCR)}}

**Extracted Data:**
{{Extract Entities}}

**Validation Checks:**
1. **Completeness**: Are all required fields extracted?
2. **Accuracy**: Do extracted values match the document?
3. **Format**: Are dates, currencies, numbers in correct format?
4. **Consistency**: Do calculated fields match (e.g., line items sum to subtotal)?
5. **Anomalies**: Any unusual values or potential errors?

**Return validation result as JSON:**
{
  "is_valid": true|false,
  "completeness_score": 0.0-1.0,
  "accuracy_score": 0.0-1.0,
  "issues": [
    {
      "field": "field_name",
      "issue_type": "missing|incorrect|format_error|inconsistent",
      "description": "what's wrong",
      "suggested_fix": "correction if possible"
    }
  ],
  "corrected_entities": {
    // Only include fields that need correction
  },
  "requires_human_review": true|false,
  "review_reason": "why human review needed, if applicable"
}
""",
        "config": {"temperature": 0.1, "max_tokens": 2000}
    }
)
```

```python
# Connect OCR → Validation (for reference)
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=ocr_node.id,
    end_node_id=validate_node.id,
    edge_type="data"
)

# Connect Extraction → Validation
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=extract_node.id,
    end_node_id=validate_node.id,
    edge_type="data"
)
```
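Check 4 (consistency) is the one an LLM validator is most likely to get wrong; totals are cheap to verify deterministically before trusting the verdict. A sketch against the invoice fields from Step 4's schema (the `tolerance` and the sample values are illustrative):

```python
def check_invoice_math(entities: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of arithmetic inconsistencies in an extracted invoice."""
    issues = []
    items = entities.get("line_items") or []
    items_sum = sum(i["total"] for i in items)
    subtotal = entities.get("subtotal")
    if subtotal is not None and abs(items_sum - subtotal) > tolerance:
        issues.append(f"line items sum {items_sum} != subtotal {subtotal}")
    total = entities.get("total_amount")
    tax = entities.get("tax_amount") or 0
    if subtotal is not None and total is not None and abs(subtotal + tax - total) > tolerance:
        issues.append(f"subtotal + tax {subtotal + tax} != total {total}")
    return issues

entities = {
    "line_items": [{"total": 1500.00}, {"total": 500.00}],
    "subtotal": 2000.00, "tax_amount": 160.00, "total_amount": 2160.00,
}
print(check_invoice_math(entities))  # []
```

Any issue found here can be appended to the validation node's `issues` list, or used to force `requires_human_review`.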


## Step 8: Add Output Dataset

Store processed documents with extracted entities.

```python
# Create output dataset
output_dataset = client.create_dataset(
    name="Processed Documents",
    description="Documents with extracted entities"
)

output_version = client.create_dataset_version(
    dataset_id=output_dataset.id,
    name="v1"
)

# Output node
output_node = client.create_workflow_node(
    version_id=version.id,
    name="Store Results",
    entity_type="dataset",
    entity_id=output_dataset.id,
    config={
        "output_dataset_id": output_dataset.id,
        "output_version_id": output_version.id,
        "column_mapping": {
            "source_file": "{{input}}",
            "ocr_text": "{{Extract Text (OCR)}}",
            "document_type": "{{Classify Document}}",
            "extracted_entities": "{{Extract Entities}}",
            "validation_result": "{{Validate Extraction}}",
            "processed_at": "{{timestamp}}"
        }
    }
)

# Connect Validation → Output
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=validate_node.id,
    end_node_id=output_node.id,
    edge_type="data"
)
```

## Step 9 (Optional): Add Graph Storage

Store entities and relationships in a knowledge graph.

```python
# Create graph for entity relationships
doc_graph = client.create_graph(
    name="Document Entities",
    description="Entities extracted from documents and their relationships"
)

# Define node types
client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Organization",
    properties=["name", "address", "type"]
)

client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Person",
    properties=["name", "role", "email"]
)

client.create_graph_node_type(
    graph_id=doc_graph.id,
    name="Document",
    properties=["type", "date", "reference_number", "amount"]
)

# Define edge types
client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="ISSUED_BY",
    from_type="Document",
    to_type="Organization"
)

client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="SENT_TO",
    from_type="Document",
    to_type="Organization"
)

client.create_graph_edge_type(
    graph_id=doc_graph.id,
    name="SIGNED_BY",
    from_type="Document",
    to_type="Person"
)

# Graph storage node
graph_node = client.create_workflow_node(
    version_id=version.id,
    name="Store in Graph",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """
Convert extracted document data into graph operations.

**Extracted Data:**
{{Extract Entities}}

**Document Type:** {{Classify Document}}

**Generate graph operations as JSON:**
{
  "nodes": [
    {
      "type": "Organization|Person|Document",
      "id": "unique_identifier",
      "properties": {...}
    }
  ],
  "edges": [
    {
      "type": "ISSUED_BY|SENT_TO|SIGNED_BY",
      "from_id": "node_id",
      "to_id": "node_id"
    }
  ]
}

Rules:
- Create Organization nodes for vendors, customers, parties
- Create Person nodes for signatories, contacts
- Create a Document node for the document itself
- Link with appropriate edge types
""",
        "config": {"temperature": 0.0, "max_tokens": 2000}
    }
)
```

```python
# Connect Extraction → Graph
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=extract_node.id,
    end_node_id=graph_node.id,
    edge_type="data"
)

# Connect Classification → Graph
client.create_workflow_edge(
    version_id=version.id,
    begin_node_id=classify_node.id,
    end_node_id=graph_node.id,
    edge_type="data"
)
```


## Complete Workflow Code

```python
from seeme import Client

client = Client()

# --- Get your models ---
ocr_model = client.get_model("your-ocr-model-id")
llm_model = client.get_model("your-llm-model-id")

# NOTE: assumes schema_dataset/schema_version (Step 4) and output_dataset
# (Step 8) have already been created.

# --- Create workflow ---
workflow = client.create_workflow(
    name="Document Processing Pipeline",
    description="OCR → Classification → Entity Extraction → Validation → Storage"
)
version = client.create_workflow_version(workflow_id=workflow.id, name="v1")

# --- Node 1: OCR ---
ocr_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Text (OCR)",
    entity_type="model",
    entity_id=ocr_model.id,
    config={
        "input_template": "{{input}}",
        "timeout": 120,
        "config": {"languages": ["en"], "dpi": 300}
    }
)

# --- Node 2: Classification ---
classify_node = client.create_workflow_node(
    version_id=version.id,
    name="Classify Document",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Classify this document into one category: invoice, receipt, contract, letter, report, form, medical, financial, other.

Document text:
{{Extract Text (OCR)}}

Return JSON: {"document_type": "...", "confidence": 0.0-1.0}""",
        "config": {"temperature": 0.1, "max_tokens": 200}
    }
)

# --- Node 3: Extraction Schema (context) ---
schema_node = client.create_workflow_node(
    version_id=version.id,
    name="Extraction Schemas",
    entity_type="dataset",
    entity_id=schema_dataset.id,
    config={
        "context_config": {
            "dataset_version_id": schema_version.id,
            "context_name": "schemas"
        }
    }
)

# --- Node 4: Entity Extraction ---
extract_node = client.create_workflow_node(
    version_id=version.id,
    name="Extract Entities",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Extract entities from this {{Classify Document.document_type}} document.

Document text:
{{Extract Text (OCR)}}

Schema:
{{#each schemas}}{{#if (eq document_type ../Classify Document.document_type)}}{{entities}}{{/if}}{{/each}}

Return JSON with extracted entities.""",
        "config": {"temperature": 0.0, "max_tokens": 4000}
    }
)

# --- Node 5: Validation ---
validate_node = client.create_workflow_node(
    version_id=version.id,
    name="Validate Extraction",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Validate extraction accuracy.

Original: {{Extract Text (OCR)}}
Extracted: {{Extract Entities}}

Return JSON: {is_valid, issues[], corrected_entities, requires_human_review}""",
        "config": {"temperature": 0.1, "max_tokens": 2000}
    }
)

# --- Node 6: Output Dataset ---
output_node = client.create_workflow_node(
    version_id=version.id,
    name="Store Results",
    entity_type="dataset",
    entity_id=output_dataset.id,
    config={
        "output_dataset_id": output_dataset.id,
        "column_mapping": {
            "source_file": "{{input}}",
            "document_type": "{{Classify Document}}",
            "extracted_entities": "{{Extract Entities}}",
            "validation": "{{Validate Extraction}}"
        }
    }
)

# --- Connect nodes ---
edges = [
    (ocr_node.id, classify_node.id, "data"),
    (ocr_node.id, extract_node.id, "data"),
    (classify_node.id, extract_node.id, "data"),
    (schema_node.id, extract_node.id, "context"),
    (ocr_node.id, validate_node.id, "data"),
    (extract_node.id, validate_node.id, "data"),
    (validate_node.id, output_node.id, "data"),
]

for begin_id, end_id, edge_type in edges:
    client.create_workflow_edge(
        version_id=version.id,
        begin_node_id=begin_id,
        end_node_id=end_id,
        edge_type=edge_type
    )

print(f"Workflow ready: {workflow.id}")
```

## Execute the Workflow

### Single Document

```python
# Process a single document
execution = client.execute_workflow(
    workflow_id=workflow.id,
    input_mode="single",
    item="./documents/invoice_001.pdf"
)

# Monitor
import time
while execution.status in ["pending", "running"]:
    time.sleep(5)
    execution = client.get_workflow_execution(
        workflow_id=workflow.id,
        execution_id=execution.id
    )
    print(f"Status: {execution.status}")

# Results
import json
print("Document Type:", execution.results["Classify Document"])
print("Extracted Entities:", json.dumps(
    execution.results["Extract Entities"],
    indent=2
))
print("Validation:", execution.results["Validate Extraction"])
```

### Batch Processing

```python
import glob
import time

# Process all documents in a folder
documents = glob.glob("./documents/**/*.pdf", recursive=True)
documents += glob.glob("./documents/**/*.jpg", recursive=True)
documents += glob.glob("./documents/**/*.png", recursive=True)

print(f"Processing {len(documents)} documents...")

execution = client.execute_workflow(
    workflow_id=workflow.id,
    input_mode="batch",
    batch_config={
        "file_paths": documents,
        "parallelism": 10,
        "continue_on_error": True
    }
)

# Monitor batch
while execution.status in ["pending", "running"]:
    time.sleep(30)
    execution = client.get_workflow_execution(
        workflow_id=workflow.id,
        execution_id=execution.id
    )
    print(f"Progress: {execution.completed}/{execution.total}")

# Summary
print(f"\nCompleted: {execution.completed}")
print(f"Failed: {execution.failed}")
print(f"Requires review: {sum(1 for r in execution.results if r.get('requires_human_review'))}")
```
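The `requires_human_review` flag from the validation node drives the summary above; splitting a batch on it to build a review queue is straightforward. An illustrative sketch, assuming each result dict carries the flag (the field names mirror the output dataset mapping):

```python
def split_for_review(results: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition batch results into auto-accepted and human-review queues."""
    auto, review = [], []
    for r in results:
        (review if r.get("requires_human_review") else auto).append(r)
    return auto, review

results = [
    {"source_file": "invoice_001.pdf", "requires_human_review": False},
    {"source_file": "contract_007.pdf", "requires_human_review": True},
]
auto, review = split_for_review(results)
print(len(auto), len(review))  # 1 1
```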

## Using Post-Processors for Automatic Processing

Set up automatic processing when documents are uploaded:

```python
# Create post-processor to run workflow on uploads
# (inbox_dataset is the dataset that receives raw uploads - create it first)
processor = client.create_post_processor(
    dataset_id=inbox_dataset.id,
    name="Auto-Process Documents",
    processor_type="workflow",
    workflow_id=workflow.id,
    enabled=True
)

# Now any document uploaded to inbox_dataset will be processed automatically
```

## Example Output

For an invoice document, the output might look like:

```json
{
  "document_type": {
    "document_type": "invoice",
    "confidence": 0.95,
    "reasoning": "Contains invoice number, line items, and total amount due"
  },
  "extracted_entities": {
    "document_type": "invoice",
    "extraction_confidence": 0.92,
    "entities": {
      "invoice_number": "INV-2024-0042",
      "invoice_date": "2024-01-15",
      "due_date": "2024-02-15",
      "vendor_name": "Tech Solutions Inc.",
      "vendor_address": "456 Tech Park, San Francisco, CA 94102",
      "customer_name": "Acme Corporation",
      "customer_address": "123 Business Street, New York, NY 10001",
      "line_items": [
        {"description": "Consulting Services", "quantity": 10, "unit_price": 150.00, "total": 1500.00},
        {"description": "Software License", "quantity": 1, "unit_price": 500.00, "total": 500.00}
      ],
      "subtotal": 2000.00,
      "tax_rate": 0.08,
      "tax_amount": 160.00,
      "total_amount": 2160.00,
      "payment_terms": "Net 30",
      "bank_details": null
    }
  },
  "validation": {
    "is_valid": true,
    "completeness_score": 0.93,
    "accuracy_score": 0.98,
    "issues": [
      {
        "field": "bank_details",
        "issue_type": "missing",
        "description": "No bank/payment details found in document"
      }
    ],
    "requires_human_review": false
  }
}
```

## Workflow Diagram

```mermaid
graph TD
    A[Document Upload] --> B[OCR: Extract Text]
    B --> C[LLM: Classify Document]
    B --> D[LLM: Extract Entities]
    C --> D
    E[Extraction Schemas] -.->|context| D
    B --> F[LLM: Validate]
    D --> F
    F --> G[Output Dataset]
    D --> H[Knowledge Graph]
    C --> H

    style A fill:#e0f2fe
    style B fill:#fef3c7
    style C fill:#dbeafe
    style D fill:#d1fae5
    style E fill:#f3f4f6
    style F fill:#fce7f3
    style G fill:#e0f2fe
    style H fill:#ede9fe
```

## Customization

### Add Language Detection

```python
lang_detect_node = client.create_workflow_node(
    version_id=version.id,
    name="Detect Language",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Detect the language of this text. Return JSON: {"language": "en|de|fr|es|...", "confidence": 0.0-1.0}

Text:
{{Extract Text (OCR)}}"""
    }
)
```

### Add Translation Step

```python
translate_node = client.create_workflow_node(
    version_id=version.id,
    name="Translate to English",
    entity_type="model",
    entity_id=llm_model.id,
    config={
        "input_template": """Translate this document to English. Preserve formatting.

Original ({{Detect Language.language}}):
{{Extract Text (OCR)}}

English translation:"""
    }
)
```

### Custom Document Types

Add new document types to the schema dataset:

```python
custom_schema = {
    "document_type": "purchase_order",
    "entities": [
        {"name": "po_number", "type": "string", "description": "Purchase order number"},
        {"name": "order_date", "type": "date", "description": "Date PO was created"},
        {"name": "delivery_date", "type": "date", "description": "Expected delivery date"},
        {"name": "buyer", "type": "object", "description": "{name, contact, address}"},
        {"name": "supplier", "type": "object", "description": "{name, contact, address}"},
        {"name": "items", "type": "array", "description": "List of ordered items"},
        {"name": "total_value", "type": "currency", "description": "Total PO value"},
        {"name": "shipping_terms", "type": "string", "description": "Shipping/delivery terms"},
        {"name": "payment_terms", "type": "string", "description": "Payment conditions"}
    ]
}

client.create_dataset_item(
    version_id=schema_version.id,
    data=custom_schema
)
```

## Best Practices

  1. Use high DPI (300+) for OCR on scanned documents
  2. Zero temperature for entity extraction ensures consistency
  3. Validate everything - LLMs can hallucinate fields
  4. Schema-driven extraction makes adding new document types easy
  5. Batch similar documents for better throughput
  6. Flag for human review when confidence is low or validation fails
  7. Store original text alongside extractions for auditing

## Troubleshooting

| Issue | Solution |
|---|---|
| Poor OCR quality | Increase DPI, try a different OCR model, preprocess images |
| Wrong classification | Add more document type examples, improve prompts |
| Missing entities | Check schema completeness, lower confidence requirements |
| Incorrect extractions | Review OCR output, improve extraction prompts |
| Validation false positives | Tune the validation prompt, accept minor format differences |

## Related Guides