Processor Types

Detailed documentation for each post-processor type.

Speech-to-Text (STT)

Convert audio and video files to text transcriptions.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Audio Transcription",
    model_type="stt",
    model_id=stt_model.id,
    output_target="text",
    config={
        "language": "en",  # Optional: force language
        "timestamps": True  # Include timestamps
    }
)

Output Format

{
  "text": "Hello, this is the meeting transcript...",
  "segments": [
    {
      "text": "Hello, this is",
      "start": 0.0,
      "end": 1.5,
      "confidence": 0.95
    },
    {
      "text": "the meeting transcript",
      "start": 1.5,
      "end": 3.2,
      "confidence": 0.92
    }
  ],
  "language": "en",
  "duration": 125.4
}

Supported Formats

  • Audio: MP3, WAV, M4A, FLAC, OGG
  • Video: MP4, MOV, AVI, MKV (audio track extracted)
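
The segments array in the output above lends itself to simple post-processing. As a minimal sketch, this hypothetical helper (`format_segments` is not part of the client library) renders segments as timestamped lines for quick review:

```python
def format_segments(result):
    """Render STT segments as '[MM:SS] text' lines."""
    lines = []
    for seg in result["segments"]:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return "\n".join(lines)

# Sample shaped like the output format above
result = {
    "segments": [
        {"text": "Hello, this is", "start": 0.0, "end": 1.5},
        {"text": "the meeting transcript", "start": 1.5, "end": 3.2},
    ]
}
print(format_segments(result))
# [00:00] Hello, this is
# [00:01] the meeting transcript
```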

Speech Diarization

Transcribe with speaker identification.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Speaker Diarization",
    model_type="stt-diarization",
    model_id=diarization_model.id,
    output_target="text",
    config={
        "num_speakers": None,  # Auto-detect, or set number
        "min_speakers": 2,
        "max_speakers": 10
    }
)

Output Format

{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Welcome to the meeting.",
      "start": 0.0,
      "end": 2.1,
      "confidence": 0.89
    },
    {
      "speaker": "SPEAKER_01",
      "text": "Thank you for having me.",
      "start": 2.3,
      "end": 4.5,
      "confidence": 0.92
    }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"]
}
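
Diarization often splits one speaker turn across several segments; merging consecutive segments from the same speaker produces a cleaner dialogue view. A sketch (`merge_turns` is an illustrative helper, not a client method):

```python
def merge_turns(segments):
    """Merge consecutive segments from the same speaker into single turns."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            turns[-1]["text"] += " " + seg["text"]
            turns[-1]["end"] = seg["end"]
        else:
            turns.append(dict(seg))
    return turns

segments = [
    {"speaker": "SPEAKER_00", "text": "Welcome", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_00", "text": "to the meeting.", "start": 1.0, "end": 2.1},
    {"speaker": "SPEAKER_01", "text": "Thank you for having me.", "start": 2.3, "end": 4.5},
]
for turn in merge_turns(segments):
    print(f"{turn['speaker']}: {turn['text']}")
# SPEAKER_00: Welcome to the meeting.
# SPEAKER_01: Thank you for having me.
```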

Classification

Automatically classify items into categories.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document Classification",
    model_type="classification",
    model_id=classifier_model.id,
    output_target="annotations",
    confidence_threshold=0.7,
    auto_create_labels=True
)

Output Format

{
  "predictions": [
    {
      "label": "invoice",
      "confidence": 0.94
    },
    {
      "label": "receipt",
      "confidence": 0.05
    },
    {
      "label": "contract",
      "confidence": 0.01
    }
  ]
}

How It Works

  1. The model runs inference on the item
  2. Top prediction(s) above the confidence threshold are kept
  3. If auto_create_labels=True, new labels are created automatically
  4. Annotations are created linking the item to the label(s)
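
Step 2 above amounts to a simple filter over the predictions array. A sketch (`apply_threshold` is illustrative, not a client method):

```python
def apply_threshold(predictions, confidence_threshold):
    """Keep only predictions at or above the confidence threshold."""
    return [p for p in predictions if p["confidence"] >= confidence_threshold]

# Sample shaped like the output format above
predictions = [
    {"label": "invoice", "confidence": 0.94},
    {"label": "receipt", "confidence": 0.05},
    {"label": "contract", "confidence": 0.01},
]
apply_threshold(predictions, 0.7)
# [{'label': 'invoice', 'confidence': 0.94}]
```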

Object Detection

Detect and locate objects in images.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Defect Detection",
    model_type="detection",
    model_id=detector_model.id,
    output_target="annotations",
    confidence_threshold=0.5,
    auto_create_labels=True,
    config={
        "nms_threshold": 0.4,  # Non-max suppression
        "max_detections": 100
    }
)

Output Format

{
  "detections": [
    {
      "label": "scratch",
      "confidence": 0.87,
      "x": 0.25,
      "y": 0.30,
      "width": 0.15,
      "height": 0.08
    },
    {
      "label": "dent",
      "confidence": 0.72,
      "x": 0.60,
      "y": 0.55,
      "width": 0.20,
      "height": 0.18
    }
  ],
  "image_width": 1920,
  "image_height": 1080
}

Annotation Format

Bounding boxes are stored in YOLO format: x_center y_center width height (all values normalized to 0-1).
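
Since the coordinates are normalized center-based values, converting a detection to pixel coordinates uses the image dimensions reported in the output. A sketch (`yolo_to_pixels` is a hypothetical helper):

```python
def yolo_to_pixels(det, image_width, image_height):
    """Convert a normalized center-based box to (left, top, width, height) in pixels."""
    w = det["width"] * image_width
    h = det["height"] * image_height
    left = det["x"] * image_width - w / 2
    top = det["y"] * image_height - h / 2
    return (round(left), round(top), round(w), round(h))

# First detection from the sample output, with its image dimensions
det = {"label": "scratch", "x": 0.25, "y": 0.30, "width": 0.15, "height": 0.08}
yolo_to_pixels(det, 1920, 1080)
# (336, 281, 288, 86)
```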


Named Entity Recognition (NER)

Extract named entities from text.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Entity Extraction",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    auto_create_labels=True,
    config={
        "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]  # Optional filter
    }
)

Output Format

{
  "text": "John Smith from Acme Corp signed on January 15, 2024 for $50,000.",
  "entities": [
    {
      "text": "John Smith",
      "label": "PERSON",
      "start": 0,
      "end": 10,
      "confidence": 0.95
    },
    {
      "text": "Acme Corp",
      "label": "ORG",
      "start": 16,
      "end": 25,
      "confidence": 0.92
    },
    {
      "text": "January 15, 2024",
      "label": "DATE",
      "start": 36,
      "end": 52,
      "confidence": 0.98
    },
    {
      "text": "$50,000",
      "label": "MONEY",
      "start": 57,
      "end": 64,
      "confidence": 0.96
    }
  ]
}

Annotation Format

Entity positions are stored as character offsets: start end. The end offset is exclusive, as in Python slicing.
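
Because the offsets follow the half-open slicing convention, they can be validated by slicing the source text directly:

```python
# Text and entities from the sample output above
text = "John Smith from Acme Corp signed on January 15, 2024 for $50,000."
entities = [
    {"text": "John Smith", "label": "PERSON", "start": 0, "end": 10},
    {"text": "Acme Corp", "label": "ORG", "start": 16, "end": 25},
    {"text": "January 15, 2024", "label": "DATE", "start": 36, "end": 52},
    {"text": "$50,000", "label": "MONEY", "start": 57, "end": 64},
]
for ent in entities:
    # Each span reproduces the entity text exactly
    assert text[ent["start"]:ent["end"]] == ent["text"]
```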


OCR (Optical Character Recognition)

Extract text from images and PDFs.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document OCR",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",  # or "both" to also get word positions
    config={
        "languages": ["en", "de"],  # Expected languages
        "dpi": 300  # For PDF rendering
    }
)

Output Format

{
  "text": "INVOICE\n\nInvoice Number: INV-2024-001\nDate: January 15, 2024\n\nBill To:\nAcme Corporation\n123 Main Street...",
  "pages": [
    {
      "page_number": 1,
      "text": "INVOICE\n\nInvoice Number: INV-2024-001...",
      "confidence": 0.94,
      "words": [
        {
          "text": "INVOICE",
          "x": 0.4,
          "y": 0.05,
          "width": 0.2,
          "height": 0.03,
          "confidence": 0.99
        }
      ]
    }
  ]
}

Supported Formats

  • Images: JPG, PNG, TIFF, BMP
  • Documents: PDF (multi-page supported)
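
For multi-page PDFs, the per-page confidence scores in the output make it straightforward to flag pages that may need manual review. A sketch (`low_confidence_pages` is illustrative):

```python
def low_confidence_pages(result, threshold=0.9):
    """Return page numbers whose overall OCR confidence is below the threshold."""
    return [p["page_number"] for p in result["pages"] if p["confidence"] < threshold]

# Sample shaped like the output format above
result = {
    "pages": [
        {"page_number": 1, "confidence": 0.94},
        {"page_number": 2, "confidence": 0.71},
    ]
}
low_confidence_pages(result)
# [2]
```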

LLM Extraction

Use language models for custom extraction tasks.

Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Invoice Field Extraction",
    model_type="llm",
    model_id=llm_model.id,
    output_target="text",

    # Custom extraction prompt
    prompt="""
    Extract the following fields from this document:
    - invoice_number
    - date
    - vendor_name
    - total_amount
    - line_items (array of {description, quantity, price})

    Return as valid JSON only, no explanation.
    """,

    config={
        "temperature": 0.1,  # Low for consistent output
        "max_tokens": 1000
    }
)

Using External Providers

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="GPT-4 Extraction",
    model_type="llm",
    external_provider="openai",
    external_model="gpt-4-turbo",
    external_config={
        "api_key": "sk-...",
        "temperature": 0.2
    },
    prompt="Extract customer name and order details. Return JSON.",
    output_target="text"
)

Output Format

Returns whatever the LLM generates based on your prompt:

{
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "vendor_name": "Acme Supplies Inc.",
  "total_amount": 1250.00,
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 10,
      "price": 50.00
    },
    {
      "description": "Widget B",
      "quantity": 15,
      "price": 50.00
    }
  ]
}
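
Because the output is free-form LLM text, it is worth parsing defensively: models sometimes wrap JSON in a markdown code fence despite a "JSON only" instruction. A sketch (`parse_llm_json` is a hypothetical helper, not part of the client library):

```python
import json

def parse_llm_json(raw):
    """Parse LLM output as JSON, tolerating an optional markdown code fence."""
    raw = raw.strip()
    if raw.startswith("```"):
        raw = raw.strip("`")
        # Drop an optional language tag like "json" on the first line
        if "\n" in raw:
            raw = raw.split("\n", 1)[1]
    return json.loads(raw)

parse_llm_json('```json\n{"invoice_number": "INV-2024-001"}\n```')
# {'invoice_number': 'INV-2024-001'}
```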

Type Comparison

| Type | Input | Output | Auto Labels | Best For |
| --- | --- | --- | --- | --- |
| stt | Audio/Video | Text | No | Transcription |
| stt-diarization | Audio | Text + Speakers | No | Meeting recordings |
| classification | Any | Labels | Yes | Categorization |
| detection | Images | Bounding boxes | Yes | Object location |
| ner | Text | Entity spans | Yes | Information extraction |
| ocr | Images/PDF | Text | No | Document digitization |
| llm | Any | Custom JSON | No | Complex extraction |

Choosing the Right Type

graph TD
    A{What's your input?} --> B[Audio/Video]
    A --> C[Images]
    A --> D[Text]
    A --> E[Documents/PDF]

    B --> F{Need speakers?}
    F -->|Yes| G[stt-diarization]
    F -->|No| H[stt]

    C --> I{What do you need?}
    I -->|Categorize| J[classification]
    I -->|Find objects| K[detection]
    I -->|Extract text| L[ocr]

    D --> M{What do you need?}
    M -->|Categorize| J
    M -->|Find entities| N[ner]
    M -->|Custom extraction| O[llm]

    E --> P[ocr → then ner or llm]

Next Steps