Processor Types
Detailed documentation for each post-processor type.
Speech-to-Text (STT)
Convert audio and video files to text transcriptions.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Audio Transcription",
    model_type="stt",
    model_id=stt_model.id,
    output_target="text",
    config={
        "language": "en",    # Optional: force language
        "timestamps": True   # Include timestamps
    }
)
```

Output Format
```json
{
  "text": "Hello, this is the meeting transcript...",
  "segments": [
    {
      "text": "Hello, this is",
      "start": 0.0,
      "end": 1.5,
      "confidence": 0.95
    },
    {
      "text": "the meeting transcript",
      "start": 1.5,
      "end": 3.2,
      "confidence": 0.92
    }
  ],
  "language": "en",
  "duration": 125.4
}
```

Supported Formats
- Audio: MP3, WAV, M4A, FLAC, OGG
- Video: MP4, MOV, AVI, MKV (audio track extracted)
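As a consumption sketch (not part of the SDK), the transcript payload above can be filtered on segment confidence; `result` is a hypothetical parsed output copied from the format shown earlier, and the 0.9 cutoff is arbitrary:

```python
# Hypothetical parsed STT output, matching the Output Format above.
result = {
    "text": "Hello, this is the meeting transcript...",
    "segments": [
        {"text": "Hello, this is", "start": 0.0, "end": 1.5, "confidence": 0.95},
        {"text": "the meeting transcript", "start": 1.5, "end": 3.2, "confidence": 0.92},
    ],
    "language": "en",
    "duration": 125.4,
}

# Drop segments below an arbitrary confidence cutoff.
confident = [s for s in result["segments"] if s["confidence"] >= 0.9]

# Render "[start-end] text" lines for quick review.
for seg in confident:
    print(f"[{seg['start']:.1f}-{seg['end']:.1f}] {seg['text']}")
```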
Speech Diarization
Transcribe with speaker identification.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Speaker Diarization",
    model_type="stt-diarization",
    model_id=diarization_model.id,
    output_target="text",
    config={
        "num_speakers": None,  # Auto-detect, or set a number
        "min_speakers": 2,
        "max_speakers": 10
    }
)
```

Output Format
```json
{
  "segments": [
    {
      "speaker": "SPEAKER_00",
      "text": "Welcome to the meeting.",
      "start": 0.0,
      "end": 2.1,
      "confidence": 0.89
    },
    {
      "speaker": "SPEAKER_01",
      "text": "Thank you for having me.",
      "start": 2.3,
      "end": 4.5,
      "confidence": 0.92
    }
  ],
  "speakers": ["SPEAKER_00", "SPEAKER_01"]
}
```

Classification
Automatically classify items into categories.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document Classification",
    model_type="classification",
    model_id=classifier_model.id,
    output_target="annotations",
    confidence_threshold=0.7,
    auto_create_labels=True
)
```

Output Format
```json
{
  "predictions": [
    {
      "label": "invoice",
      "confidence": 0.94
    },
    {
      "label": "receipt",
      "confidence": 0.05
    },
    {
      "label": "contract",
      "confidence": 0.01
    }
  ]
}
```

How It Works
- The model runs inference on each item
- Top prediction(s) above the confidence threshold are kept
- If `auto_create_labels=True`, new labels are created automatically
- Annotations are created linking the item to the label(s)
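The threshold step above can be sketched in plain Python — `predictions` is the payload shown earlier, and 0.7 mirrors the `confidence_threshold` from the configuration example:

```python
# The prediction payload from the Output Format example.
predictions = [
    {"label": "invoice", "confidence": 0.94},
    {"label": "receipt", "confidence": 0.05},
    {"label": "contract", "confidence": 0.01},
]

# Keep only predictions above the configured threshold.
threshold = 0.7
kept = [p for p in predictions if p["confidence"] >= threshold]
print([p["label"] for p in kept])  # ['invoice']
```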
Object Detection
Detect and locate objects in images.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Defect Detection",
    model_type="detection",
    model_id=detector_model.id,
    output_target="annotations",
    confidence_threshold=0.5,
    auto_create_labels=True,
    config={
        "nms_threshold": 0.4,   # Non-max suppression
        "max_detections": 100
    }
)
```

Output Format
```json
{
  "detections": [
    {
      "label": "scratch",
      "confidence": 0.87,
      "x": 0.25,
      "y": 0.30,
      "width": 0.15,
      "height": 0.08
    },
    {
      "label": "dent",
      "confidence": 0.72,
      "x": 0.60,
      "y": 0.55,
      "width": 0.20,
      "height": 0.18
    }
  ],
  "image_width": 1920,
  "image_height": 1080
}
```

Annotation Format
Bounding boxes are stored in YOLO format: `x_center y_center width height` (normalized to 0-1).
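Since boxes are normalized YOLO-style, converting a detection to pixel corner coordinates only needs the `image_width`/`image_height` fields from the output. This helper is an illustration, not an SDK function:

```python
def yolo_to_pixels(det, image_width, image_height):
    """Convert a normalized (x_center, y_center, width, height) box
    to pixel (left, top, right, bottom) coordinates."""
    left = (det["x"] - det["width"] / 2) * image_width
    top = (det["y"] - det["height"] / 2) * image_height
    right = (det["x"] + det["width"] / 2) * image_width
    bottom = (det["y"] + det["height"] / 2) * image_height
    return left, top, right, bottom

# The "scratch" detection from the example output, on a 1920x1080 image.
box = yolo_to_pixels(
    {"x": 0.25, "y": 0.30, "width": 0.15, "height": 0.08}, 1920, 1080
)
print(box)  # roughly (336.0, 280.8, 624.0, 367.2)
```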
Named Entity Recognition (NER)
Extract named entities from text.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Entity Extraction",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    auto_create_labels=True,
    config={
        "entity_types": ["PERSON", "ORG", "DATE", "MONEY"]  # Optional filter
    }
)
```

Output Format
```json
{
  "text": "John Smith from Acme Corp signed on January 15, 2024 for $50,000.",
  "entities": [
    {
      "text": "John Smith",
      "label": "PERSON",
      "start": 0,
      "end": 10,
      "confidence": 0.95
    },
    {
      "text": "Acme Corp",
      "label": "ORG",
      "start": 16,
      "end": 25,
      "confidence": 0.92
    },
    {
      "text": "January 15, 2024",
      "label": "DATE",
      "start": 36,
      "end": 52,
      "confidence": 0.98
    },
    {
      "text": "$50,000",
      "label": "MONEY",
      "start": 57,
      "end": 64,
      "confidence": 0.96
    }
  ]
}
```

Annotation Format
Entity positions are stored as character offsets: `start` and `end`.
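Because the offsets are plain character indices, an entity's text can be recovered by slicing the source string — a quick way to sanity-check annotations (illustrative code, not an SDK call):

```python
# Text and two entities from the example output above.
text = "John Smith from Acme Corp signed on January 15, 2024 for $50,000."
entities = [
    {"text": "John Smith", "label": "PERSON", "start": 0, "end": 10},
    {"text": "Acme Corp", "label": "ORG", "start": 16, "end": 25},
]

# Slicing with the stored offsets reproduces each entity's text exactly.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["text"]
```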
OCR (Optical Character Recognition)
Extract text from images and PDFs.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document OCR",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",  # or "both" to also get word positions
    config={
        "languages": ["en", "de"],  # Expected languages
        "dpi": 300                  # For PDF rendering
    }
)
```

Output Format
```json
{
  "text": "INVOICE\n\nInvoice Number: INV-2024-001\nDate: January 15, 2024\n\nBill To:\nAcme Corporation\n123 Main Street...",
  "pages": [
    {
      "page_number": 1,
      "text": "INVOICE\n\nInvoice Number: INV-2024-001...",
      "confidence": 0.94,
      "words": [
        {
          "text": "INVOICE",
          "x": 0.4,
          "y": 0.05,
          "width": 0.2,
          "height": 0.03,
          "confidence": 0.99
        }
      ]
    }
  ]
}
```

Supported Formats
- Images: JPG, PNG, TIFF, BMP
- Documents: PDF (multi-page supported)
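One way to use the per-page confidences is to flag pages for manual review. In this sketch, the second, low-confidence page is invented for illustration and the 0.8 cutoff is arbitrary:

```python
# Trimmed OCR output: page 1 from the example above, plus a fabricated
# low-confidence second page to make the filter interesting.
ocr_result = {
    "pages": [
        {"page_number": 1, "text": "INVOICE\n\nInvoice Number: INV-2024-001...", "confidence": 0.94},
        {"page_number": 2, "text": "(blurry scan)", "confidence": 0.61},
    ]
}

# Flag pages whose overall confidence is below the review cutoff.
needs_review = [p["page_number"] for p in ocr_result["pages"] if p["confidence"] < 0.8]
print(needs_review)  # [2]
```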
LLM Extraction
Use language models for custom extraction tasks.
Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Invoice Field Extraction",
    model_type="llm",
    model_id=llm_model.id,
    output_target="text",
    # Custom extraction prompt
    prompt="""
    Extract the following fields from this document:
    - invoice_number
    - date
    - vendor_name
    - total_amount
    - line_items (array of {description, quantity, price})
    Return as valid JSON only, no explanation.
    """,
    config={
        "temperature": 0.1,  # Low for consistent output
        "max_tokens": 1000
    }
)
```

Using External Providers
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="GPT-4 Extraction",
    model_type="llm",
    external_provider="openai",
    external_model="gpt-4-turbo",
    external_config={
        "api_key": "sk-...",
        "temperature": 0.2
    },
    prompt="Extract customer name and order details. Return JSON.",
    output_target="text"
)
```

Output Format
Returns whatever the LLM generates based on your prompt:
```json
{
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "vendor_name": "Acme Supplies Inc.",
  "total_amount": 1250.00,
  "line_items": [
    {
      "description": "Widget A",
      "quantity": 10,
      "price": 50.00
    },
    {
      "description": "Widget B",
      "quantity": 15,
      "price": 50.00
    }
  ]
}
```

Type Comparison
| Type | Input | Output | Auto Labels | Best For |
|---|---|---|---|---|
| `stt` | Audio/Video | Text | No | Transcription |
| `stt-diarization` | Audio | Text + Speakers | No | Meeting recordings |
| `classification` | Any | Labels | Yes | Categorization |
| `detection` | Images | Bounding boxes | Yes | Object location |
| `ner` | Text | Entity spans | Yes | Information extraction |
| `ocr` | Images/PDF | Text | No | Document digitization |
| `llm` | Any | Custom JSON | No | Complex extraction |
Choosing the Right Type
```mermaid
graph TD
    A{What's your input?} --> B[Audio/Video]
    A --> C[Images]
    A --> D[Text]
    A --> E[Documents/PDF]
    B --> F{Need speakers?}
    F -->|Yes| G[stt-diarization]
    F -->|No| H[stt]
    C --> I{What do you need?}
    I -->|Categorize| J[classification]
    I -->|Find objects| K[detection]
    I -->|Extract text| L[ocr]
    D --> M{What do you need?}
    M -->|Categorize| J
    M -->|Find entities| N[ner]
    M -->|Custom extraction| O[llm]
    E --> P[ocr → then ner or llm]
```
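The decision tree above can also be expressed as a small helper. The function name, argument names, and the exact `need` keys are all illustrative, not part of the API:

```python
def choose_processor_type(input_kind, need=None):
    """Map an input kind (and optionally what you need from it)
    to a model_type, following the decision tree above."""
    if input_kind in ("audio", "video"):
        return "stt-diarization" if need == "speakers" else "stt"
    if input_kind == "image":
        return {"categorize": "classification",
                "find_objects": "detection",
                "extract_text": "ocr"}[need]
    if input_kind == "text":
        return {"categorize": "classification",
                "find_entities": "ner",
                "custom_extraction": "llm"}[need]
    if input_kind == "pdf":
        return "ocr"  # then feed the extracted text to ner or llm
    raise ValueError(f"unknown input kind: {input_kind}")

print(choose_processor_type("audio", need="speakers"))  # stt-diarization
```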