# Post-Processors
Post-processors automatically process data when it’s uploaded to a dataset. Configure AI models to run on every new item—transcribe audio, extract entities, classify documents, and more.
## What Are Post-Processors?
Post-processors are AI pipelines that trigger automatically when data enters a dataset. They’re ideal for:
- Automated transcription - Convert audio/video to text
- Entity extraction - Pull names, dates, amounts from documents
- Classification - Auto-categorize incoming data
- Data enrichment - Add AI-generated metadata
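To make the trigger model concrete, here is a minimal sketch of the upload-triggers-processors idea. The `Processor` class and `run_processors` helper are illustrative only, not part of the SDK:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Processor:
    name: str
    order: int
    run: Callable[[dict], dict]

def run_processors(item: dict, processors: list[Processor]) -> dict:
    """Run every processor against a newly uploaded item, lowest order first."""
    for proc in sorted(processors, key=lambda p: p.order):
        item.update(proc.run(item))
    return item

# A fake "STT" step followed by a fake "NER" step that reads the STT output
stt = Processor("stt", 1, lambda item: {"text": f"transcript of {item['file']}"})
ner = Processor("ner", 2, lambda item: {"entities": item["text"].split()[-1:]})

result = run_processors({"file": "meeting.mp3"}, [ner, stt])
print(result["text"])      # transcript of meeting.mp3
print(result["entities"])  # ['meeting.mp3']
```

Note that the NER step can consume the STT step's output only because the queue sorts by `order` before running anything.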
```mermaid
graph LR
    A[Upload Data] --> B[Post-Processor Queue]
    B --> C[AI Model]
    C --> D[Store Results]
    D --> E[Annotations/Text]
```

## Quick Start
### Create a Post-Processor

### Upload Data
When you upload files, post-processors run automatically:
```python
# Upload an audio file
item = client.create_dataset_item(
    version_id=version.id,
    split_id=split.id,
    item="./meeting.mp3"
)

# Post-processor runs in the background
# Check status
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    item_id=item.id
)

for job in jobs:
    print(f"{job.processor_name}: {job.status}")
```

## Processor Types
| Type | Code | Input | Output |
|---|---|---|---|
| Speech-to-Text | `stt` | Audio/Video | Text transcription |
| Speech Diarization | `stt-diarization` | Audio | Speaker-labeled transcript |
| Classification | `classification` | Any | Labels/Annotations |
| Object Detection | `detection` | Images | Bounding boxes |
| Named Entity Recognition | `ner` | Text | Entity annotations |
| OCR | `ocr` | Images/PDF | Extracted text |
| LLM Extraction | `llm` | Any | Custom extracted data |
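The Input column implies a compatibility check before a processor is attached to a dataset. A hypothetical sketch of that check; the mapping mirrors the table above and is not an SDK API:

```python
# Accepted input kinds per processor code, mirroring the table above
PROCESSOR_INPUTS = {
    "stt": {"audio", "video"},
    "stt-diarization": {"audio"},
    "classification": {"any"},
    "detection": {"image"},
    "ner": {"text"},
    "ocr": {"image", "pdf"},
    "llm": {"any"},
}

def accepts(processor_code: str, input_kind: str) -> bool:
    """Return True if the processor type can handle the given input kind."""
    allowed = PROCESSOR_INPUTS[processor_code]
    return "any" in allowed or input_kind in allowed

print(accepts("stt", "audio"))  # True
print(accepts("ocr", "audio"))  # False
print(accepts("llm", "audio"))  # True
```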
## Processor Configuration

### Basic Configuration
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document Classifier",
    model_type="classification",
    model_id=classifier_model.id,

    # Where to store results
    output_target="annotations",  # "text", "annotations", or "both"

    # Filtering
    confidence_threshold=0.7,  # Ignore low-confidence results

    # Label management
    auto_create_labels=True,  # Create new labels from predictions

    # Execution order (if multiple processors)
    order=1,

    # Enable/disable
    enabled=True
)
```

### Output Targets
| Target | Description | Use Case |
|---|---|---|
| `text` | Store in the item’s text field | Transcription, OCR |
| `annotations` | Create label annotations | Classification, NER |
| `both` | Both text and annotations | OCR with entity extraction |
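One way to picture the three targets is a router that writes a processor's result to the item's text field, its annotation list, or both. The structures below are illustrative, not the SDK's:

```python
def store_result(item: dict, result: str, output_target: str) -> dict:
    """Route a processor result into the item according to output_target."""
    if output_target in ("text", "both"):
        item["text"] = result
    if output_target in ("annotations", "both"):
        item.setdefault("annotations", []).append(result)
    return item

item = store_result({}, "invoice", "both")
print(item)  # {'text': 'invoice', 'annotations': ['invoice']}
```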
### Using External Providers
Use OpenAI, Anthropic, or other providers:
```python
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="GPT Extraction",
    model_type="llm",

    # External provider instead of internal model
    external_provider="openai",
    external_model="gpt-4",
    external_config={
        "api_key": "your-api-key",
        "temperature": 0.3
    },

    # Custom prompt
    prompt="""
    Extract the following from this document:
    - Customer name
    - Order number
    - Total amount
    Return as JSON.
    """,
    output_target="text"
)
```
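When an LLM is prompted to "return as JSON", the reply often arrives wrapped in prose or code fences, so downstream code should parse it defensively. A small sketch, independent of any provider SDK:

```python
import json
import re

def parse_llm_json(reply: str) -> dict:
    """Extract the first JSON object from an LLM reply, tolerating fences and prose."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))

reply = 'Sure! ```json\n{"customer": "Acme", "order": "A-42", "total": 99.5}\n```'
data = parse_llm_json(reply)
print(data["customer"], data["total"])  # Acme 99.5
```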
## Common Use Cases

### Audio/Video Library
Automatically transcribe all uploaded media:
```python
# STT processor for audio
stt_processor = client.create_post_processor(
    dataset_id=media_dataset.id,
    name="Transcribe Media",
    model_type="stt",
    model_id=whisper_model.id,
    output_target="text"
)

# NER processor to extract mentions (runs after STT)
ner_processor = client.create_post_processor(
    dataset_id=media_dataset.id,
    name="Extract Mentions",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    order=2  # Run after STT
)
```

### Document Intake
Process incoming documents automatically:
```python
# OCR for scanned documents
ocr_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Extract Text",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",
    order=1
)

# Classification
classify_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Classify Document",
    model_type="classification",
    model_id=doc_classifier.id,
    output_target="annotations",
    auto_create_labels=True,
    order=2
)

# LLM extraction
extract_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Extract Fields",
    model_type="llm",
    model_id=llm_model.id,
    prompt="Extract: date, amount, vendor name. Return JSON.",
    output_target="text",
    order=3
)
```

### Quality Inspection
Auto-classify defects in product images:
```python
processor = client.create_post_processor(
    dataset_id=inspection_dataset.id,
    name="Defect Detection",
    model_type="detection",
    model_id=defect_model.id,
    output_target="annotations",
    confidence_threshold=0.8,
    auto_create_labels=True
)
```

## Monitoring Processing
### Check Job Status
```python
# Get all jobs for a dataset
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    status="pending"  # or "processing", "completed", "failed"
)

for job in jobs:
    print(f"Item {job.item_id}: {job.status}")
    if job.error:
        print(f"  Error: {job.error}")
```

### Retry Failed Jobs
```python
# Retry a specific failed job
client.retry_post_processor_job(
    dataset_id=dataset.id,
    item_id=item.id,
    job_id=job.id
)

# Retry all failed jobs for a processor
failed_jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    processor_id=processor.id,
    status="failed"
)

for job in failed_jobs:
    client.retry_post_processor_job(
        dataset_id=dataset.id,
        item_id=job.item_id,
        job_id=job.id
    )
```

## Best Practices
- Order matters - Set `order` when chaining processors
- Set confidence thresholds - Filter out low-quality predictions
- Monitor failures - Check job status regularly
- Use appropriate models - Match model type to your data
- Test before enabling - Verify results on sample data first
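The "set confidence thresholds" practice amounts to a simple filter over model predictions before they are stored. A sketch with illustrative data structures:

```python
def filter_predictions(predictions: list[dict], confidence_threshold: float) -> list[dict]:
    """Keep only predictions at or above the confidence threshold."""
    return [p for p in predictions if p["confidence"] >= confidence_threshold]

preds = [
    {"label": "scratch", "confidence": 0.92},
    {"label": "dent", "confidence": 0.55},
]
print(filter_predictions(preds, 0.7))  # [{'label': 'scratch', 'confidence': 0.92}]
```

A higher threshold trades recall for precision: fewer annotations reach reviewers, but each one is more likely to be correct.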
## Integration with Workflows
Post-processors work independently, but you can also trigger workflows:
```python
# Create workflow that processes post-processor results
workflow = client.create_workflow(
    name="Process Transcriptions",
    trigger={
        "type": "post_processor_complete",
        "dataset_id": dataset.id,
        "processor_type": "stt"
    }
)
```
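The trigger above can be pictured as a small event registry: when a post-processor job completes, any workflow whose trigger matches fires. A conceptual sketch, not the platform's implementation:

```python
from collections import defaultdict

# Map (event_type, processor_type) -> workflow names to run
triggers = defaultdict(list)
triggers[("post_processor_complete", "stt")].append("Process Transcriptions")

def on_job_complete(processor_type: str) -> list[str]:
    """Return the workflows triggered when a processor of this type finishes."""
    return triggers[("post_processor_complete", processor_type)]

print(on_job_complete("stt"))  # ['Process Transcriptions']
print(on_job_complete("ocr"))  # []
```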