Post-Processors

Post-processors automatically process data when it’s uploaded to a dataset. Configure AI models to run on every new item—transcribe audio, extract entities, classify documents, and more.

What Are Post-Processors?

Post-processors are AI pipelines that trigger automatically when data enters a dataset. They’re ideal for:

  • Automated transcription - Convert audio/video to text
  • Entity extraction - Pull names, dates, amounts from documents
  • Classification - Auto-categorize incoming data
  • Data enrichment - Add AI-generated metadata

The processing flow:

graph LR
    A[Upload Data] --> B[Post-Processor Queue]
    B --> C[AI Model]
    C --> D[Store Results]
    D --> E[Annotations/Text]

Quick Start

Create a Post-Processor
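
Define a processor on the dataset before uploading. A minimal speech-to-text setup might look like this (a sketch using the parameters documented under Processor Configuration; `whisper_model` is assumed to be an STT model already registered in your account):

```python
# Create a post-processor that transcribes every new upload.
# Assumes `client` and `dataset` from your existing setup.
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Transcribe Audio",
    model_type="stt",
    model_id=whisper_model.id,
    output_target="text",  # store the transcript in the item's text field
    enabled=True
)
```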

Upload Data

When you upload files, post-processors run automatically:

# Upload an audio file
item = client.create_dataset_item(
    version_id=version.id,
    split_id=split.id,
    item="./meeting.mp3"
)

# Post-processor runs in background
# Check status
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    item_id=item.id
)

for job in jobs:
    print(f"{job.processor_name}: {job.status}")

Processor Types

| Type                     | Code            | Input       | Output                    |
|--------------------------|-----------------|-------------|---------------------------|
| Speech-to-Text           | stt             | Audio/Video | Text transcription        |
| Speech Diarization       | stt-diarization | Audio       | Speaker-labeled transcript |
| Classification           | classification  | Any         | Labels/Annotations        |
| Object Detection         | detection       | Images      | Bounding boxes            |
| Named Entity Recognition | ner             | Text        | Entity annotations        |
| OCR                      | ocr             | Images/PDF  | Extracted text            |
| LLM Extraction           | llm             | Any         | Custom extracted data     |

Processor Configuration

Basic Configuration

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Document Classifier",
    model_type="classification",
    model_id=classifier_model.id,

    # Where to store results
    output_target="annotations",  # "text", "annotations", or "both"

    # Filtering
    confidence_threshold=0.7,  # Ignore low-confidence results

    # Label management
    auto_create_labels=True,  # Create new labels from predictions

    # Execution order (if multiple processors)
    order=1,

    # Enable/disable
    enabled=True
)

Output Targets

| Target      | Description                 | Use Case                   |
|-------------|-----------------------------|----------------------------|
| text        | Store in item's text field  | Transcription, OCR         |
| annotations | Create label annotations    | Classification, NER        |
| both        | Both text and annotations   | OCR with entity extraction |
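
The routing semantics can be sketched in plain Python. This is an illustration of the behavior described above, not the platform's actual implementation:

```python
def route_output(result, item, output_target):
    """Illustrative sketch: store a processor result on an item
    according to output_target ("text", "annotations", or "both")."""
    if output_target in ("text", "both"):
        item["text"] = result["text"]
    if output_target in ("annotations", "both"):
        item.setdefault("annotations", []).extend(result["annotations"])
    return item

item = {"id": "doc-1"}
result = {"text": "Invoice #42", "annotations": [{"label": "invoice"}]}
route_output(result, item, "both")
```

With `output_target="both"`, the item ends up with the extracted text in its text field and one annotation per detected label.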

Using External Providers

Use OpenAI, Anthropic, or other providers:

processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="GPT Extraction",
    model_type="llm",

    # External provider instead of internal model
    external_provider="openai",
    external_model="gpt-4",
    external_config={
        "api_key": "your-api-key",  # load from an env var/secret store, not source
        "temperature": 0.3
    },

    # Custom prompt
    prompt="""
    Extract the following from this document:
    - Customer name
    - Order number
    - Total amount

    Return as JSON.
    """,

    output_target="text"
)
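
Because the prompt asks for JSON and `output_target="text"` stores the raw model output, you typically parse the stored text downstream. A defensive sketch (the fence-stripping handles a habit some models have of wrapping JSON in code fences; requires Python 3.9+ for `removeprefix`):

```python
import json

def parse_extraction(text):
    """Parse an LLM extraction result stored as JSON text,
    tolerating Markdown code fences around the payload (sketch)."""
    cleaned = text.strip().removeprefix("```json").removesuffix("```").strip()
    return json.loads(cleaned)

fields = parse_extraction('{"customer": "Acme", "order": "A-17", "total": 99.5}')
```

`fields` is then a plain dict you can validate before trusting the extracted values.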

Common Use Cases

Audio/Video Library

Automatically transcribe all uploaded media:

# STT processor for audio
stt_processor = client.create_post_processor(
    dataset_id=media_dataset.id,
    name="Transcribe Media",
    model_type="stt",
    model_id=whisper_model.id,
    output_target="text"
)

# NER processor to extract mentions (runs after STT)
ner_processor = client.create_post_processor(
    dataset_id=media_dataset.id,
    name="Extract Mentions",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    order=2  # Run after STT
)
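
The effect of `order` can be sketched as a simple loop: processors run lowest order first, and each sees the item as left by the previous one, which is what lets NER consume the transcript STT produced. A plain-Python illustration, not the platform's actual scheduler:

```python
def run_chain(item, processors):
    """Run processors in ascending `order`; each receives the item
    as modified by the previous processor (illustrative sketch)."""
    for proc in sorted(processors, key=lambda p: p["order"]):
        item = proc["run"](item)
    return item

stt = {"order": 1, "run": lambda item: {**item, "text": "hello world"}}
ner = {"order": 2, "run": lambda item: {**item, "annotations": item["text"].split()}}

# List position does not matter; `order` decides.
item = run_chain({"id": "clip-1"}, [ner, stt])
```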

Document Intake

Process incoming documents automatically:

# OCR for scanned documents
ocr_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Extract Text",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",
    order=1
)

# Classification
classify_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Classify Document",
    model_type="classification",
    model_id=doc_classifier.id,
    output_target="annotations",
    auto_create_labels=True,
    order=2
)

# LLM extraction
extract_processor = client.create_post_processor(
    dataset_id=docs_dataset.id,
    name="Extract Fields",
    model_type="llm",
    model_id=llm_model.id,
    prompt="Extract: date, amount, vendor name. Return JSON.",
    output_target="text",
    order=3
)

Quality Inspection

Auto-classify defects in product images:

processor = client.create_post_processor(
    dataset_id=inspection_dataset.id,
    name="Defect Detection",
    model_type="detection",
    model_id=defect_model.id,
    output_target="annotations",
    confidence_threshold=0.8,
    auto_create_labels=True
)

Monitoring Processing

Check Job Status

# Get all jobs for a dataset
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    status="pending"  # or "processing", "completed", "failed"
)

for job in jobs:
    print(f"Item {job.item_id}: {job.status}")
    if job.error:
        print(f"  Error: {job.error}")

Retry Failed Jobs

# Retry a specific failed job
client.retry_post_processor_job(
    dataset_id=dataset.id,
    item_id=item.id,
    job_id=job.id
)

# Retry all failed jobs for a processor
failed_jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    processor_id=processor.id,
    status="failed"
)

for job in failed_jobs:
    client.retry_post_processor_job(
        dataset_id=dataset.id,
        item_id=job.item_id,
        job_id=job.id
    )

Best Practices

  1. Order matters - Set order when chaining processors
  2. Set confidence thresholds - Filter out low-quality predictions
  3. Monitor failures - Check job status regularly
  4. Use appropriate models - Match model type to your data
  5. Test before enabling - Verify results on sample data first
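
For point 3, a small standard-library summary over the job list makes failures stand out at a glance. The job dicts here are illustrative; in practice you would feed in the results of get_post_processor_jobs:

```python
from collections import Counter

def summarize_jobs(jobs):
    """Count jobs per status so failures are visible at a glance.
    `jobs` is any iterable of dicts with a "status" field."""
    return Counter(job["status"] for job in jobs)

jobs = [
    {"status": "completed"}, {"status": "completed"},
    {"status": "failed"}, {"status": "pending"},
]
summary = summarize_jobs(jobs)
# Counter({'completed': 2, 'failed': 1, 'pending': 1})
```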

Integration with Workflows

Post-processors run on their own, but you can also trigger a workflow when one completes:

# Create workflow that processes post-processor results
workflow = client.create_workflow(
    name="Process Transcriptions",
    trigger={
        "type": "post_processor_complete",
        "dataset_id": dataset.id,
        "processor_type": "stt"
    }
)