Automated Labeling

Labeling data is often the biggest bottleneck in a machine learning project. Post-processors eliminate most of that manual work by using existing AI models to pre-label your data automatically.

The Problem

Manual labeling is slow and expensive:

| Labeling Method | Speed | Cost | Quality |
| --- | --- | --- | --- |
| Fully manual | ~100 items/hour | High | High (if expert) |
| Post-processor + review | ~1,000 items/hour | Low | High |
| Post-processor only | Unlimited | Minimal | Moderate |

The sweet spot: use post-processors to generate labels, then have humans review and correct them. This is 10x faster than labeling from scratch.

How It Works

graph LR
    A[Upload Raw Data] --> B[Post-Processor Runs]
    B --> C[Auto-Generated Labels]
    C --> D[Human Review]
    D --> E[Corrected Labels]
    E --> F[Training-Ready Dataset]
  1. Upload your unlabeled data to a dataset
  2. Post-processors run automatically on every item
  3. Labels appear as annotations on each item
  4. Review and correct mistakes in the annotation interface
  5. Train on the corrected dataset

Setup

Step 1: Choose Your Labeling Strategy

Pick the right post-processor type for your task:

| Task | Post-Processor Type | What It Produces |
| --- | --- | --- |
| Categorize images | classification | Category labels |
| Find objects in images | detection | Bounding boxes |
| Extract entities from text | ner | Entity spans |
| Classify text/documents | classification | Category labels |
| Extract structured data | llm | Custom fields |
| Read text from images | ocr | Text content |

Step 2: Create the Post-Processor
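
A post-processor is created with the create_post_processor call used throughout the examples below. As a sketch, here is the configuration a basic image-classification labeler would need; the field values are illustrative, and the commented-out call assumes the client, dataset, and model objects from the surrounding examples:

```python
# Illustrative configuration for a classification labeler.
# Field names mirror the create_post_processor examples later in this guide.
processor_config = {
    "name": "Auto Labeler",
    "model_type": "classification",   # pick from the task table above
    "output_target": "annotations",   # write predictions as annotations
    "auto_create_labels": True,       # create label definitions as they appear
    "confidence_threshold": 0.7,      # discard low-confidence predictions
    "enabled": True,
}

# processor = client.create_post_processor(
#     dataset_id=dataset.id,
#     model_id=model.id,
#     **processor_config,
# )
```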

Step 3: Upload Data

Upload your unlabeled data. Post-processors run automatically:

import glob

# Upload images
for image_path in glob.glob("./unlabeled_images/*.jpg"):
    client.create_dataset_item(
        version_id=version.id,
        split_id=split.id,
        file_path=image_path
    )

# Monitor processing
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    status="pending"
)
print(f"{len(jobs)} items queued for labeling")

Step 4: Review and Correct

After processing completes, review the auto-generated labels in the annotation interface and correct any mistakes before training.
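
One practical way to organize the review pass is to sort auto-labeled items by confidence so the least certain predictions come first. A minimal sketch in plain Python (the annotation dicts are illustrative, not a platform API):

```python
def review_queue(annotations):
    """Order auto-generated annotations for human review,
    least confident first, so effort goes where errors are likeliest."""
    return sorted(annotations, key=lambda a: a["confidence"])

items = [
    {"item_id": 1, "label": "cat", "confidence": 0.97},
    {"item_id": 2, "label": "dog", "confidence": 0.55},
    {"item_id": 3, "label": "cat", "confidence": 0.81},
]

for a in review_queue(items):
    print(a["item_id"], a["label"], a["confidence"])
```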

Using an LLM as a Labeling Oracle

Large language models can label data with remarkable accuracy, especially for tasks that benefit from reasoning:

# Use a large model (e.g., Ollama-hosted LLM) as a labeling oracle
llm_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="LLM Oracle",
    model_type="llm",
    model_id=large_llm.id,
    prompt="""
    You are an expert annotator. Look at this image and:
    1. Identify what the main subject is
    2. Classify it into one of these categories:
       {label_list}
    3. Rate your confidence (low/medium/high)

    Return JSON: {"label": "...", "confidence": "..."}
    """,
    output_target="annotations",
    auto_create_labels=True,
    enabled=True
)

This approach is central to Model Distillation—the LLM generates the training data, and you train a smaller, faster model on those labels.

Chaining Post-Processors for Complex Labeling

For multi-step labeling tasks, chain processors in sequence:

# Step 1: OCR to extract text from documents
ocr_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Extract Text",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",
    order=1
)

# Step 2: NER to find entities in extracted text
ner_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Find Entities",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    auto_create_labels=True,
    order=2
)

# Step 3: LLM to classify based on content
classify_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Classify Document",
    model_type="llm",
    model_id=llm_model.id,
    prompt="Based on this document, classify it as: invoice, contract, letter, or report. Return only the category.",
    output_target="annotations",
    auto_create_labels=True,
    order=3
)

Setting Confidence Thresholds

Not all predictions are equally reliable. Use confidence thresholds to control quality:

# High threshold: only keep very confident predictions
# Fewer auto-labels, but higher accuracy
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Conservative Labeler",
    model_type="classification",
    model_id=model.id,
    output_target="annotations",
    confidence_threshold=0.9,  # Only keep 90%+ confidence
    auto_create_labels=True,
    enabled=True
)

# Low threshold: keep more predictions
# More auto-labels, but more corrections needed
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Aggressive Labeler",
    model_type="classification",
    model_id=model.id,
    output_target="annotations",
    confidence_threshold=0.5,  # Keep 50%+ confidence
    auto_create_labels=True,
    enabled=True
)
  • 0.9+: Use when label accuracy is critical and you’d rather label manually than be wrong
  • 0.7-0.9: Good default for most tasks—review flagged items
  • 0.5-0.7: Use when you have time to review and want maximum coverage
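The effect of a threshold is easy to see locally. This hypothetical helper splits predictions into auto-accepted labels and a review pile; raising the threshold moves items from the first bucket into the second:

```python
def split_by_confidence(predictions, threshold):
    """Partition predictions: confident ones are kept as labels,
    the rest are flagged for human review."""
    kept = [p for p in predictions if p["confidence"] >= threshold]
    flagged = [p for p in predictions if p["confidence"] < threshold]
    return kept, flagged

preds = [
    {"label": "invoice", "confidence": 0.95},
    {"label": "letter", "confidence": 0.72},
    {"label": "report", "confidence": 0.48},
]

kept, flagged = split_by_confidence(preds, threshold=0.9)
print(len(kept), len(flagged))  # conservative: 1 kept, 2 flagged
```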

Active Learning Pattern

Combine automated labeling with iterative model improvement:

graph TD
    A[Start: Pre-trained Model] --> B[Auto-label batch of data]
    B --> C[Human reviews corrections]
    C --> D[Retrain model on corrected data]
    D --> E{Model improved?}
    E -->|Yes| F[Auto-label next batch]
    F --> C
    E -->|No| G[Need more diverse data]
    G --> H[Upload new examples]
    H --> B
  1. Start with a pre-trained or LLM-based post-processor
  2. Auto-label a batch of data
  3. Review and correct the labels
  4. Train a new model on the corrected data
  5. Replace the post-processor with your improved model
  6. Repeat—each iteration produces better labels faster
# Iteration 1: Use LLM for initial labels
# (see setup above)

# After review and training...

# Iteration 2: Use your trained model (faster, cheaper)
client.update_post_processor(
    processor_id=processor.id,
    model_id=trained_model_v1.id,  # Your newly trained model
    model_type="classification"
)

# Upload next batch - now labeled by your own model
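
A simple stopping rule for step 6 is to keep iterating while review accuracy improves by more than a small margin. A sketch with illustrative numbers, where accuracy is measured against human-corrected labels:

```python
def should_continue(accuracy_history, min_gain=0.01):
    """Stop the active-learning loop once the last iteration's
    accuracy gain over the previous one falls below min_gain."""
    if len(accuracy_history) < 2:
        return True  # not enough iterations to judge yet
    return accuracy_history[-1] - accuracy_history[-2] > min_gain

print(should_continue([0.72, 0.81]))         # big gain: keep going
print(should_continue([0.72, 0.81, 0.815]))  # marginal gain: stop
```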

Best Practices

  1. Start with a sample - Run on 100 items first and check quality before processing thousands
  2. Use confidence thresholds - Don’t trust every prediction equally
  3. Always review - Even 95% accurate auto-labeling means 1 in 20 items is wrong
  4. Track accuracy - Compare auto-labels vs. human-corrected labels to measure quality
  5. Iterate - Replace the labeling model with your retrained model each cycle
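
For point 4, agreement between the auto-labels and the human-corrected versions of the same items is a direct quality measure. A minimal sketch (label lists are illustrative):

```python
def auto_label_accuracy(auto_labels, corrected_labels):
    """Fraction of auto-generated labels the human reviewer kept unchanged."""
    if not auto_labels:
        return 0.0
    matches = sum(a == c for a, c in zip(auto_labels, corrected_labels))
    return matches / len(auto_labels)

auto = ["cat", "dog", "cat", "bird", "dog"]
corrected = ["cat", "dog", "dog", "bird", "dog"]
print(auto_label_accuracy(auto, corrected))  # 0.8
```

If this number climbs across iterations, the active-learning loop above is paying off; if it plateaus, that is the signal to add more diverse data.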

Next Step

Once you have a high-quality labeled dataset, use it for Model Distillation—train a smaller, faster model that matches the quality of the large model that labeled the data.