Automated Labeling
Labeling data is often the biggest bottleneck in a machine learning project. Post-processors eliminate most of that manual work by using existing AI models to pre-label your data automatically.
The Problem
Manual labeling is slow and expensive:
| Labeling Method | Speed | Cost | Quality |
|---|---|---|---|
| Fully manual | ~100 items/hour | High | High (if expert) |
| Post-processor + review | ~1000 items/hour | Low | High |
| Post-processor only | Limited only by compute | Minimal | Moderate |
The sweet spot: use post-processors to generate labels, then have humans review and correct them. Reviewing pre-labeled data is roughly 10x faster than labeling from scratch.
How It Works
```mermaid
graph LR
    A[Upload Raw Data] --> B[Post-Processor Runs]
    B --> C[Auto-Generated Labels]
    C --> D[Human Review]
    D --> E[Corrected Labels]
    E --> F[Training-Ready Dataset]
```
- Upload your unlabeled data to a dataset
- Post-processors run automatically on every item
- Labels appear as annotations on each item
- Review and correct mistakes in the annotation interface
- Train on the corrected dataset
Setup
Step 1: Choose Your Labeling Strategy
Pick the right post-processor type for your task:
| Task | Post-Processor Type | What It Produces |
|---|---|---|
| Categorize images | classification | Category labels |
| Find objects in images | detection | Bounding boxes |
| Extract entities from text | ner | Entity spans |
| Classify text/documents | classification | Category labels |
| Extract structured data | llm | Custom fields |
| Read text from images | ocr | Text content |
Step 2: Create the Post-Processor
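A minimal sketch of this step, mirroring the `create_post_processor` fields used in the examples later in this guide. The name and values below are illustrative defaults, not required settings:

```python
# Configuration for a new post-processor. Field names mirror the
# create_post_processor calls shown later in this guide.
processor_config = {
    "name": "Auto Labeler",
    "model_type": "classification",   # pick the type from the table in Step 1
    "output_target": "annotations",   # write predictions as reviewable annotations
    "auto_create_labels": True,       # create label classes from predictions
    "confidence_threshold": 0.8,      # drop low-confidence predictions
    "enabled": True,                  # run on every newly uploaded item
}
# processor = client.create_post_processor(
#     dataset_id=dataset.id, model_id=model.id, **processor_config
# )
```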
Step 3: Upload Data
Upload your unlabeled data. Post-processors run automatically:
```python
import glob

# Upload images
for image_path in glob.glob("./unlabeled_images/*.jpg"):
    client.create_dataset_item(
        version_id=version.id,
        split_id=split.id,
        file_path=image_path
    )

# Monitor processing
jobs = client.get_post_processor_jobs(
    dataset_id=dataset.id,
    status="pending"
)
print(f"{len(jobs)} items queued for labeling")
```
Step 4: Review and Correct
After processing completes, review the auto-generated labels in the annotation interface and correct any mistakes.
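A self-contained sketch of that review pass, assuming each annotation carries a confidence score; the field names and `build_review_queue` helper here are hypothetical, not part of the platform API:

```python
# Hypothetical review pass: auto-labels below a confidence cutoff are
# routed to a human review queue; the rest are accepted as-is.
def build_review_queue(annotations, cutoff=0.8):
    accepted = [a for a in annotations if a["confidence"] >= cutoff]
    needs_review = [a for a in annotations if a["confidence"] < cutoff]
    return accepted, needs_review

auto_labels = [
    {"item": "img_001.jpg", "label": "cat", "confidence": 0.97},
    {"item": "img_002.jpg", "label": "dog", "confidence": 0.62},
    {"item": "img_003.jpg", "label": "cat", "confidence": 0.88},
]
accepted, needs_review = build_review_queue(auto_labels)
print(f"{len(accepted)} accepted, {len(needs_review)} flagged for review")
```

Reviewers then only touch the flagged items, which is where most of the 10x speedup comes from.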
Using an LLM as a Labeling Oracle
Large language models can label data with remarkable accuracy, especially for tasks that benefit from reasoning:
```python
# Use a large model (e.g., Ollama-hosted LLM) as a labeling oracle
llm_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="LLM Oracle",
    model_type="llm",
    model_id=large_llm.id,
    prompt="""
You are an expert annotator. Look at this image and:
1. Identify what the main subject is
2. Classify it into one of these categories:
{label_list}
3. Rate your confidence (low/medium/high)
Return JSON: {"label": "...", "confidence": "..."}
""",
    output_target="annotations",
    auto_create_labels=True,
    enabled=True
)
```
This approach is central to Model Distillation—the LLM generates the training data, and you train a smaller, faster model on those labels.
Chaining Post-Processors for Complex Labeling
For multi-step labeling tasks, chain processors in sequence:
```python
# Step 1: OCR to extract text from documents
ocr_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Extract Text",
    model_type="ocr",
    model_id=ocr_model.id,
    output_target="text",
    order=1
)

# Step 2: NER to find entities in extracted text
ner_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Find Entities",
    model_type="ner",
    model_id=ner_model.id,
    output_target="annotations",
    auto_create_labels=True,
    order=2
)

# Step 3: LLM to classify based on content
classify_processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Classify Document",
    model_type="llm",
    model_id=llm_model.id,
    prompt="Based on this document, classify it as: invoice, contract, letter, or report. Return only the category.",
    output_target="annotations",
    auto_create_labels=True,
    order=3
)
```
Setting Confidence Thresholds
Not all predictions are equally reliable. Use confidence thresholds to control quality:
```python
# High threshold: only keep very confident predictions
# Fewer auto-labels, but higher accuracy
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Conservative Labeler",
    model_type="classification",
    model_id=model.id,
    output_target="annotations",
    confidence_threshold=0.9,  # Only keep 90%+ confidence
    auto_create_labels=True,
    enabled=True
)

# Low threshold: keep more predictions
# More auto-labels, but more corrections needed
processor = client.create_post_processor(
    dataset_id=dataset.id,
    name="Aggressive Labeler",
    model_type="classification",
    model_id=model.id,
    output_target="annotations",
    confidence_threshold=0.5,  # Keep 50%+ confidence
    auto_create_labels=True,
    enabled=True
)
```
- 0.9+: Use when label accuracy is critical and you’d rather label manually than be wrong
- 0.7-0.9: Good default for most tasks—review flagged items
- 0.5-0.7: Use when you have time to review and want maximum coverage
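The trade-off behind those ranges can be demonstrated with a self-contained sketch; the predictions and ground-truth labels below are made up for illustration:

```python
# Toy demonstration of how the confidence threshold trades coverage
# (fraction of items auto-labeled) against accuracy of the kept labels.
predictions = [  # (predicted label, confidence, true label)
    ("cat", 0.95, "cat"), ("dog", 0.92, "dog"), ("cat", 0.85, "cat"),
    ("dog", 0.65, "cat"), ("cat", 0.55, "cat"), ("dog", 0.45, "cat"),
]

def coverage_and_accuracy(preds, threshold):
    kept = [(p, t) for p, c, t in preds if c >= threshold]
    coverage = len(kept) / len(preds)
    accuracy = sum(p == t for p, t in kept) / len(kept) if kept else 0.0
    return coverage, accuracy

for threshold in (0.9, 0.7, 0.5):
    cov, acc = coverage_and_accuracy(predictions, threshold)
    print(f"threshold {threshold}: coverage {cov:.0%}, accuracy {acc:.0%}")
```

Raising the threshold shrinks the auto-labeled set but makes it cleaner; lowering it covers more items at the cost of more human corrections.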
Active Learning Pattern
Combine automated labeling with iterative model improvement:
```mermaid
graph TD
    A[Start: Pre-trained Model] --> B[Auto-label batch of data]
    B --> C[Human reviews corrections]
    C --> D[Retrain model on corrected data]
    D --> E{Model improved?}
    E -->|Yes| F[Auto-label next batch]
    F --> C
    E -->|No| G[Need more diverse data]
    G --> H[Upload new examples]
    H --> B
```
- Start with a pre-trained or LLM-based post-processor
- Auto-label a batch of data
- Review and correct the labels
- Train a new model on the corrected data
- Replace the post-processor with your improved model
- Repeat—each iteration produces better labels faster
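The loop above can be sketched end to end. `auto_label` and `review` are purely illustrative stand-ins for the platform calls (post-processor run, human review), with accuracy assumed to improve after each retraining:

```python
# Illustrative active-learning loop. Each helper is a stand-in for a
# real platform call, and the accuracy curve is an assumption.
def auto_label(batch, model_version):
    # Pretend the model gets better with each retrained version.
    accuracy = min(0.6 + 0.1 * model_version, 0.95)
    return [(item, accuracy) for item in batch]

def review(labeled):
    # Expected number of human corrections = items * (1 - accuracy).
    return sum(1 - acc for _, acc in labeled)

model_version = 0
for batch_num in range(3):
    batch = [f"item_{batch_num}_{i}" for i in range(100)]
    labeled = auto_label(batch, model_version)
    corrections = review(labeled)
    print(f"batch {batch_num}: ~{corrections:.0f} corrections needed")
    model_version += 1  # retrain on the corrected batch
```

Each pass needs fewer corrections than the last, which is exactly the payoff the pattern is after.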
```python
# Iteration 1: Use LLM for initial labels
# (see setup above)

# After review and training...

# Iteration 2: Use your trained model (faster, cheaper)
client.update_post_processor(
    processor_id=processor.id,
    model_id=trained_model_v1.id,  # Your newly trained model
    model_type="classification"
)

# Upload next batch - now labeled by your own model
```
Best Practices
- Start with a sample - Run on 100 items first and check quality before processing thousands
- Use confidence thresholds - Don’t trust every prediction equally
- Always review - Even 95% accurate auto-labeling means 1 in 20 items is wrong
- Track accuracy - Compare auto-labels vs. human-corrected labels to measure quality
- Iterate - Replace the labeling model with your retrained model each cycle
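Tracking accuracy can be as simple as measuring how many auto-labels survived human review unchanged; the field names below are hypothetical:

```python
# Measure auto-label quality: the fraction of auto-labels that the
# human reviewer left unchanged. Field names are illustrative.
def auto_label_accuracy(items):
    agreed = sum(i["auto_label"] == i["corrected_label"] for i in items)
    return agreed / len(items)

reviewed = [
    {"auto_label": "cat", "corrected_label": "cat"},
    {"auto_label": "dog", "corrected_label": "dog"},
    {"auto_label": "cat", "corrected_label": "dog"},  # human fixed this one
    {"auto_label": "dog", "corrected_label": "dog"},
]
print(f"auto-label accuracy: {auto_label_accuracy(reviewed):.0%}")
```

If this number trends upward each cycle, the active learning loop is working; if it stalls, the model likely needs more diverse examples.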
Next Step
Once you have a high-quality labeled dataset, use it for Model Distillation—train a smaller, faster model that matches the quality of the large model that labeled the data.