# Model Distillation
Use a large, powerful model to create training data, then train a smaller model that runs faster and costs less while coming close to the large model's quality on your specific task.
## Why Distill?
Large models (LLMs, large vision models) are accurate but expensive to run. Smaller, specialized models can match their accuracy on specific tasks at a fraction of the cost.
| | Large Model (Teacher) | Small Model (Student) |
|---|---|---|
| Accuracy | High across many tasks | High on your specific task |
| Latency | 1-10 seconds | 5-50 milliseconds |
| Cost per prediction | $$$ | $ |
| Runs on edge/mobile | No | Yes |
| Requires GPU | Usually | Often not |
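The cost row is often the deciding factor. As a back-of-envelope check, a tiny helper (plain Python; the dollar figures in the example are assumptions, not platform pricing) tells you how many predictions it takes for the student to pay back its training cost:

```python
import math

def breakeven_calls(teacher_cost_per_call: float,
                    training_cost: float,
                    student_cost_per_call: float = 0.0) -> int:
    """Number of predictions after which running the student is cheaper overall.

    Assumes the student's per-call cost is (near) zero once trained,
    which is typical for small models on CPU.
    """
    saving_per_call = teacher_cost_per_call - student_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("Student must be cheaper per call than the teacher")
    # round() guards against float noise before taking the ceiling
    return math.ceil(round(training_cost / saving_per_call, 6))

# Example: $0.01 per teacher call, $50 of training compute
print(breakeven_calls(0.01, 50.0))  # 5000
```

At any realistic volume the payback point arrives quickly, which is why the latency gain, not the cost, is usually the harder constraint to verify.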
## The Process
```mermaid
graph TD
    A[1. Large Model Labels Data] --> B[2. Human Reviews Labels]
    B --> C[3. Train Small Model]
    C --> D[4. Evaluate Both Models]
    D --> E{Student matches teacher?}
    E -->|Yes| F[5. Deploy Small Model]
    E -->|No| G[Add More Training Data]
    G --> A
```

### Step 1: Generate Labels with the Large Model
Use a post-processor to label your data with a large model. This is the same process described in Automated Labeling.
### Step 2: Review the Generated Labels
After the teacher model labels your data, review a sample to ensure quality:
```python
# Check label distribution
items = client.get_dataset_items(
    version_id=version.id,
    split_id=split.id
)

label_counts = {}
for item in items:
    annotations = client.get_annotations(item_id=item.id)
    for ann in annotations:
        label_counts[ann.label] = label_counts.get(ann.label, 0) + 1

print("Label distribution:")
for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
    print(f"  {label}: {count}")
```

Use the web platform annotation interface for efficient review: items are pre-labeled, so you only need to fix errors rather than label from scratch.
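For larger datasets you may only review a sample. A small per-class sampler (plain Python, independent of the SDK; the 10% rate and the 5-item floor are assumptions you should tune) keeps rare labels represented in the review batch:

```python
import random
from collections import defaultdict

def sample_for_review(labeled_items, rate=0.10, min_per_class=5, seed=0):
    """Pick a per-class sample of (item_id, label) pairs for human review.

    labeled_items: iterable of (item_id, label) pairs from the teacher.
    Sampling per class rather than uniformly ensures rare labels
    still show up in the review batch.
    """
    by_label = defaultdict(list)
    for item_id, label in labeled_items:
        by_label[label].append(item_id)

    rng = random.Random(seed)
    review = []
    for label, ids in by_label.items():
        k = max(min_per_class, int(len(ids) * rate))
        k = min(k, len(ids))  # can't sample more than we have
        review.extend((i, label) for i in rng.sample(ids, k))
    return review
```

A uniform 10% sample of a skewed dataset can miss a rare class entirely; sampling per class avoids shipping a student that was trained on unreviewed labels for exactly the classes where the teacher is least reliable.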
### Step 3: Train the Small Model
Now train a smaller, faster model on the teacher-labeled data:
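The training call itself can be sketched as follows; the `client.create_job` usage and config keys mirror the Real-World Example later on this page, so treat them as illustrative rather than a complete schema:

```python
# Illustrative student training config; keys mirror the document-classification
# example later on this page. Pick the architecture from the table below.
student_config = {
    "architecture": "efficientnet_b0",  # a small student backbone
    "epochs": 25,
    "image_size": 384,
    "learning_rate": 0.001,
}

# Submitted via the platform client, e.g.:
# student_job = client.create_job(
#     dataset_id=dataset.id,
#     version_id=version.id,
#     name="Student Classifier",
#     config=student_config,
# )
```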
#### Choosing the Student Architecture
| Task | Large Model (Teacher) | Small Model (Student) | Typical Speedup |
|---|---|---|---|
| Image Classification | ResNet-152 / ViT-Large | MobileNet v2 / EfficientNet-B0 | 10-50x |
| Object Detection | YOLOv4-large | YOLOv4-tiny | 5-20x |
| Text Classification | BERT-large / LLM | DistilBERT / TinyBERT | 5-10x |
| NER | spaCy lg / LLM | spaCy sm / custom | 3-10x |
### Step 4: Evaluate Both Models
The critical step—compare teacher and student on the same validation set:
```python
# Create a held-out validation set that was NOT used for training.
# This should be human-labeled ground truth.

# Get teacher predictions on the validation set
teacher_results = client.predict(
    model_id=teacher_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Get student predictions on the validation set
student_results = client.predict(
    model_id=student_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Compare accuracy
teacher_correct = sum(1 for r in teacher_results if r.prediction == r.ground_truth)
student_correct = sum(1 for r in student_results if r.prediction == r.ground_truth)

teacher_accuracy = teacher_correct / len(teacher_results)
student_accuracy = student_correct / len(student_results)

print(f"Teacher accuracy: {teacher_accuracy:.1%}")
print(f"Student accuracy: {student_accuracy:.1%}")
print(f"Gap: {(teacher_accuracy - student_accuracy):.1%}")
```

#### What Results to Expect
| Scenario | Teacher | Student | Gap | Action |
|---|---|---|---|---|
| Excellent | 95% | 93% | 2% | Deploy student |
| Good | 95% | 90% | 5% | Acceptable for most use cases |
| Needs work | 95% | 85% | 10% | Add more training data, try larger student |
| Poor | 95% | 75% | 20% | Task may be too complex for small model |
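The thresholds in this table can be encoded directly. A sketch of the decision rule (plain Python; the cutoffs are taken from the rows above and should be adjusted to your quality bar):

```python
def distillation_verdict(teacher_acc: float, student_acc: float) -> str:
    """Map the teacher-student accuracy gap to the actions in the table above."""
    # round() keeps float noise (e.g. 0.95 - 0.93 != exactly 0.02) out of the comparison
    gap = round(teacher_acc - student_acc, 4)
    if gap <= 0.02:
        return "deploy student"
    if gap <= 0.05:
        return "acceptable for most use cases"
    if gap <= 0.10:
        return "add more training data, try larger student"
    return "task may be too complex for small model"

print(distillation_verdict(0.95, 0.91))  # acceptable for most use cases
```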
#### Per-Class Analysis
Don’t just look at overall accuracy. Check performance per class:
```python
from collections import defaultdict

# Per-class breakdown
class_stats = defaultdict(lambda: {"teacher_correct": 0, "student_correct": 0, "total": 0})

for t_result, s_result in zip(teacher_results, student_results):
    label = t_result.ground_truth
    class_stats[label]["total"] += 1
    if t_result.prediction == label:
        class_stats[label]["teacher_correct"] += 1
    if s_result.prediction == label:
        class_stats[label]["student_correct"] += 1

print(f"{'Class':<20} {'Teacher':<10} {'Student':<10} {'Gap':<10}")
print("-" * 50)
for label, stats in sorted(class_stats.items()):
    t_acc = stats["teacher_correct"] / stats["total"]
    s_acc = stats["student_correct"] / stats["total"]
    gap = t_acc - s_acc
    print(f"{label:<20} {t_acc:<10.1%} {s_acc:<10.1%} {gap:<10.1%}")
```

### Step 5: Deploy the Student Model
Once the student meets your quality bar, deploy it:
```python
# Optimize the student model for production
optimized = client.optimize_model(
    model_id=student_model.id,
    target_format="onnx",  # Fast inference format
    quantize=True,         # Reduce model size
    quantize_type="int8"   # 4x smaller, minimal accuracy loss
)

# Deploy to cloud API
deployment = client.deploy_model(
    model_id=optimized.id,
    name="Product Classifier v1",
    replicas=2
)

# Or export for edge/mobile
client.export_model(
    model_id=optimized.id,
    format="onnx",
    output_path="./product_classifier.onnx"
)
```

## Iterative Distillation
Distillation works best as an iterative process:
```mermaid
graph LR
    A[Cycle 1: 1000 items] --> B[Train Student v1]
    B --> C[Evaluate]
    C --> D[Identify Weak Classes]
    D --> E[Cycle 2: Add 500 items for weak classes]
    E --> F[Train Student v2]
    F --> G[Evaluate]
    G --> H[Deploy or Repeat]
```

```python
# After evaluating, find where the student struggles
weak_classes = []
for label, stats in class_stats.items():
    s_acc = stats["student_correct"] / stats["total"]
    if s_acc < 0.85:  # Below threshold
        weak_classes.append(label)
        print(f"Weak class: {label} ({s_acc:.1%})")

# Upload more examples specifically for the weak classes,
# then retrain the student.
```

## Real-World Example: Document Classification
Here’s a complete distillation pipeline for classifying scanned documents:
```python
from seeme import Client

client = Client()

# --- Phase 1: Teacher labels data ---
# Use an LLM to classify documents (accurate but slow/expensive)
teacher = client.create_post_processor(
    dataset_id=documents_dataset.id,
    name="GPT Document Classifier",
    model_type="llm",
    model_id=llm_model.id,
    prompt="""
    This is a scanned document. Based on the OCR text and visual layout,
    classify it as one of:
    - invoice
    - purchase_order
    - delivery_note
    - contract
    - correspondence
    Return only the category name.
    """,
    output_target="annotations",
    auto_create_labels=True,
    order=2  # After OCR processor
)

# Upload 2000 documents, let the teacher label them.
# Review ~200 (10%), correct errors.

# --- Phase 2: Train student ---
student_job = client.create_job(
    dataset_id=documents_dataset.id,
    version_id=version.id,
    name="Doc Classifier - EfficientNet B0",
    config={
        "architecture": "efficientnet_b0",
        "epochs": 25,
        "image_size": 384,
        "learning_rate": 0.001
    }
)

# --- Phase 3: Compare ---
# Teacher: 94% accuracy, 3.2s per document, $0.01 per call
# Student: 91% accuracy, 12ms per document, ~free after training
# Decision: 3% accuracy gap acceptable, deploy student

# --- Phase 4: Deploy the student, retire the teacher ---
# (student_model is the trained model produced by student_job)
client.deploy_model(
    model_id=student_model.id,
    name="Document Classifier Production"
)

# Disable the expensive LLM post-processor
client.update_post_processor(
    processor_id=teacher.id,
    enabled=False
)
```

## Best Practices
- Use enough training data - The student needs at least 100-500 examples per class
- Review teacher labels - Garbage in, garbage out
- Choose the right student size - Too small and it can’t learn; too big and you lose the benefit
- Always hold out a validation set - Use human-labeled ground truth, not teacher labels
- Check per-class performance - Overall accuracy can hide that one class is terrible
- Iterate - One round is rarely enough. Add data where the student struggles
- Consider the tradeoff - A 3% accuracy drop with 100x speedup is often worth it
## Next Step
Combine automated labeling and distillation into a fully automated End-to-End Pipeline.