Model Distillation

Use a large, powerful model to create training data, then train a smaller model that runs faster and costs less while closely matching the large model's quality on your task.

Why Distill?

Large models (LLMs, large vision models) are accurate but expensive to run. Smaller, specialized models can approach, and sometimes match, their accuracy on a specific task at a fraction of the cost.

|                     | Large Model (Teacher)  | Small Model (Student)      |
| ------------------- | ---------------------- | -------------------------- |
| Accuracy            | High across many tasks | High on your specific task |
| Latency             | 1-10 seconds           | 5-50 milliseconds          |
| Cost per prediction | $$$                    | $                          |
| Runs on edge/mobile | No                     | Yes                        |
| Requires GPU        | Usually                | Often not                  |

The Process

```mermaid
graph TD
    A[1. Large Model Labels Data] --> B[2. Human Reviews Labels]
    B --> C[3. Train Small Model]
    C --> D[4. Evaluate Both Models]
    D --> E{Student matches teacher?}
    E -->|Yes| F[5. Deploy Small Model]
    E -->|No| G[Add More Training Data]
    G --> A
```

Step 1: Generate Labels with the Large Model

Use a post-processor to label your data with a large model. This is the same process described in Automated Labeling.
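
In outline, teacher labeling is simply "run the big model over every unlabeled item and store its output as a provisional label". Here is a minimal sketch with a hypothetical `teacher_label` stand-in function; a real pipeline would call an LLM or large vision model through a post-processor, as in the document-classification example later on this page:

```python
# Stand-in for the real teacher: in practice this would be an LLM or
# large vision model invoked through a post-processor.
def teacher_label(text):
    return "invoice" if "total due" in text.lower() else "correspondence"

unlabeled_items = [
    "Invoice #1042 - Total due: $512.00",
    "Dear Ms. Alvarez, thank you for your note last week.",
]

# The teacher's output becomes the (provisional) training labels for the student.
teacher_labels = {doc: teacher_label(doc) for doc in unlabeled_items}
for doc, label in teacher_labels.items():
    print(f"{label:<15} {doc[:40]}")
```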

Step 2: Review the Generated Labels

After the teacher model labels your data, review a sample to ensure quality:

```python
# Check label distribution
items = client.get_dataset_items(
    version_id=version.id,
    split_id=split.id
)

label_counts = {}
for item in items:
    annotations = client.get_annotations(item_id=item.id)
    for ann in annotations:
        label_counts[ann.label] = label_counts.get(ann.label, 0) + 1

print("Label distribution:")
for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
    print(f"  {label}: {count}")
```
ℹ️ Review at least 5-10% of the auto-labeled data. Correct any mistakes before training the student model. The student can only be as good as its training data.

Use the web platform annotation interface for efficient review—items are pre-labeled, so you only need to fix errors rather than label from scratch.
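
Picking the review sample at random, rather than the first N items, avoids ordering bias from how the data was uploaded. A minimal sketch, assuming `items` is the list fetched above:

```python
import random

def review_sample(items, fraction=0.1, seed=42):
    """Return a random ~fraction of items for human review."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(items) * fraction))
    return rng.sample(items, k)

items = [f"item_{i}" for i in range(2000)]  # stand-in for dataset items
to_review = review_sample(items, fraction=0.10)
print(len(to_review))  # 200
```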

Step 3: Train the Small Model

Now train a smaller, faster model on the teacher-labeled data.

Choosing the Student Architecture

| Task                 | Large Model (Teacher)  | Small Model (Student)          | Typical Speedup |
| -------------------- | ---------------------- | ------------------------------ | --------------- |
| Image Classification | ResNet-152 / ViT-Large | MobileNet v2 / EfficientNet-B0 | 10-50x          |
| Object Detection     | YOLOv4-large           | YOLOv4-tiny                    | 5-20x           |
| Text Classification  | BERT-large / LLM       | DistilBERT / TinyBERT          | 5-10x           |
| NER                  | spaCy lg / LLM         | spaCy sm / custom              | 3-10x           |
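
Parameter counts give a rough feel for the size gap between these pairs. The figures below are approximate published numbers, and actual speedup also depends on hardware and runtime, so treat the ratio as a back-of-the-envelope proxy:

```python
# Approximate parameter counts (millions) for common teacher/student pairs.
pairs = {
    "Image Classification": ("ResNet-152", 60.2, "MobileNet v2", 3.5),
    "Text Classification": ("BERT-large", 340.0, "DistilBERT", 66.0),
}

for task, (teacher, t_params, student, s_params) in pairs.items():
    ratio = t_params / s_params
    print(f"{task}: {teacher} ({t_params}M) vs {student} ({s_params}M) "
          f"-> {ratio:.0f}x fewer parameters")
```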

Step 4: Evaluate Both Models

This is the critical step: compare teacher and student on the same human-labeled validation set.

```python
# Create a held-out validation set that was NOT used for training
# This should be human-labeled ground truth

# Get teacher predictions on validation set
teacher_results = client.predict(
    model_id=teacher_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Get student predictions on validation set
student_results = client.predict(
    model_id=student_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Compare accuracy
teacher_correct = sum(1 for r in teacher_results if r.prediction == r.ground_truth)
student_correct = sum(1 for r in student_results if r.prediction == r.ground_truth)

teacher_accuracy = teacher_correct / len(teacher_results)
student_accuracy = student_correct / len(student_results)

print(f"Teacher accuracy: {teacher_accuracy:.1%}")
print(f"Student accuracy: {student_accuracy:.1%}")
print(f"Gap: {(teacher_accuracy - student_accuracy):.1%}")
```

What Results to Expect

| Scenario   | Teacher | Student | Gap | Action                                     |
| ---------- | ------- | ------- | --- | ------------------------------------------ |
| Excellent  | 95%     | 93%     | 2%  | Deploy student                             |
| Good       | 95%     | 90%     | 5%  | Acceptable for most use cases              |
| Needs work | 95%     | 85%     | 10% | Add more training data, try larger student |
| Poor       | 95%     | 75%     | 20% | Task may be too complex for small model    |
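
The table above can be encoded as a simple decision rule. A sketch using the same gap thresholds:

```python
def distillation_verdict(teacher_acc, student_acc):
    """Map the teacher-student accuracy gap onto a suggested action."""
    # Round off float noise (e.g. 0.95 - 0.93 is not exactly 0.02 in floats)
    gap = round(teacher_acc - student_acc, 4)
    if gap <= 0.02:
        return "deploy student"
    if gap <= 0.05:
        return "acceptable for most use cases"
    if gap <= 0.10:
        return "add more training data, try a larger student"
    return "task may be too complex for a small model"

print(distillation_verdict(0.95, 0.93))  # deploy student
print(distillation_verdict(0.95, 0.85))  # add more training data, try a larger student
```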

Per-Class Analysis

Don’t just look at overall accuracy. Check performance per class:

```python
# Per-class breakdown
from collections import defaultdict

class_stats = defaultdict(lambda: {"teacher_correct": 0, "student_correct": 0, "total": 0})

for t_result, s_result in zip(teacher_results, student_results):
    label = t_result.ground_truth
    class_stats[label]["total"] += 1
    if t_result.prediction == label:
        class_stats[label]["teacher_correct"] += 1
    if s_result.prediction == label:
        class_stats[label]["student_correct"] += 1

print(f"{'Class':<20} {'Teacher':<10} {'Student':<10} {'Gap':<10}")
print("-" * 50)
for label, stats in sorted(class_stats.items()):
    t_acc = stats["teacher_correct"] / stats["total"]
    s_acc = stats["student_correct"] / stats["total"]
    gap = t_acc - s_acc
    print(f"{label:<20} {t_acc:<10.1%} {s_acc:<10.1%} {gap:<10.1%}")
```

Step 5: Deploy the Student Model

Once the student meets your quality bar, deploy it:

```python
# Optimize the student model for production
optimized = client.optimize_model(
    model_id=student_model.id,
    target_format="onnx",       # Fast inference format
    quantize=True,              # Reduce model size
    quantize_type="int8"        # 4x smaller, minimal accuracy loss
)

# Deploy to cloud API
deployment = client.deploy_model(
    model_id=optimized.id,
    name="Product Classifier v1",
    replicas=2
)

# Or export for edge/mobile
client.export_model(
    model_id=optimized.id,
    format="onnx",
    output_path="./product_classifier.onnx"
)
```
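
The "4x smaller" figure for int8 follows directly from bytes per weight: fp32 stores 4 bytes per parameter, int8 stores 1. A back-of-the-envelope sketch (EfficientNet-B0's ~5.3M parameters is an approximate published figure):

```python
def model_size_mb(num_params, bytes_per_param):
    """Raw weight storage in megabytes."""
    return num_params * bytes_per_param / 1e6

params = 5.3e6  # approx. EfficientNet-B0 parameter count
fp32 = model_size_mb(params, 4)   # 32-bit floats: 4 bytes each
int8 = model_size_mb(params, 1)   # 8-bit ints: 1 byte each

print(f"fp32: {fp32:.1f} MB, int8: {int8:.1f} MB, ratio: {fp32 / int8:.0f}x")
# fp32: 21.2 MB, int8: 5.3 MB, ratio: 4x
```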

Iterative Distillation

Distillation works best as an iterative process:

```mermaid
graph LR
    A[Cycle 1: 1000 items] --> B[Train Student v1]
    B --> C[Evaluate]
    C --> D[Identify Weak Classes]
    D --> E[Cycle 2: Add 500 items for weak classes]
    E --> F[Train Student v2]
    F --> G[Evaluate]
    G --> H[Deploy or Repeat]
```

```python
# After evaluating, find where the student struggles
weak_classes = []
for label, stats in class_stats.items():
    s_acc = stats["student_correct"] / stats["total"]
    if s_acc < 0.85:  # Below threshold
        weak_classes.append(label)
        print(f"Weak class: {label} ({s_acc:.1%})")

# Upload more examples specifically for weak classes
# Then retrain the student
```
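
To target the next cycle, you can filter a pool of candidate items by the teacher's predicted label so new uploads concentrate on the weak classes. A sketch with hypothetical item IDs:

```python
weak_classes = {"delivery_note", "contract"}  # from the per-class analysis

# (item_id, teacher_predicted_label) for a pool of not-yet-uploaded candidates
candidate_pool = [
    ("doc_101", "invoice"),
    ("doc_102", "contract"),
    ("doc_103", "delivery_note"),
    ("doc_104", "invoice"),
]

# Keep only candidates the teacher assigns to a weak class
targeted = [item for item, label in candidate_pool if label in weak_classes]
print(targeted)  # ['doc_102', 'doc_103']
```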

Real-World Example: Document Classification

Here’s a complete distillation pipeline for classifying scanned documents:

```python
from seeme import Client

client = Client()

# --- Phase 1: Teacher labels data ---

# Use LLM to classify documents (accurate but slow/expensive)
teacher = client.create_post_processor(
    dataset_id=documents_dataset.id,
    name="GPT Document Classifier",
    model_type="llm",
    model_id=llm_model.id,
    prompt="""
    This is a scanned document. Based on the OCR text and visual layout,
    classify it as one of:
    - invoice
    - purchase_order
    - delivery_note
    - contract
    - correspondence

    Return only the category name.
    """,
    output_target="annotations",
    auto_create_labels=True,
    order=2  # After OCR processor
)

# Upload 2000 documents, let teacher label them
# Review ~200 (10%), correct errors

# --- Phase 2: Train student ---

student_job = client.create_job(
    dataset_id=documents_dataset.id,
    version_id=version.id,
    name="Doc Classifier - EfficientNet B0",
    config={
        "architecture": "efficientnet_b0",
        "epochs": 25,
        "image_size": 384,
        "learning_rate": 0.001
    }
)

# --- Phase 3: Compare ---

# Teacher: 94% accuracy, 3.2s per document, $0.01 per call
# Student: 91% accuracy, 12ms per document, ~free after training
# Decision: 3% accuracy gap acceptable, deploy student

# --- Phase 4: Deploy student, retire teacher ---

client.deploy_model(
    model_id=student_model.id,
    name="Document Classifier Production"
)

# Disable the expensive LLM post-processor
client.update_post_processor(
    processor_id=teacher.id,
    enabled=False
)
```
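
The arithmetic behind the "retire the teacher" decision is easy to sketch using the example's numbers ($0.01 per teacher call; the student is roughly free per call once trained). The volumes below are illustrative:

```python
def monthly_savings(docs_per_month, teacher_cost_per_call=0.01,
                    student_cost_per_call=0.0):
    """Cost difference between running the teacher vs. the student."""
    return docs_per_month * (teacher_cost_per_call - student_cost_per_call)

for volume in (10_000, 100_000, 1_000_000):
    print(f"{volume:>9} docs/month -> ${monthly_savings(volume):,.0f} saved")
```

At 100k documents a month, the LLM bill alone pays for a lot of student training runs.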

Best Practices

  1. Use enough training data - The student typically needs 100-500 examples per class
  2. Review teacher labels - Garbage in, garbage out
  3. Choose the right student size - Too small and it can’t learn; too big and you lose the benefit
  4. Always hold out a validation set - Use human-labeled ground truth, not teacher labels
  5. Check per-class performance - Overall accuracy can hide that one class is terrible
  6. Iterate - One round is rarely enough. Add data where the student struggles
  7. Consider the tradeoff - A 3% accuracy drop with 100x speedup is often worth it

Next Step

Combine automated labeling and distillation into a fully automated End-to-End Pipeline.