# Model Distillation
Use a large, powerful model to create training data, then train a smaller model that runs faster and costs less while coming close to the large model's quality on your specific task.
## Why Distill?
Large models (LLMs, large vision models) are accurate but expensive to run. Smaller, specialized models can match their accuracy on specific tasks at a fraction of the cost.
| | Large Model (Teacher) | Small Model (Student) |
|---|---|---|
| Accuracy | High across many tasks | High on your specific task |
| Latency | 1-10 seconds | 5-50 milliseconds |
| Cost per prediction | $$$ | $ |
| Runs on edge/mobile | No | Yes |
| Requires GPU | Usually | Often not |
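The cost row is often the deciding factor. As a back-of-envelope check, a tiny helper (plain Python; the dollar figures in the example are assumptions, not platform pricing) tells you how many predictions it takes for the student to pay back its training cost:

```python
import math

def breakeven_calls(teacher_cost_per_call: float,
                    training_cost: float,
                    student_cost_per_call: float = 0.0) -> int:
    """Number of predictions after which running the student is cheaper overall.

    Assumes the student's per-call cost is (near) zero once trained,
    which is typical for small models on CPU.
    """
    saving_per_call = teacher_cost_per_call - student_cost_per_call
    if saving_per_call <= 0:
        raise ValueError("Student must be cheaper per call than the teacher")
    # round() guards against float noise before taking the ceiling
    return math.ceil(round(training_cost / saving_per_call, 6))

# Example: $0.01 per teacher call, $50 of training compute
print(breakeven_calls(0.01, 50.0))  # 5000
```

At any realistic volume the payback point arrives quickly, which is why the latency gain, not the cost, is usually the harder constraint to verify.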
## The Process
```mermaid
graph TD
    A[1. Large Model Labels Data] --> B[2. Human Reviews Labels]
    B --> C[3. Train Small Model]
    C --> D[4. Evaluate Both Models]
    D --> E{Student matches teacher?}
    E -->|Yes| F[5. Deploy Small Model]
    E -->|No| G[Add More Training Data]
    G --> A
```

### Step 1: Generate Labels with the Large Model
Use a post-processor to label your data with a large model. This is the same process described in Automated Labeling.
### Step 2: Review the Generated Labels
After the teacher model labels your data, review a sample to ensure quality:
```python
# Check label distribution
items = client.get_dataset_items(
    version_id=version.id,
    split_id=split.id
)

label_counts = {}
for item in items:
    annotations = client.get_annotations(item_id=item.id)
    for ann in annotations:
        label_counts[ann.label] = label_counts.get(ann.label, 0) + 1

print("Label distribution:")
for label, count in sorted(label_counts.items(), key=lambda x: -x[1]):
    print(f"  {label}: {count}")
```

Use the web platform annotation interface for efficient review: items are pre-labeled, so you only need to fix errors rather than label from scratch.
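For larger datasets you may only review a sample. A small per-class sampler (plain Python, independent of the SDK; the 10% rate and the 5-item floor are assumptions you should tune) keeps rare labels represented in the review batch:

```python
import random
from collections import defaultdict

def sample_for_review(labeled_items, rate=0.10, min_per_class=5, seed=0):
    """Pick a per-class sample of (item_id, label) pairs for human review.

    labeled_items: iterable of (item_id, label) pairs from the teacher.
    Sampling per class rather than uniformly ensures rare labels
    still show up in the review batch.
    """
    by_label = defaultdict(list)
    for item_id, label in labeled_items:
        by_label[label].append(item_id)

    rng = random.Random(seed)
    review = []
    for label, ids in by_label.items():
        k = max(min_per_class, int(len(ids) * rate))
        k = min(k, len(ids))  # can't sample more than we have
        review.extend((i, label) for i in rng.sample(ids, k))
    return review
```

A uniform 10% sample of a skewed dataset can miss a rare class entirely; sampling per class avoids shipping a student that was trained on unreviewed labels for exactly the classes where the teacher is least reliable.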
### Step 3: Train the Small Model
Now train a smaller, faster model on the teacher-labeled data:
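The training call itself can be sketched as follows; the `client.create_job` usage and config keys mirror the Real-World Example later on this page, so treat them as illustrative rather than a complete schema:

```python
# Illustrative student training config; keys mirror the document-classification
# example later on this page. Pick the architecture from the table below.
student_config = {
    "architecture": "efficientnet_b0",  # a small student backbone
    "epochs": 25,
    "image_size": 384,
    "learning_rate": 0.001,
}

# Submitted via the platform client, e.g.:
# student_job = client.create_job(
#     dataset_id=dataset.id,
#     version_id=version.id,
#     name="Student Classifier",
#     config=student_config,
# )
```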
#### Choosing the Student Architecture
| Task | Large Model (Teacher) | Small Model (Student) | Typical Speedup |
|---|---|---|---|
| Image Classification | ResNet-152 / ViT-Large | MobileNet v2 / EfficientNet-B0 | 10-50x |
| Object Detection | YOLOv4-large | YOLOv4-tiny | 5-20x |
| Text Classification | BERT-large / LLM | DistilBERT / TinyBERT | 5-10x |
| NER | spaCy lg / LLM | spaCy sm / custom | 3-10x |
### Step 4: Evaluate Both Models
The critical step—compare teacher and student on the same validation set:
```python
# Create a held-out validation set that was NOT used for training.
# This should be human-labeled ground truth.

# Get teacher predictions on the validation set
teacher_results = client.predict(
    model_id=teacher_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Get student predictions on the validation set
student_results = client.predict(
    model_id=student_model.id,
    dataset_id=validation_dataset.id,
    version_id=validation_version.id
)

# Compare accuracy
teacher_correct = sum(1 for r in teacher_results if r.prediction == r.ground_truth)
student_correct = sum(1 for r in student_results if r.prediction == r.ground_truth)

teacher_accuracy = teacher_correct / len(teacher_results)
student_accuracy = student_correct / len(student_results)

print(f"Teacher accuracy: {teacher_accuracy:.1%}")
print(f"Student accuracy: {student_accuracy:.1%}")
print(f"Gap: {(teacher_accuracy - student_accuracy):.1%}")
```

#### What Results to Expect
| Scenario | Teacher | Student | Gap | Action |
|---|---|---|---|---|
| Excellent | 95% | 93% | 2% | Deploy student |
| Good | 95% | 90% | 5% | Acceptable for most use cases |
| Needs work | 95% | 85% | 10% | Add more training data, try larger student |
| Poor | 95% | 75% | 20% | Task may be too complex for small model |
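The thresholds in this table can be encoded directly. A sketch of the decision rule (plain Python; the cutoffs are taken from the rows above and should be adjusted to your quality bar):

```python
def distillation_verdict(teacher_acc: float, student_acc: float) -> str:
    """Map the teacher-student accuracy gap to the actions in the table above."""
    # round() keeps float noise (e.g. 0.95 - 0.93 != exactly 0.02) out of the comparison
    gap = round(teacher_acc - student_acc, 4)
    if gap <= 0.02:
        return "deploy student"
    if gap <= 0.05:
        return "acceptable for most use cases"
    if gap <= 0.10:
        return "add more training data, try larger student"
    return "task may be too complex for small model"

print(distillation_verdict(0.95, 0.91))  # acceptable for most use cases
```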
#### Per-Class Analysis
Don’t just look at overall accuracy. Check performance per class:
```python
from collections import defaultdict

# Per-class breakdown
class_stats = defaultdict(lambda: {"teacher_correct": 0, "student_correct": 0, "total": 0})

for t_result, s_result in zip(teacher_results, student_results):
    label = t_result.ground_truth
    class_stats[label]["total"] += 1
    if t_result.prediction == label:
        class_stats[label]["teacher_correct"] += 1
    if s_result.prediction == label:
        class_stats[label]["student_correct"] += 1

print(f"{'Class':<20} {'Teacher':<10} {'Student':<10} {'Gap':<10}")
print("-" * 50)
for label, stats in sorted(class_stats.items()):
    t_acc = stats["teacher_correct"] / stats["total"]
    s_acc = stats["student_correct"] / stats["total"]
    gap = t_acc - s_acc
    print(f"{label:<20} {t_acc:<10.1%} {s_acc:<10.1%} {gap:<10.1%}")
```

### Step 5: Deploy the Student Model
Once the student meets your quality bar, deploy it:
```python
# Optimize the student model for production
optimized = client.optimize_model(
    model_id=student_model.id,
    target_format="onnx",  # Fast inference format
    quantize=True,         # Reduce model size
    quantize_type="int8"   # 4x smaller, minimal accuracy loss
)

# Deploy to cloud API
deployment = client.deploy_model(
    model_id=optimized.id,
    name="Product Classifier v1",
    replicas=2
)

# Or export for edge/mobile
client.export_model(
    model_id=optimized.id,
    format="onnx",
    output_path="./product_classifier.onnx"
)
```

## Iterative Distillation
Distillation works best as an iterative process:
```mermaid
graph LR
    A[Cycle 1: 1000 items] --> B[Train Student v1]
    B --> C[Evaluate]
    C --> D[Identify Weak Classes]
    D --> E[Cycle 2: Add 500 items for weak classes]
    E --> F[Train Student v2]
    F --> G[Evaluate]
    G --> H[Deploy or Repeat]
```

```python
# After evaluating, find where the student struggles
weak_classes = []
for label, stats in class_stats.items():
    s_acc = stats["student_correct"] / stats["total"]
    if s_acc < 0.85:  # Below threshold
        weak_classes.append(label)
        print(f"Weak class: {label} ({s_acc:.1%})")

# Upload more examples specifically for the weak classes,
# then retrain the student.
```

## Real-World Example: Document Classification
Here’s a complete distillation pipeline for classifying scanned documents:
```python
from seeme import Client

client = Client()

# --- Phase 1: Teacher labels data ---
# Use an LLM to classify documents (accurate but slow/expensive)
teacher = client.create_post_processor(
    dataset_id=documents_dataset.id,
    name="GPT Document Classifier",
    model_type="llm",
    model_id=llm_model.id,
    prompt="""
    This is a scanned document. Based on the OCR text and visual layout,
    classify it as one of:
    - invoice
    - purchase_order
    - delivery_note
    - contract
    - correspondence
    Return only the category name.
    """,
    output_target="annotations",
    auto_create_labels=True,
    order=2  # After OCR processor
)

# Upload 2000 documents, let the teacher label them.
# Review ~200 (10%), correct errors.

# --- Phase 2: Train student ---
student_job = client.create_job(
    dataset_id=documents_dataset.id,
    version_id=version.id,
    name="Doc Classifier - EfficientNet B0",
    config={
        "architecture": "efficientnet_b0",
        "epochs": 25,
        "image_size": 384,
        "learning_rate": 0.001
    }
)

# --- Phase 3: Compare ---
# Teacher: 94% accuracy, 3.2s per document, $0.01 per call
# Student: 91% accuracy, 12ms per document, ~free after training
# Decision: 3% accuracy gap acceptable, deploy student

# --- Phase 4: Deploy the student, retire the teacher ---
# (student_model is the trained model produced by student_job)
client.deploy_model(
    model_id=student_model.id,
    name="Document Classifier Production"
)

# Disable the expensive LLM post-processor
client.update_post_processor(
    processor_id=teacher.id,
    enabled=False
)
```

## Best Practices
- Use enough training data - The student needs at least 100-500 examples per class
- Review teacher labels - Garbage in, garbage out
- Choose the right student size - Too small and it can’t learn; too big and you lose the benefit
- Always hold out a validation set - Use human-labeled ground truth, not teacher labels
- Check per-class performance - Overall accuracy can hide that one class is terrible
- Iterate - One round is rarely enough. Add data where the student struggles
- Consider the tradeoff - A 3% accuracy drop with 100x speedup is often worth it
## Next Step
Combine automated labeling and distillation into a fully automated End-to-End Pipeline.