Production Optimization
Reduce latency, lower costs, and shrink model size for production deployment. Get the most out of your trained models.
Optimization Overview
```mermaid
graph LR
A[Trained Model] --> B[Format Conversion]
B --> C[Quantization]
C --> D[Benchmarking]
D --> E{Target}
E --> F[Cloud API]
E --> G[On-Premise]
E --> H[Edge / Mobile]
```
Model Format Conversion
Convert models to optimized inference formats:
| Format | Use Case | Speedup | Platform Support |
|---|---|---|---|
| ONNX | General production | 2-5x | Cloud, on-prem, edge |
| ONNX + TensorRT | NVIDIA GPU inference | 5-20x | Cloud, on-prem (NVIDIA) |
| CoreML | Apple devices | 3-10x | iOS, macOS |
| TFLite | Android / embedded | 3-10x | Android, edge devices |
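Choosing between these formats can be table-driven. A minimal sketch of that decision, matching the platform column above (`suggest_format` is an illustrative helper, not part of the SDK):

```python
def suggest_format(platform: str, has_nvidia_gpu: bool = False) -> str:
    """Suggest an inference format for a deployment platform."""
    platform = platform.lower()
    if platform in ("ios", "macos"):
        return "coreml"
    if platform in ("android", "embedded"):
        return "tflite"
    # Servers: ONNX everywhere, plus TensorRT on NVIDIA GPUs
    return "onnx+tensorrt" if has_nvidia_gpu else "onnx"

print(suggest_format("ios"))                         # coreml
print(suggest_format("cloud", has_nvidia_gpu=True))  # onnx+tensorrt
```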
Quantization
Reduce model size and increase speed by lowering numerical precision:
| Precision | Size Reduction | Speed Increase | Accuracy Impact |
|---|---|---|---|
| FP32 (default) | 1x (baseline) | 1x (baseline) | None |
| FP16 | 2x smaller | 1.5-2x faster | Negligible |
| INT8 | 4x smaller | 2-4x faster | 0.5-2% drop |
| INT4 | 8x smaller | 3-6x faster | 2-5% drop |
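The size column follows directly from element width: FP32 stores 4 bytes per weight, FP16 2, INT8 1, and INT4 half a byte. A quick sanity check of the arithmetic (weight storage only, ignoring format overhead):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in MiB, ignoring format overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 2**20

params = 25_600_000  # e.g. ResNet-50
print(f"{model_size_mb(params, 'fp32'):.1f} MB")  # 97.7 MB (table below lists ~98 MB)
print(f"{model_size_mb(params, 'int8'):.1f} MB")  # 24.4 MB -> 4x smaller
```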
```python
# Quantize to INT8 (best balance of size/speed/accuracy)
quantized_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="onnx",
    quantize=True,
    quantize_type="int8",
    config={
        # Calibration dataset for INT8 quantization
        "calibration_dataset_id": dataset.id,
        "calibration_samples": 100
    }
)

print(f"Original: {trained_model.size_mb:.1f} MB")
print(f"Quantized: {quantized_model.size_mb:.1f} MB")
print(f"Reduction: {(1 - quantized_model.size_mb / trained_model.size_mb):.0%}")
```
ℹ️ INT8 quantization requires a calibration dataset: a small sample of representative inputs. The quantization process uses these samples to determine optimal scale factors for each layer.
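What calibration does can be sketched in a few lines: collect the value range over the calibration samples, then derive a scale that maps that range onto the int8 domain. A toy symmetric-quantization sketch, not the platform's actual algorithm:

```python
def calibrate_scale(samples):
    """Symmetric per-tensor scale: map the largest |value| seen to 127."""
    max_abs = max(abs(v) for batch in samples for v in batch)
    return max_abs / 127.0

def quantize(values, scale):
    """Round to int8, clamping to the representable range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    return [v * scale for v in quantized]

calibration = [[0.5, -1.2, 3.0], [2.54, -0.7]]
scale = calibrate_scale(calibration)  # 3.0 / 127
q = quantize([1.5, -3.0], scale)
print(q)  # [64, -127]
print(dequantize(q, scale))
```

Values outside the calibrated range get clamped, which is why the calibration samples must be representative of real inputs.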
When to Use Each Precision
| Scenario | Recommended Precision |
|---|---|
| Cloud API with GPU | FP16 |
| Cloud API without GPU | INT8 |
| On-premise server | FP16 or INT8 |
| Mobile app | INT8 |
| Edge device (IoT) | INT8 or INT4 |
| Research / fine-tuning | FP32 |
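Another way to read the two tables together: fix an accuracy budget, then take the most aggressive precision whose worst-case drop still fits. A sketch using the worst-case figures from the precision table above (the helper name is illustrative):

```python
# Worst-case accuracy drop per precision, in percentage points (from the table above).
WORST_DROP = {"fp16": 0.0, "int8": 2.0, "int4": 5.0}

def most_aggressive_precision(max_tolerable_drop_pct: float) -> str:
    """Smallest precision whose worst-case accuracy drop stays within budget."""
    for precision in ("int4", "int8", "fp16"):
        if WORST_DROP[precision] <= max_tolerable_drop_pct:
            return precision
    return "fp32"

print(most_aggressive_precision(1.0))  # fp16
print(most_aggressive_precision(3.0))  # int8
```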
Architecture Selection
The model architecture has the biggest impact on inference speed. Choose based on your deployment target:
Image Classification
| Architecture | Params | Size | Inference (CPU) | Inference (GPU) | Top-1 Accuracy |
|---|---|---|---|---|---|
| MobileNet v2 | 3.4M | 14 MB | 5 ms | 1 ms | ~72% |
| EfficientNet B0 | 5.3M | 21 MB | 8 ms | 2 ms | ~77% |
| ResNet-18 | 11.7M | 45 MB | 10 ms | 2 ms | ~70% |
| ResNet-50 | 25.6M | 98 MB | 25 ms | 4 ms | ~76% |
| EfficientNet B4 | 19.3M | 75 MB | 40 ms | 6 ms | ~83% |
| ViT-Base | 86.6M | 330 MB | 80 ms | 10 ms | ~85% |
Object Detection
| Architecture | Params | Inference (GPU) | mAP |
|---|---|---|---|
| YOLOv4-tiny | 6M | 3 ms | ~40% |
| YOLOv4 | 64M | 12 ms | ~65% |
| YOLOv4-large | 128M | 25 ms | ~73% |
Quick selection guide:
- Need < 10ms latency: MobileNet, EfficientNet-B0, YOLO-tiny
- Need best accuracy: EfficientNet-B4+, ViT, YOLO-large
- Balanced: EfficientNet-B0/B2, ResNet-50, YOLOv4
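The guidance above amounts to a constrained search over the classification table: take the fastest architecture that satisfies both your latency budget and your accuracy floor. A sketch with the table's CPU figures (the helper is illustrative, not part of the SDK):

```python
# (name, cpu_latency_ms, top1_accuracy_pct) from the classification table above.
CANDIDATES = [
    ("MobileNet v2", 5, 72),
    ("EfficientNet B0", 8, 77),
    ("ResNet-50", 25, 76),
    ("EfficientNet B4", 40, 83),
    ("ViT-Base", 80, 85),
]

def pick_architecture(max_latency_ms, min_accuracy):
    """Fastest architecture meeting both constraints, or None if impossible."""
    viable = [c for c in CANDIDATES
              if c[1] <= max_latency_ms and c[2] >= min_accuracy]
    return min(viable, key=lambda c: c[1])[0] if viable else None

print(pick_architecture(10, 75))  # EfficientNet B0
print(pick_architecture(50, 80))  # EfficientNet B4
```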
Deployment Options
Cloud API
Deploy models as scalable REST endpoints:
```python
# Deploy with auto-scaling
deployment = client.deploy_model(
    model_id=optimized_model.id,
    name="Product Classifier API",
    config={
        "replicas": 2,       # Minimum instances
        "max_replicas": 10,  # Scale up under load
        "gpu": False,        # CPU-only (cheaper)
        "timeout_ms": 5000   # Request timeout
    }
)
print(f"Endpoint: {deployment.url}")

# Use the deployed model
result = client.predict(
    model_id=optimized_model.id,
    item="./test_image.jpg"
)
print(f"Prediction: {result.label} ({result.confidence:.1%})")
```
On-Premise
Deploy within your infrastructure:
```python
# Export model for on-premise deployment
client.export_model(
    model_id=optimized_model.id,
    format="onnx",
    output_path="./models/classifier.onnx"
)

# Deploy on your SeeMe.ai on-premise instance
# Models run entirely within your infrastructure
```
Edge / Mobile
Export optimized models for device deployment:
```python
# For iOS
coreml_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="coreml",
    config={"input_shape": [1, 3, 224, 224]}
)
client.export_model(
    model_id=coreml_model.id,
    format="coreml",
    output_path="./mobile/classifier.mlmodel"
)

# For Android
tflite_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="tflite",
    quantize=True,
    quantize_type="int8"
)
client.export_model(
    model_id=tflite_model.id,
    format="tflite",
    output_path="./mobile/classifier.tflite"
)
```
Benchmarking
Always benchmark before deploying. Compare latency, throughput, and accuracy:
```python
import time

def benchmark_model(client, model_id, test_images, num_runs=100):
    # Warmup
    for img in test_images[:5]:
        client.predict(model_id=model_id, item=img)

    # Timed runs
    start = time.time()
    results = []
    for i in range(num_runs):
        img = test_images[i % len(test_images)]
        result = client.predict(model_id=model_id, item=img)
        results.append(result)
    elapsed = time.time() - start

    avg_latency = (elapsed / num_runs) * 1000  # ms
    throughput = num_runs / elapsed            # predictions/sec
    return avg_latency, throughput, results

# Compare original vs optimized
original_latency, original_throughput, original_results = benchmark_model(
    client, original_model.id, test_images
)
optimized_latency, optimized_throughput, optimized_results = benchmark_model(
    client, optimized_model.id, test_images
)

print(f"{'Metric':<20} {'Original':<15} {'Optimized':<15} {'Improvement':<15}")
print("-" * 65)
print(f"{'Latency (ms)':<20} {original_latency:<15.1f} {optimized_latency:<15.1f} {original_latency/optimized_latency:<15.1f}x")
print(f"{'Throughput (/s)':<20} {original_throughput:<15.1f} {optimized_throughput:<15.1f} {optimized_throughput/original_throughput:<15.1f}x")
```
Batch Inference
For bulk processing, use batch inference instead of single predictions:
```python
# Single inference (slow for large volumes)
for image_path in all_images:
    result = client.predict(model_id=model.id, item=image_path)

# Batch inference (much faster)
results = client.predict_batch(
    model_id=model.id,
    file_paths=all_images,
    batch_size=32  # Process 32 at a time
)
```
Optimization Checklist
Use this checklist when preparing a model for production:
| Step | Action | Impact |
|---|---|---|
| 1 | Choose smallest architecture that meets accuracy requirements | High |
| 2 | Convert to ONNX format | Medium |
| 3 | Apply INT8 quantization | Medium-High |
| 4 | Benchmark latency and throughput | - |
| 5 | Verify accuracy on validation set post-optimization | - |
| 6 | Configure auto-scaling for cloud deployment | Medium |
| 7 | Set up monitoring for inference latency and errors | - |
| 8 | Plan retraining schedule | - |
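Step 5 deserves a concrete shape: besides re-measuring accuracy, compare the optimized model's predictions against the original's on a held-out set. A minimal sketch (the prediction lists stand in for real model outputs):

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of items where both models predict the same label."""
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

# Hypothetical predictions from the original and the INT8 model on a validation set.
original_preds = ["cat", "dog", "cat", "bird", "dog"]
quantized_preds = ["cat", "dog", "cat", "dog", "dog"]

rate = agreement_rate(original_preds, quantized_preds)
print(f"Agreement: {rate:.0%}")  # Agreement: 80%
if rate < 0.98:
    print("Accuracy check failed: re-calibrate or fall back to FP16")
```

Per-class agreement is worth checking too, since quantization losses often concentrate in a few classes.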
Best Practices
- Measure before optimizing - Profile your model to find the actual bottleneck
- Don’t over-optimize - If 50ms latency is fine, you don’t need to squeeze to 5ms
- Test accuracy after every optimization - Quantization can hurt specific classes
- Use batch inference for offline processing - Single-item inference wastes throughput
- Match the format to the target - ONNX for servers, CoreML for iOS, TFLite for Android
- Consider the full pipeline - Model inference is often not the slowest part (network, preprocessing, postprocessing)
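The last point is easy to verify empirically: time each pipeline stage separately before optimizing any of them. A sketch with stand-in stages (real ones would be image decode, the model call, and label mapping):

```python
import time

def profile_pipeline(stages, item):
    """Run named stages in sequence, timing each; returns {stage: seconds}."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        item = fn(item)
        timings[name] = time.perf_counter() - start
    return timings

stages = [
    ("preprocess", lambda x: [v / 255 for v in x]),
    ("inference", lambda x: max(range(len(x)), key=x.__getitem__)),
    ("postprocess", lambda x: f"class_{x}"),
]
timings = profile_pipeline(stages, [12, 250, 80])
total = sum(timings.values())
for name, secs in timings.items():
    print(f"{name}: {secs * 1000:.3f} ms ({secs / total:.0%})")
```

If preprocessing dominates, no amount of quantization will move the end-to-end latency much.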