Production Optimization
Reduce latency, lower costs, and shrink model size for production deployment. Get the most out of your trained models.
Optimization Overview
```mermaid
graph LR
A[Trained Model] --> B[Format Conversion]
B --> C[Quantization]
C --> D[Benchmarking]
D --> E{Target}
E --> F[Cloud API]
E --> G[On-Premise]
E --> H[Edge / Mobile]
```
Model Format Conversion
Convert models to optimized inference formats:
| Format | Use Case | Speedup | Platform Support |
|---|---|---|---|
| ONNX | General production | 2-5x | Cloud, on-prem, edge |
| ONNX + TensorRT | NVIDIA GPU inference | 5-20x | Cloud, on-prem (NVIDIA) |
| CoreML | Apple devices | 3-10x | iOS, macOS |
| TFLite | Android / embedded | 3-10x | Android, edge devices |
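Choosing between these formats can be table-driven. A minimal sketch of that decision, matching the platform column above (`suggest_format` is an illustrative helper, not part of the SDK):

```python
def suggest_format(platform: str, has_nvidia_gpu: bool = False) -> str:
    """Suggest an inference format for a deployment platform."""
    platform = platform.lower()
    if platform in ("ios", "macos"):
        return "coreml"
    if platform in ("android", "embedded"):
        return "tflite"
    # Servers: ONNX everywhere, plus TensorRT on NVIDIA GPUs
    return "onnx+tensorrt" if has_nvidia_gpu else "onnx"

print(suggest_format("ios"))                         # coreml
print(suggest_format("cloud", has_nvidia_gpu=True))  # onnx+tensorrt
```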
Quantization
Reduce model size and increase speed by lowering numerical precision:
| Precision | Size Reduction | Speed Increase | Accuracy Impact |
|---|---|---|---|
| FP32 (default) | 1x (baseline) | 1x (baseline) | None |
| FP16 | 2x smaller | 1.5-2x faster | Negligible |
| INT8 | 4x smaller | 2-4x faster | 0.5-2% drop |
| INT4 | 8x smaller | 3-6x faster | 2-5% drop |
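The size column follows directly from element width: FP32 stores 4 bytes per weight, FP16 2, INT8 1, and INT4 half a byte. A quick sanity check of the arithmetic (weight storage only, ignoring format overhead):

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_mb(num_params: int, precision: str) -> float:
    """Approximate weight storage in MiB, ignoring format overhead."""
    return num_params * BYTES_PER_PARAM[precision] / 2**20

params = 25_600_000  # e.g. ResNet-50
print(f"{model_size_mb(params, 'fp32'):.1f} MB")  # 97.7 MB (table below lists ~98 MB)
print(f"{model_size_mb(params, 'int8'):.1f} MB")  # 24.4 MB -> 4x smaller
```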
```python
# Quantize to INT8 (best balance of size/speed/accuracy)
quantized_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="onnx",
    quantize=True,
    quantize_type="int8",
    config={
        # Calibration dataset for INT8 quantization
        "calibration_dataset_id": dataset.id,
        "calibration_samples": 100
    }
)

print(f"Original: {trained_model.size_mb:.1f} MB")
print(f"Quantized: {quantized_model.size_mb:.1f} MB")
print(f"Reduction: {(1 - quantized_model.size_mb / trained_model.size_mb):.0%}")
```
ℹ️ INT8 quantization requires a calibration dataset: a small sample of representative inputs. The quantization process uses these samples to determine optimal scale factors for each layer.
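What calibration does can be sketched in a few lines: collect the value range over the calibration samples, then derive a scale that maps that range onto the int8 domain. A toy symmetric-quantization sketch, not the platform's actual algorithm:

```python
def calibrate_scale(samples):
    """Symmetric per-tensor scale: map the largest |value| seen to 127."""
    max_abs = max(abs(v) for batch in samples for v in batch)
    return max_abs / 127.0

def quantize(values, scale):
    """Round to int8, clamping to the representable range."""
    return [max(-128, min(127, round(v / scale))) for v in values]

def dequantize(quantized, scale):
    return [v * scale for v in quantized]

calibration = [[0.5, -1.2, 3.0], [2.54, -0.7]]
scale = calibrate_scale(calibration)  # 3.0 / 127
q = quantize([1.5, -3.0], scale)
print(q)  # [64, -127]
print(dequantize(q, scale))
```

Values outside the calibrated range get clamped, which is why the calibration samples must be representative of real inputs.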
When to Use Each Precision
| Scenario | Recommended Precision |
|---|---|
| Cloud API with GPU | FP16 |
| Cloud API without GPU | INT8 |
| On-premise server | FP16 or INT8 |
| Mobile app | INT8 |
| Edge device (IoT) | INT8 or INT4 |
| Research / fine-tuning | FP32 |
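Another way to read the two tables together: fix an accuracy budget, then take the most aggressive precision whose worst-case drop still fits. A sketch using the worst-case figures from the precision table above (the helper name is illustrative):

```python
# Worst-case accuracy drop per precision, in percentage points (from the table above).
WORST_DROP = {"fp16": 0.0, "int8": 2.0, "int4": 5.0}

def most_aggressive_precision(max_tolerable_drop_pct: float) -> str:
    """Smallest precision whose worst-case accuracy drop stays within budget."""
    for precision in ("int4", "int8", "fp16"):
        if WORST_DROP[precision] <= max_tolerable_drop_pct:
            return precision
    return "fp32"

print(most_aggressive_precision(1.0))  # fp16
print(most_aggressive_precision(3.0))  # int8
```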
Architecture Selection
The model architecture has the biggest impact on inference speed. Choose based on your deployment target:
Image Classification
| Architecture | Params | Size | Inference (CPU) | Inference (GPU) | Top-1 Accuracy |
|---|---|---|---|---|---|
| MobileNet v2 | 3.4M | 14 MB | 5 ms | 1 ms | ~72% |
| EfficientNet B0 | 5.3M | 21 MB | 8 ms | 2 ms | ~77% |
| ResNet-18 | 11.7M | 45 MB | 10 ms | 2 ms | ~70% |
| ResNet-50 | 25.6M | 98 MB | 25 ms | 4 ms | ~76% |
| EfficientNet B4 | 19.3M | 75 MB | 40 ms | 6 ms | ~83% |
| ViT-Base | 86.6M | 330 MB | 80 ms | 10 ms | ~85% |
Object Detection
| Architecture | Params | Inference (GPU) | mAP |
|---|---|---|---|
| YOLOv4-tiny | 6M | 3 ms | ~40% |
| YOLOv4 | 64M | 12 ms | ~65% |
| YOLOv4-large | 128M | 25 ms | ~73% |
Quick selection guide:
- Need < 10ms latency: MobileNet, EfficientNet-B0, YOLO-tiny
- Need best accuracy: EfficientNet-B4+, ViT, YOLO-large
- Balanced: EfficientNet-B0/B2, ResNet-50, YOLOv4
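The guidance above amounts to a constrained search over the classification table: take the fastest architecture that satisfies both your latency budget and your accuracy floor. A sketch with the table's CPU figures (the helper is illustrative, not part of the SDK):

```python
# (name, cpu_latency_ms, top1_accuracy_pct) from the classification table above.
CANDIDATES = [
    ("MobileNet v2", 5, 72),
    ("EfficientNet B0", 8, 77),
    ("ResNet-50", 25, 76),
    ("EfficientNet B4", 40, 83),
    ("ViT-Base", 80, 85),
]

def pick_architecture(max_latency_ms, min_accuracy):
    """Fastest architecture meeting both constraints, or None if impossible."""
    viable = [c for c in CANDIDATES
              if c[1] <= max_latency_ms and c[2] >= min_accuracy]
    return min(viable, key=lambda c: c[1])[0] if viable else None

print(pick_architecture(10, 75))  # EfficientNet B0
print(pick_architecture(50, 80))  # EfficientNet B4
```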
Deployment Options
Cloud API
Deploy models as scalable REST endpoints:
```python
# Deploy with auto-scaling
deployment = client.deploy_model(
    model_id=optimized_model.id,
    name="Product Classifier API",
    config={
        "replicas": 2,       # Minimum instances
        "max_replicas": 10,  # Scale up under load
        "gpu": False,        # CPU-only (cheaper)
        "timeout_ms": 5000   # Request timeout
    }
)
print(f"Endpoint: {deployment.url}")

# Use the deployed model
result = client.predict(
    model_id=optimized_model.id,
    item="./test_image.jpg"
)
print(f"Prediction: {result.label} ({result.confidence:.1%})")
```
On-Premise
Deploy within your infrastructure:
```python
# Export model for on-premise deployment
client.export_model(
    model_id=optimized_model.id,
    format="onnx",
    output_path="./models/classifier.onnx"
)

# Deploy on your SeeMe.ai on-premise instance
# Models run entirely within your infrastructure
```
Edge / Mobile
Export optimized models for device deployment:
```python
# For iOS
coreml_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="coreml",
    config={"input_shape": [1, 3, 224, 224]}
)
client.export_model(
    model_id=coreml_model.id,
    format="coreml",
    output_path="./mobile/classifier.mlmodel"
)

# For Android
tflite_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="tflite",
    quantize=True,
    quantize_type="int8"
)
client.export_model(
    model_id=tflite_model.id,
    format="tflite",
    output_path="./mobile/classifier.tflite"
)
```
Benchmarking
Always benchmark before deploying. Compare latency, throughput, and accuracy:
```python
import time

def benchmark_model(client, model_id, test_images, num_runs=100):
    # Warmup
    for img in test_images[:5]:
        client.predict(model_id=model_id, item=img)

    # Timed runs
    start = time.time()
    results = []
    for i in range(num_runs):
        img = test_images[i % len(test_images)]
        result = client.predict(model_id=model_id, item=img)
        results.append(result)
    elapsed = time.time() - start

    avg_latency = (elapsed / num_runs) * 1000  # ms
    throughput = num_runs / elapsed            # predictions/sec
    return avg_latency, throughput, results

# Compare original vs optimized
original_latency, original_throughput, original_results = benchmark_model(
    client, original_model.id, test_images
)
optimized_latency, optimized_throughput, optimized_results = benchmark_model(
    client, optimized_model.id, test_images
)

print(f"{'Metric':<20} {'Original':<15} {'Optimized':<15} {'Improvement':<15}")
print("-" * 65)
print(f"{'Latency (ms)':<20} {original_latency:<15.1f} {optimized_latency:<15.1f} {original_latency/optimized_latency:<15.1f}x")
print(f"{'Throughput (/s)':<20} {original_throughput:<15.1f} {optimized_throughput:<15.1f} {optimized_throughput/original_throughput:<15.1f}x")
```
Batch Inference
For bulk processing, use batch inference instead of single predictions:
```python
# Single inference (slow for large volumes)
for image_path in all_images:
    result = client.predict(model_id=model.id, item=image_path)

# Batch inference (much faster)
results = client.predict_batch(
    model_id=model.id,
    file_paths=all_images,
    batch_size=32  # Process 32 at a time
)
```
Optimization Checklist
Use this checklist when preparing a model for production:
| Step | Action | Impact |
|---|---|---|
| 1 | Choose smallest architecture that meets accuracy requirements | High |
| 2 | Convert to ONNX format | Medium |
| 3 | Apply INT8 quantization | Medium-High |
| 4 | Benchmark latency and throughput | - |
| 5 | Verify accuracy on validation set post-optimization | - |
| 6 | Configure auto-scaling for cloud deployment | Medium |
| 7 | Set up monitoring for inference latency and errors | - |
| 8 | Plan retraining schedule | - |
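Step 5 deserves a concrete shape: besides re-measuring accuracy, compare the optimized model's predictions against the original's on a held-out set. A minimal sketch (the prediction lists stand in for real model outputs):

```python
def agreement_rate(preds_a, preds_b):
    """Fraction of items where both models predict the same label."""
    matches = sum(a == b for a, b in zip(preds_a, preds_b))
    return matches / len(preds_a)

# Hypothetical predictions from the original and the INT8 model on a validation set.
original_preds = ["cat", "dog", "cat", "bird", "dog"]
quantized_preds = ["cat", "dog", "cat", "dog", "dog"]

rate = agreement_rate(original_preds, quantized_preds)
print(f"Agreement: {rate:.0%}")  # Agreement: 80%
if rate < 0.98:
    print("Accuracy check failed: re-calibrate or fall back to FP16")
```

Per-class agreement is worth checking too, since quantization losses often concentrate in a few classes.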
Best Practices
- Measure before optimizing - Profile your model to find the actual bottleneck
- Don’t over-optimize - If 50ms latency is fine, you don’t need to squeeze to 5ms
- Test accuracy after every optimization - Quantization can hurt specific classes
- Use batch inference for offline processing - Single-item inference wastes throughput
- Match the format to the target - ONNX for servers, CoreML for iOS, TFLite for Android
- Consider the full pipeline - Model inference is often not the slowest part (network, preprocessing, postprocessing)
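The last point is easy to verify empirically: time each pipeline stage separately before optimizing any of them. A sketch with stand-in stages (real ones would be image decode, the model call, and label mapping):

```python
import time

def profile_pipeline(stages, item):
    """Run named stages in sequence, timing each; returns {stage: seconds}."""
    timings = {}
    for name, fn in stages:
        start = time.perf_counter()
        item = fn(item)
        timings[name] = time.perf_counter() - start
    return timings

stages = [
    ("preprocess", lambda x: [v / 255 for v in x]),
    ("inference", lambda x: max(range(len(x)), key=x.__getitem__)),
    ("postprocess", lambda x: f"class_{x}"),
]
timings = profile_pipeline(stages, [12, 250, 80])
total = sum(timings.values())
for name, secs in timings.items():
    print(f"{name}: {secs * 1000:.3f} ms ({secs / total:.0%})")
```

If preprocessing dominates, no amount of quantization will move the end-to-end latency much.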