Production Optimization

Reduce latency, lower costs, and shrink model size for production deployment. Get the most out of your trained models.

Optimization Overview

graph LR
    A[Trained Model] --> B[Format Conversion]
    B --> C[Quantization]
    C --> D[Benchmarking]
    D --> E{Target}
    E --> F[Cloud API]
    E --> G[On-Premise]
    E --> H[Edge / Mobile]

Model Format Conversion

Convert models to optimized inference formats:

| Format | Use Case | Speedup | Platform Support |
| --- | --- | --- | --- |
| ONNX | General production | 2-5x | Cloud, on-prem, edge |
| ONNX + TensorRT | NVIDIA GPU inference | 5-20x | Cloud, on-prem (NVIDIA) |
| CoreML | Apple devices | 3-10x | iOS, macOS |
| TFLite | Android / embedded | 3-10x | Android, edge devices |

Quantization

Reduce model size and increase speed by lowering numerical precision:

| Precision | Size Reduction | Speed Increase | Accuracy Impact |
| --- | --- | --- | --- |
| FP32 (default) | 1x (baseline) | 1x (baseline) | None |
| FP16 | 2x smaller | 1.5-2x faster | Negligible |
| INT8 | 4x smaller | 2-4x faster | 0.5-2% drop |
| INT4 | 8x smaller | 3-6x faster | 2-5% drop |

# Quantize to INT8 (best balance of size/speed/accuracy)
quantized_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="onnx",
    quantize=True,
    quantize_type="int8",
    config={
        # Calibration dataset for INT8 quantization
        "calibration_dataset_id": dataset.id,
        "calibration_samples": 100
    }
)

print(f"Original: {trained_model.size_mb:.1f} MB")
print(f"Quantized: {quantized_model.size_mb:.1f} MB")
print(f"Reduction: {(1 - quantized_model.size_mb / trained_model.size_mb):.0%}")

ℹ️ INT8 quantization requires a calibration dataset: a small sample of representative inputs. The quantization process uses these samples to determine optimal scale factors for each layer.
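
A representative calibration set is usually just a reproducible random sample of your training or validation inputs. A minimal sketch (the file paths and helper name here are illustrative, not part of the SDK):

```python
import random

def select_calibration_samples(file_paths, n=100, seed=0):
    """Draw a reproducible random sample of inputs for INT8 calibration."""
    rng = random.Random(seed)
    return rng.sample(file_paths, min(n, len(file_paths)))

# e.g. sample 100 images from a larger pool of candidate inputs
pool = [f"./data/img_{i:04d}.jpg" for i in range(1200)]
calibration = select_calibration_samples(pool, n=100)
print(len(calibration))  # 100
```

Fixing the seed keeps the calibration set stable across runs, so quantization results are comparable between experiments.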

When to Use Each Precision

| Scenario | Recommended Precision |
| --- | --- |
| Cloud API with GPU | FP16 |
| Cloud API without GPU | INT8 |
| On-premise server | FP16 or INT8 |
| Mobile app | INT8 |
| Edge device (IoT) | INT8 or INT4 |
| Research / fine-tuning | FP32 |

Architecture Selection

The model architecture has the biggest impact on inference speed. Choose based on your deployment target:

Image Classification

| Architecture | Params | Size | Inference (CPU) | Inference (GPU) | Top-1 Accuracy |
| --- | --- | --- | --- | --- | --- |
| MobileNet v2 | 3.4M | 14 MB | 5 ms | 1 ms | ~72% |
| EfficientNet B0 | 5.3M | 21 MB | 8 ms | 2 ms | ~77% |
| ResNet-18 | 11.7M | 45 MB | 10 ms | 2 ms | ~70% |
| ResNet-50 | 25.6M | 98 MB | 25 ms | 4 ms | ~76% |
| EfficientNet B4 | 19.3M | 75 MB | 40 ms | 6 ms | ~83% |
| ViT-Base | 86.6M | 330 MB | 80 ms | 10 ms | ~85% |

Object Detection

| Architecture | Params | Inference (GPU) | mAP |
| --- | --- | --- | --- |
| YOLOv4-tiny | 6M | 3 ms | ~40% |
| YOLOv4 | 64M | 12 ms | ~65% |
| YOLOv4-large | 128M | 25 ms | ~73% |

  • Need < 10ms latency: MobileNet, EfficientNet-B0, YOLO-tiny
  • Need best accuracy: EfficientNet-B4+, ViT, YOLO-large
  • Balanced: EfficientNet-B0/B2, ResNet-50, YOLOv4
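
These rules of thumb can be encoded as a small helper that picks the most accurate classifier within a CPU latency budget. A sketch using the numbers from the classification table above (the function name is illustrative):

```python
# (name, CPU latency in ms, top-1 accuracy in %) from the table above
ARCHITECTURES = [
    ("MobileNet v2", 5, 72),
    ("EfficientNet B0", 8, 77),
    ("ResNet-18", 10, 70),
    ("ResNet-50", 25, 76),
    ("EfficientNet B4", 40, 83),
    ("ViT-Base", 80, 85),
]

def pick_architecture(latency_budget_ms, candidates=ARCHITECTURES):
    """Return the most accurate architecture that fits the latency budget."""
    eligible = [a for a in candidates if a[1] <= latency_budget_ms]
    if not eligible:
        raise ValueError(f"No architecture meets a {latency_budget_ms} ms budget")
    return max(eligible, key=lambda a: a[2])

print(pick_architecture(10))  # EfficientNet B0: best accuracy within 10 ms
```

Note that parameter count alone is a poor predictor: ResNet-18 is larger than EfficientNet B0 yet both faster options beat it on accuracy.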

Deployment Options

Cloud API

Deploy models as scalable REST endpoints:

# Deploy with auto-scaling
deployment = client.deploy_model(
    model_id=optimized_model.id,
    name="Product Classifier API",
    config={
        "replicas": 2,           # Minimum instances
        "max_replicas": 10,      # Scale up under load
        "gpu": False,            # CPU-only (cheaper)
        "timeout_ms": 5000       # Request timeout
    }
)

print(f"Endpoint: {deployment.url}")

# Use the deployed model
result = client.predict(
    model_id=optimized_model.id,
    item="./test_image.jpg"
)
print(f"Prediction: {result.label} ({result.confidence:.1%})")

On-Premise

Deploy within your infrastructure:

# Export model for on-premise deployment
client.export_model(
    model_id=optimized_model.id,
    format="onnx",
    output_path="./models/classifier.onnx"
)

# Deploy on your SeeMe.ai on-premise instance
# Models run entirely within your infrastructure

Edge / Mobile

Export optimized models for device deployment:

# For iOS
coreml_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="coreml",
    config={"input_shape": [1, 3, 224, 224]}
)

client.export_model(
    model_id=coreml_model.id,
    format="coreml",
    output_path="./mobile/classifier.mlmodel"
)

# For Android
tflite_model = client.optimize_model(
    model_id=trained_model.id,
    target_format="tflite",
    quantize=True,
    quantize_type="int8"
)

client.export_model(
    model_id=tflite_model.id,
    format="tflite",
    output_path="./mobile/classifier.tflite"
)

Benchmarking

Always benchmark before deploying. Compare latency, throughput, and accuracy:

import time

# Benchmark function
def benchmark_model(client, model_id, test_images, num_runs=100):
    # Warmup
    for img in test_images[:5]:
        client.predict(model_id=model_id, item=img)

    # Timed runs
    start = time.time()
    results = []
    for i in range(num_runs):
        img = test_images[i % len(test_images)]
        result = client.predict(model_id=model_id, item=img)
        results.append(result)
    elapsed = time.time() - start

    avg_latency = (elapsed / num_runs) * 1000  # ms
    throughput = num_runs / elapsed  # predictions/sec

    return avg_latency, throughput, results

# Compare original vs optimized
original_latency, original_throughput, original_results = benchmark_model(
    client, original_model.id, test_images
)

optimized_latency, optimized_throughput, optimized_results = benchmark_model(
    client, optimized_model.id, test_images
)

print(f"{'Metric':<20} {'Original':<15} {'Optimized':<15} {'Improvement':<15}")
print("-" * 65)
print(f"{'Latency (ms)':<20} {original_latency:<15.1f} {optimized_latency:<15.1f} {original_latency/optimized_latency:<15.1f}x")
print(f"{'Throughput (/s)':<20} {original_throughput:<15.1f} {optimized_throughput:<15.1f} {optimized_throughput/original_throughput:<15.1f}x")
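
Average latency can hide tail behavior: a few slow requests may dominate user experience even when the mean looks fine. If you record per-request timings in the loop above, percentiles are easy to compute (the timings below are illustrative):

```python
import math

def latency_percentiles(latencies_ms, percentiles=(50, 95, 99)):
    """Compute nearest-rank percentiles from per-request timings (ms)."""
    ordered = sorted(latencies_ms)
    results = {}
    for p in percentiles:
        # Nearest-rank: smallest value covering at least p% of requests
        rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
        results[f"p{p}"] = ordered[rank]
    return results

latencies = [12.1, 12.4, 12.2, 13.0, 12.3, 45.7, 12.5, 12.2, 12.8, 12.4]
print(latency_percentiles(latencies))
```

Here a single 45.7 ms outlier barely moves the mean but shows up directly in p95/p99, which is what a latency SLO should be written against.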

Batch Inference

For bulk processing, use batch inference instead of single predictions:

# Single inference (slow for large volumes)
for image_path in all_images:
    result = client.predict(model_id=model.id, item=image_path)

# Batch inference (much faster)
results = client.predict_batch(
    model_id=model.id,
    file_paths=all_images,
    batch_size=32  # Process 32 at a time
)
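
If you need to control batching yourself, for example to stream results back or bound memory on very large jobs, a generic chunking helper works with any predict call (this helper is a sketch, not part of the SDK):

```python
def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

batches = list(chunked(list(range(10)), 4))
print([len(b) for b in batches])  # [4, 4, 2]
```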

Optimization Checklist

Use this checklist when preparing a model for production:

| Step | Action | Impact |
| --- | --- | --- |
| 1 | Choose smallest architecture that meets accuracy requirements | High |
| 2 | Convert to ONNX format | Medium |
| 3 | Apply INT8 quantization | Medium-High |
| 4 | Benchmark latency and throughput | - |
| 5 | Verify accuracy on validation set post-optimization | - |
| 6 | Configure auto-scaling for cloud deployment | Medium |
| 7 | Set up monitoring for inference latency and errors | - |
| 8 | Plan retraining schedule | - |
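
For step 5, beyond re-running your validation metric, it is worth checking where the optimized model's predictions diverge from the original's. A minimal sketch (labels below are illustrative):

```python
def agreement_rate(original_labels, optimized_labels):
    """Fraction of validation items where both models predict the same label."""
    assert len(original_labels) == len(optimized_labels)
    matches = sum(a == b for a, b in zip(original_labels, optimized_labels))
    return matches / len(original_labels)

orig = ["cat", "dog", "cat", "bird", "dog"]
quant = ["cat", "dog", "cat", "dog", "dog"]
print(f"{agreement_rate(orig, quant):.0%}")  # 80%
```

A high overall agreement can still hide a large drop on one class, so break the comparison down per class before shipping a quantized model.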

Best Practices

  1. Measure before optimizing - Profile your model to find the actual bottleneck
  2. Don’t over-optimize - If 50ms latency is fine, you don’t need to squeeze to 5ms
  3. Test accuracy after every optimization - Quantization can hurt specific classes
  4. Use batch inference for offline processing - Single-item inference wastes throughput
  5. Match the format to the target - ONNX for servers, CoreML for iOS, TFLite for Android
  6. Consider the full pipeline - Model inference is often not the slowest part (network, preprocessing, postprocessing)
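
For practice 6, timing each stage separately is the quickest way to find the real bottleneck. A minimal sketch with stand-in stages (replace `preprocess` and `infer` with your real pipeline steps):

```python
import time

def time_stage(fn, *args):
    """Run a single pipeline stage and return (result, elapsed_ms)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000

# Illustrative stand-ins for real preprocessing and inference
def preprocess(pixels):
    return [v / 255 for v in pixels]

def infer(features):
    return max(range(len(features)), key=lambda i: features[i])

raw = list(range(256))
batch, t_pre = time_stage(preprocess, raw)
label, t_inf = time_stage(infer, batch)
print(f"preprocess: {t_pre:.2f} ms, inference: {t_inf:.2f} ms")
```

If preprocessing or network time dominates, quantizing the model further will not move end-to-end latency.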

Related Topics