Optimize Your Model

After training, optimize your model for different deployment targets. This section covers format conversion, quantization, and performance tuning.

Why Optimize?

| Goal | Optimization |
|---|---|
| Run on mobile | Convert to Core ML / TFLite |
| Reduce inference time | Quantization |
| Use with NVIDIA GPUs | TensorRT optimization |
| Interoperability | ONNX export |
| Reduce model size | Pruning, quantization |

Export Formats

SeeMe.ai supports exporting to multiple formats:

```mermaid
graph TD
    A[Trained Model] --> B[PyTorch]
    A --> C[ONNX]
    A --> D[Core ML]
    A --> E[TensorFlow Lite]
    A --> F[TensorRT]
    C --> G[ONNX Runtime]
    D --> H[iOS/macOS]
    E --> I[Android/Edge]
    F --> J[NVIDIA GPUs]
```

Using the Web Platform

Using the Python SDK

Format Comparison

| Format | Platform | Pros | Cons |
|---|---|---|---|
| PyTorch | Python/Server | Full features, easy debugging | Large, Python-only |
| ONNX | Cross-platform | Portable, wide support | Some ops not supported |
| Core ML | iOS/macOS | Apple-optimized, Neural Engine | Apple-only |
| TFLite | Android/Edge | Small, efficient | Limited ops |
| TensorRT | NVIDIA GPUs | Fastest on NVIDIA | NVIDIA-only |

Choosing the Right Format

```mermaid
graph TD
    A{Where will you deploy?} --> B[iOS/macOS]
    A --> C[Android]
    A --> D[Web/Server]
    A --> E[NVIDIA GPU]
    A --> F[Edge Device]

    B --> G[Core ML]
    C --> H[TensorFlow Lite]
    D --> I[ONNX]
    E --> J[TensorRT]
    F --> H
```
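The decision tree above can also be expressed as a small helper. This is an illustrative mapping only (`pick_format` is a hypothetical function, not part of the SDK):

```python
def pick_format(target: str) -> str:
    """Map a deployment target to the suggested export format."""
    table = {
        "ios": "Core ML",
        "macos": "Core ML",
        "android": "TensorFlow Lite",
        "edge": "TensorFlow Lite",
        "web": "ONNX",
        "server": "ONNX",
        "nvidia-gpu": "TensorRT",
    }
    return table[target.lower()]

print(pick_format("android"))  # TensorFlow Lite
```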

Quantization Deep Dive

What is Quantization?

Quantization converts model weights from 32-bit floats to smaller data types:

| Type | Bits | Size | Speed | Accuracy |
|---|---|---|---|---|
| FP32 | 32 | 100% | 1x | Baseline |
| FP16 | 16 | 50% | 1.5-2x | ~Same |
| INT8 | 8 | 25% | 2-4x | Slight loss |
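The size column follows directly from the bit widths. A minimal NumPy sketch of symmetric INT8 quantization shows both the 4x size reduction and the small rounding error it introduces (illustrative only; real toolchains calibrate scales per layer or per channel):

```python
import numpy as np

# Simulate symmetric INT8 quantization of FP32 weights.
weights = np.random.randn(1000).astype(np.float32)

scale = np.abs(weights).max() / 127.0            # map max magnitude to 127
q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale           # approximate reconstruction

print(f"FP32 size: {weights.nbytes} bytes")      # 4000 bytes
print(f"INT8 size: {q.nbytes} bytes")            # 1000 bytes (25%)
print(f"Max abs error: {np.abs(weights - dequant).max():.4f}")
```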

When to Use Each Type

As a rule of thumb: keep FP32 as your training and accuracy baseline; use FP16 on modern GPUs, where it roughly halves model size with little accuracy risk; and reach for INT8 when size and latency matter most (mobile, edge, CPU inference), validating accuracy on a held-out set afterwards.

Model Pruning

Remove unnecessary weights to reduce model size:

```python
# Prune the model (zero out low-magnitude weights).
# Assumes `client` is an authenticated SeeMe.ai SDK client created earlier.
pruned_model = client.prune_model(
    model_id=model.id,
    version_id=version.id,
    sparsity=0.5,  # remove 50% of weights
    output_path="./model_pruned.onnx",
)

print(f"Original size: {pruned_model.original_size_mb:.1f}MB")
print(f"Pruned size: {pruned_model.pruned_size_mb:.1f}MB")
print(f"Accuracy impact: {pruned_model.accuracy_delta:.2%}")
```
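Under the hood, magnitude pruning is conceptually simple: zero out the smallest-magnitude fraction of weights. A generic NumPy sketch (not the SDK's actual implementation):

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(10_000).astype(np.float32)
pruned = magnitude_prune(w, sparsity=0.5)
print(f"Zeroed: {np.mean(pruned == 0):.1%}")  # roughly 50%
```

In practice, pruned weights are stored in sparse formats (or pruned structurally, removing whole channels) to realize the size and speed benefits.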

Optimization Checklist

Before deploying, verify your optimized model:

  • Correct output format for target platform
  • Accuracy acceptable after quantization
  • Inference speed meets requirements
  • Model size fits deployment constraints
  • Input/output shapes match application

Test Optimized Model

```python
# Test the optimized ONNX model locally with ONNX Runtime.
import numpy as np
import onnxruntime as ort
from PIL import Image

# Load the optimized model
session = ort.InferenceSession("./model.onnx")
input_name = session.get_inputs()[0].name  # avoids hardcoding "input"

# Prepare input: force RGB (in case of alpha channel), resize, scale to [0, 1]
# (adjust the preprocessing to match what your model was trained with)
image = Image.open("test.jpg").convert("RGB").resize((224, 224))
input_array = np.array(image).astype(np.float32) / 255.0
input_array = input_array.transpose(2, 0, 1)   # HWC -> CHW
input_array = input_array[np.newaxis, ...]     # add batch dimension

# Run inference
outputs = session.run(None, {input_name: input_array})
predictions = outputs[0]

# Get top prediction
top_idx = int(np.argmax(predictions[0]))
confidence = predictions[0][top_idx]
print(f"Prediction: {top_idx}, Confidence: {confidence:.2%}")
```

Next Step

4. Deploy →