After training, optimize your model for different deployment targets. This section covers format conversion, quantization, and performance tuning.
## Why Optimize?
| Goal | Optimization |
|------|--------------|
| Run on mobile | Convert to Core ML / TFLite |
| Reduce inference time | Quantization |
| Use with NVIDIA GPUs | TensorRT optimization |
| Interoperability | ONNX export |
| Reduce model size | Pruning, quantization |
## Export Formats
SeeMe.ai supports exporting to multiple formats:
```mermaid
graph TD
    A[Trained Model] --> B[PyTorch]
    A --> C[ONNX]
    A --> D[Core ML]
    A --> E[TensorFlow Lite]
    A --> F[TensorRT]
    C --> G[ONNX Runtime]
    D --> H[iOS/macOS]
    E --> I[Android/Edge]
    F --> J[NVIDIA GPUs]
```
## Using the Web Platform

### Export to Different Formats
1. Navigate to **Models** > Your Model
2. Click the **Versions** tab
3. Select the version to export
4. Click the **Export** button
5. Choose format(s):
   - PyTorch (`.pt`)
   - ONNX (`.onnx`)
   - Core ML (`.mlmodel`)
   - TensorFlow Lite (`.tflite`)
   - TensorRT (`.engine`)
### Apply Quantization
Quantization reduces model size and speeds up inference:
1. In the **Export** dialog, expand **Optimization**
2. Select a quantization type:

   | Type | Size Reduction | Accuracy Impact | Best For |
   |------|----------------|-----------------|----------|
   | None | 0% | None | Maximum accuracy |
   | Dynamic | 2-4x | Minimal | General use |
   | Static (INT8) | 4x | Small | Edge devices |
   | FP16 | 2x | None | GPU inference |

3. Click **Export**
### Download Optimized Model

After the export completes, find your model under **Downloads**.
## Using the Python SDK

```python
from seeme import Client

client = Client()
model = client.get_model("your-model-id")
version = model.versions[0]

# Export to ONNX
onnx_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model.onnx",
)
print(f"Exported to: {onnx_path}")

# Export to Core ML (iOS)
coreml_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="coreml",
    output_path="./model.mlmodel",
)

# Export to TensorFlow Lite (Android)
tflite_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    output_path="./model.tflite",
)
```
### Apply Quantization
```python
# Export with INT8 quantization (smallest size)
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    output_path="./model_int8.tflite",
    quantization="int8",
)

# Export with FP16 quantization (GPU-optimized)
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model_fp16.onnx",
    quantization="fp16",
)

# Export with dynamic quantization
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model_dynamic.onnx",
    quantization="dynamic",
)
```
## Using the API

```bash
# Export model to ONNX format
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "onnx"
  }'

# Export to Core ML (iOS)
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "coreml"
  }'

# Export to TFLite with INT8 quantization
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "tflite",
    "quantization": "int8"
  }'

# Download exported model
curl -O "https://api.seeme.ai/api/v1/models/{model_id}/versions/{version_id}/download?format=onnx" \
  -H "Authorization: myusername:my-api-key"

# Benchmark model performance
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/benchmark" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "onnx",
    "num_runs": 100
  }'
```
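The benchmark endpoint above reports latency measured server-side. For a quick local sanity check before deploying, you can time inference yourself with the standard library. A minimal sketch (the `benchmark` helper and the dummy workload are illustrative, not part of the SeeMe.ai SDK; swap in your model's predict call):

```python
import time
import statistics

def benchmark(run_inference, num_runs=100, warmup=10):
    """Time a callable over num_runs and report latency stats (mirrors the API's num_runs parameter)."""
    for _ in range(warmup):          # warm up caches before measuring
        run_inference()
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_inference()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    return {
        "mean_ms": statistics.mean(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * num_runs) - 1],
    }

# Dummy workload standing in for model inference
stats = benchmark(lambda: sum(i * i for i in range(1000)), num_runs=50)
print(f"mean: {stats['mean_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

Reporting a percentile alongside the mean matters for deployment: a model whose p95 latency misses your budget will feel slow even if the average looks fine.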
## Format Comparison
| Format | Platform | Pros | Cons |
|--------|----------|------|------|
| PyTorch | Python/Server | Full features, easy debugging | Large, Python-only |
| ONNX | Cross-platform | Portable, wide support | Some ops not supported |
| Core ML | iOS/macOS | Apple-optimized, Neural Engine | Apple-only |
| TFLite | Android/Edge | Small, efficient | Limited ops |
| TensorRT | NVIDIA GPUs | Fastest on NVIDIA | NVIDIA-only |
### Choosing the Right Format
```mermaid
graph TD
    A{Where will you deploy?} --> B[iOS/macOS]
    A --> C[Android]
    A --> D[Web/Server]
    A --> E[NVIDIA GPU]
    A --> F[Edge Device]
    B --> G[Core ML]
    C --> H[TensorFlow Lite]
    D --> I[ONNX]
    E --> J[TensorRT]
    F --> H
```
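The decision tree above can be expressed as a small lookup helper. This is an illustrative sketch (the `choose_format` function and its target-name strings are hypothetical, not SDK API); it defaults to ONNX, the most portable option:

```python
def choose_format(target: str) -> str:
    """Recommend an export format for a deployment target (mirrors the decision tree)."""
    mapping = {
        "ios": "coreml",
        "macos": "coreml",
        "android": "tflite",
        "edge": "tflite",
        "web": "onnx",
        "server": "onnx",
        "nvidia-gpu": "tensorrt",
    }
    # ONNX is the safest default: portable with wide runtime support
    return mapping.get(target.lower(), "onnx")

print(choose_format("iOS"))      # -> coreml
print(choose_format("android"))  # -> tflite
```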
## Quantization Deep Dive

### What is Quantization?
Quantization converts model weights from 32-bit floats to smaller data types:
| Type | Bits | Size | Speed | Accuracy |
|------|------|------|-------|----------|
| FP32 | 32 | 100% | 1x | Baseline |
| FP16 | 16 | 50% | 1.5-2x | Same |
| INT8 | 8 | 25% | 2-4x | Slight loss |
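The size column follows directly from the bit widths: model size is roughly parameter count times bytes per weight. A quick back-of-the-envelope check (25M parameters is about the size of ResNet-50; the `model_size_mb` helper is just for illustration):

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate on-disk size of the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

params = 25_000_000  # roughly ResNet-50-sized
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {model_size_mb(params, bits):.0f} MB")
# -> FP32: 100 MB, FP16: 50 MB, INT8: 25 MB
```

Real exports differ slightly because of metadata, non-weight tensors, and ops that stay in FP32, but the ratios in the table hold.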
### When to Use Each Type

#### FP16 (Half Precision)

**Best for:** GPU inference
```python
# FP16 maintains accuracy while halving model size
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    quantization="fp16",
    output_path="./model_fp16.onnx",
)
```
**Pros:**

- No accuracy loss for most models
- 2x smaller model
- Faster on modern GPUs

**Cons:**

- Not all CPUs support FP16 natively
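FP16 keeps about three decimal digits of precision, which is why accuracy is usually unaffected: weight values survive the round trip almost exactly while storage halves. A stdlib-only sketch (the `to_fp16_and_back` helper is illustrative; `struct` format `'e'` is IEEE 754 half precision):

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

w = 0.123456789                      # a typical weight magnitude
w16 = to_fp16_and_back(w)
rel_err = abs(w16 - w) / abs(w)
print(f"FP32: {w}, FP16: {w16:.6f}, relative error: {rel_err:.2e}")
# Storage: 4 bytes per weight at FP32 vs 2 bytes at FP16 -> 2x smaller
```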
#### INT8 Quantization

**Best for:** Edge devices, mobile
```python
# INT8 requires calibration data for best results
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    quantization="int8",
    calibration_dataset_id=dataset.id,  # for calibration
    output_path="./model_int8.tflite",
)
```
**Pros:**

- 4x smaller model
- 2-4x faster inference
- Works on all CPUs

**Cons:**

- Small accuracy loss (typically <1%)
- Needs calibration for best results
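Calibration matters because INT8 maps floats onto just 255 levels, so the scale must match the range the tensors actually take. A pure-Python sketch of symmetric INT8 quantization (the helper names are illustrative, not SDK API); the "calibration" step here is simply observing the maximum magnitude:

```python
def quantize_int8(values, scale):
    """Map floats to int8 in [-127, 127] using a symmetric scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.42, -1.3, 0.07, 2.5, -0.9]
scale = max(abs(w) for w in weights) / 127   # calibration: derive scale from observed range
q = quantize_int8(weights, scale)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"int8: {q}, max round-trip error: {max_err:.4f}")
```

The round-trip error is bounded by half the scale, which is why a scale calibrated to the true value range loses so little accuracy, while a badly chosen scale either clips large values or wastes resolution.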
#### Dynamic Quantization

**Best for:** Quick optimization without calibration
```python
# Dynamic quantization is applied at runtime
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    quantization="dynamic",
    output_path="./model_dynamic.onnx",
)
```
**Pros:**

- No calibration needed
- Good balance of size/speed/accuracy

**Cons:**

- Less optimal than static INT8
## Model Pruning
Remove unnecessary weights to reduce model size:
```python
# Prune model (remove sparse weights)
pruned_model = client.prune_model(
    model_id=model.id,
    version_id=version.id,
    sparsity=0.5,  # remove 50% of weights
    output_path="./model_pruned.onnx",
)
print(f"Original size: {pruned_model.original_size_mb:.1f}MB")
print(f"Pruned size: {pruned_model.pruned_size_mb:.1f}MB")
print(f"Accuracy impact: {pruned_model.accuracy_delta:.2%}")
```
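Under the hood, the most common approach is magnitude pruning: zero out the fraction of weights with the smallest absolute values, since they contribute least to the output. A pure-Python sketch of that core idea (illustrative only; the server-side `prune_model` call may use a different strategy and typically fine-tunes afterwards to recover accuracy):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(len(weights) * sparsity)           # number of weights to zero
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002, 0.3, -0.08]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.3, 0.0]
```

Note that zeroed weights only shrink the file if the format stores sparse tensors or the export is compressed; the speedup likewise depends on runtime support for sparsity.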
## Optimization Checklist
Before deploying, verify your optimized model:
- [ ] Correct output format for target platform
- [ ] Accuracy acceptable after quantization
- [ ] Inference speed meets requirements
- [ ] Model size fits deployment constraints
- [ ] Input/output shapes match application
### Test Optimized Model
```python
# Test the optimized model locally
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the optimized model
session = ort.InferenceSession("./model.onnx")

# Prepare input
image = Image.open("test.jpg").resize((224, 224))
input_array = np.array(image).astype(np.float32)
input_array = input_array.transpose(2, 0, 1)  # HWC -> CHW
input_array = input_array[np.newaxis, ...]    # Add batch dimension

# Run inference
outputs = session.run(None, {"input": input_array})
predictions = outputs[0]

# Get top prediction
top_idx = np.argmax(predictions[0])
confidence = predictions[0][top_idx]
print(f"Prediction: {top_idx}, Confidence: {confidence:.2%}")
```