After training, optimize your model for different deployment targets. This section covers format conversion, quantization, and performance tuning.
## Why Optimize?
| Goal | Optimization |
|------|--------------|
| Run on mobile | Convert to Core ML / TFLite |
| Reduce inference time | Quantization |
| Use with NVIDIA GPUs | TensorRT optimization |
| Interoperability | ONNX export |
| Reduce model size | Pruning, quantization |
## Export Formats
SeeMe.ai supports exporting to multiple formats:
```mermaid
graph TD
    A[Trained Model] --> B[PyTorch]
    A --> C[ONNX]
    A --> D[Core ML]
    A --> E[TensorFlow Lite]
    A --> F[TensorRT]
    C --> G[ONNX Runtime]
    D --> H[iOS/macOS]
    E --> I[Android/Edge]
    F --> J[NVIDIA GPUs]
```
## Using the Web Platform

### Export to Different Formats
1. Navigate to **Models** > Your Model
2. Click the **Versions** tab
3. Select the version to export
4. Click the **Export** button
5. Choose format(s):
   - PyTorch (`.pt`)
   - ONNX (`.onnx`)
   - Core ML (`.mlmodel`)
   - TensorFlow Lite (`.tflite`)
   - TensorRT (`.engine`)
### Apply Quantization
Quantization reduces model size and speeds up inference:
1. In the **Export** dialog, expand **Optimization**
2. Select a quantization type:

   | Type | Size Reduction | Accuracy Impact | Best For |
   |------|----------------|-----------------|----------|
   | None | 0% | None | Maximum accuracy |
   | Dynamic | 2-4x | Minimal | General use |
   | Static (INT8) | 4x | Small | Edge devices |
   | FP16 | 2x | None | GPU inference |

3. Click **Export**
### Download Optimized Model

After the export completes, find your model under **Downloads**.
## Using the Python SDK

```python
from seeme import Client

client = Client()
model = client.get_model("your-model-id")
version = model.versions[0]

# Export to ONNX
onnx_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model.onnx",
)
print(f"Exported to: {onnx_path}")

# Export to Core ML (iOS)
coreml_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="coreml",
    output_path="./model.mlmodel",
)

# Export to TensorFlow Lite (Android)
tflite_path = client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    output_path="./model.tflite",
)
```
### Apply Quantization
```python
# Export with INT8 quantization (smallest size)
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    output_path="./model_int8.tflite",
    quantization="int8",
)

# Export with FP16 quantization (GPU-optimized)
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model_fp16.onnx",
    quantization="fp16",
)

# Export with dynamic quantization
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    output_path="./model_dynamic.onnx",
    quantization="dynamic",
)
```
## Using the API

```bash
# Export model to ONNX format
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "onnx"
  }'

# Export to Core ML (iOS)
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "coreml"
  }'

# Export to TFLite with INT8 quantization
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/export" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "tflite",
    "quantization": "int8"
  }'

# Download exported model
curl -O "https://api.seeme.ai/api/v1/models/{model_id}/versions/{version_id}/download?format=onnx" \
  -H "Authorization: myusername:my-api-key"

# Benchmark model performance
curl -X POST "https://api.seeme.ai/api/v1/models/{model_id}/benchmark" \
  -H "Authorization: myusername:my-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "version_id": "your-version-id",
    "format": "onnx",
    "num_runs": 100
  }'
```
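The benchmark endpoint above reports latency measured server-side. For a quick local sanity check before deploying, you can time inference yourself with the standard library. A minimal sketch (the `benchmark` helper and the dummy workload are illustrative, not part of the SeeMe.ai SDK; swap in your model's predict call):

```python
import time
import statistics

def benchmark(run_inference, num_runs=100, warmup=10):
    """Time a callable over num_runs and report latency stats (mirrors the API's num_runs parameter)."""
    for _ in range(warmup):          # warm up caches before measuring
        run_inference()
    latencies = []
    for _ in range(num_runs):
        start = time.perf_counter()
        run_inference()
        latencies.append((time.perf_counter() - start) * 1000)  # milliseconds
    return {
        "mean_ms": statistics.mean(latencies),
        "p95_ms": sorted(latencies)[int(0.95 * num_runs) - 1],
    }

# Dummy workload standing in for model inference
stats = benchmark(lambda: sum(i * i for i in range(1000)), num_runs=50)
print(f"mean: {stats['mean_ms']:.3f} ms, p95: {stats['p95_ms']:.3f} ms")
```

Reporting a percentile alongside the mean matters for deployment: a model whose p95 latency misses your budget will feel slow even if the average looks fine.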
## Format Comparison
| Format | Platform | Pros | Cons |
|--------|----------|------|------|
| PyTorch | Python/Server | Full features, easy debugging | Large, Python-only |
| ONNX | Cross-platform | Portable, wide support | Some ops not supported |
| Core ML | iOS/macOS | Apple-optimized, Neural Engine | Apple-only |
| TFLite | Android/Edge | Small, efficient | Limited ops |
| TensorRT | NVIDIA GPUs | Fastest on NVIDIA | NVIDIA-only |
### Choosing the Right Format
```mermaid
graph TD
    A{Where will you deploy?} --> B[iOS/macOS]
    A --> C[Android]
    A --> D[Web/Server]
    A --> E[NVIDIA GPU]
    A --> F[Edge Device]
    B --> G[Core ML]
    C --> H[TensorFlow Lite]
    D --> I[ONNX]
    E --> J[TensorRT]
    F --> H
```
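The decision tree above can be expressed as a small lookup helper. This is an illustrative sketch (the `choose_format` function and its target-name strings are hypothetical, not SDK API); it defaults to ONNX, the most portable option:

```python
def choose_format(target: str) -> str:
    """Recommend an export format for a deployment target (mirrors the decision tree)."""
    mapping = {
        "ios": "coreml",
        "macos": "coreml",
        "android": "tflite",
        "edge": "tflite",
        "web": "onnx",
        "server": "onnx",
        "nvidia-gpu": "tensorrt",
    }
    # ONNX is the safest default: portable with wide runtime support
    return mapping.get(target.lower(), "onnx")

print(choose_format("iOS"))      # -> coreml
print(choose_format("android"))  # -> tflite
```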
## Quantization Deep Dive

### What is Quantization?
Quantization converts model weights from 32-bit floats to smaller data types:
| Type | Bits | Size | Speed | Accuracy |
|------|------|------|-------|----------|
| FP32 | 32 | 100% | 1x | Baseline |
| FP16 | 16 | 50% | 1.5-2x | Same |
| INT8 | 8 | 25% | 2-4x | Slight loss |
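The size column follows directly from the bit widths: model size is roughly parameter count times bytes per weight. A quick back-of-the-envelope check (25M parameters is about the size of ResNet-50; the `model_size_mb` helper is just for illustration):

```python
def model_size_mb(num_params: int, bits_per_weight: int) -> float:
    """Approximate on-disk size of the weights alone, in megabytes."""
    return num_params * bits_per_weight / 8 / 1e6

params = 25_000_000  # roughly ResNet-50-sized
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8)]:
    print(f"{name}: {model_size_mb(params, bits):.0f} MB")
# -> FP32: 100 MB, FP16: 50 MB, INT8: 25 MB
```

Real exports differ slightly because of metadata, non-weight tensors, and ops that stay in FP32, but the ratios in the table hold.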
### When to Use Each Type

#### FP16 (Half Precision)

**Best for:** GPU inference
```python
# FP16 maintains accuracy while halving model size
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    quantization="fp16",
    output_path="./model_fp16.onnx",
)
```
**Pros:**

- No accuracy loss for most models
- 2x smaller model
- Faster on modern GPUs

**Cons:**

- Not all CPUs support FP16 natively
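FP16 keeps about three decimal digits of precision, which is why accuracy is usually unaffected: weight values survive the round trip almost exactly while storage halves. A stdlib-only sketch (the `to_fp16_and_back` helper is illustrative; `struct` format `'e'` is IEEE 754 half precision):

```python
import struct

def to_fp16_and_back(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

w = 0.123456789                      # a typical weight magnitude
w16 = to_fp16_and_back(w)
rel_err = abs(w16 - w) / abs(w)
print(f"FP32: {w}, FP16: {w16:.6f}, relative error: {rel_err:.2e}")
# Storage: 4 bytes per weight at FP32 vs 2 bytes at FP16 -> 2x smaller
```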
#### INT8 Quantization

**Best for:** Edge devices, mobile
```python
# INT8 requires calibration data for best results
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="tflite",
    quantization="int8",
    calibration_dataset_id=dataset.id,  # for calibration
    output_path="./model_int8.tflite",
)
```
**Pros:**

- 4x smaller model
- 2-4x faster inference
- Works on all CPUs

**Cons:**

- Small accuracy loss (typically <1%)
- Needs calibration for best results
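Calibration matters because INT8 maps floats onto just 255 levels, so the scale must match the range the tensors actually take. A pure-Python sketch of symmetric INT8 quantization (the helper names are illustrative, not SDK API); the "calibration" step here is simply observing the maximum magnitude:

```python
def quantize_int8(values, scale):
    """Map floats to int8 in [-127, 127] using a symmetric scale."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize_int8(quantized, scale):
    return [q * scale for q in quantized]

weights = [0.42, -1.3, 0.07, 2.5, -0.9]
scale = max(abs(w) for w in weights) / 127   # calibration: derive scale from observed range
q = quantize_int8(weights, scale)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"int8: {q}, max round-trip error: {max_err:.4f}")
```

The round-trip error is bounded by half the scale, which is why a scale calibrated to the true value range loses so little accuracy, while a badly chosen scale either clips large values or wastes resolution.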
#### Dynamic Quantization

**Best for:** Quick optimization without calibration
```python
# Dynamic quantization is applied at runtime
client.export_model(
    model_id=model.id,
    version_id=version.id,
    format="onnx",
    quantization="dynamic",
    output_path="./model_dynamic.onnx",
)
```
**Pros:**

- No calibration needed
- Good balance of size/speed/accuracy

**Cons:**

- Less optimal than static INT8
## Model Pruning
Remove unnecessary weights to reduce model size:
```python
# Prune model (remove sparse weights)
pruned_model = client.prune_model(
    model_id=model.id,
    version_id=version.id,
    sparsity=0.5,  # remove 50% of weights
    output_path="./model_pruned.onnx",
)
print(f"Original size: {pruned_model.original_size_mb:.1f}MB")
print(f"Pruned size: {pruned_model.pruned_size_mb:.1f}MB")
print(f"Accuracy impact: {pruned_model.accuracy_delta:.2%}")
```
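Under the hood, the most common approach is magnitude pruning: zero out the fraction of weights with the smallest absolute values, since they contribute least to the output. A pure-Python sketch of that core idea (illustrative only; the server-side `prune_model` call may use a different strategy and typically fine-tunes afterwards to recover accuracy):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    k = int(len(weights) * sparsity)           # number of weights to zero
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.002, 0.3, -0.08]
pruned = magnitude_prune(weights, sparsity=0.5)
print(pruned)  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.3, 0.0]
```

Note that zeroed weights only shrink the file if the format stores sparse tensors or the export is compressed; the speedup likewise depends on runtime support for sparsity.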
## Optimization Checklist
Before deploying, verify your optimized model:
- [ ] Correct output format for target platform
- [ ] Accuracy acceptable after quantization
- [ ] Inference speed meets requirements
- [ ] Model size fits deployment constraints
- [ ] Input/output shapes match application
### Test Optimized Model
```python
# Test the optimized model locally
import onnxruntime as ort
import numpy as np
from PIL import Image

# Load the optimized model
session = ort.InferenceSession("./model.onnx")

# Prepare input
image = Image.open("test.jpg").resize((224, 224))
input_array = np.array(image).astype(np.float32)
input_array = input_array.transpose(2, 0, 1)  # HWC -> CHW
input_array = input_array[np.newaxis, ...]    # Add batch dimension

# Run inference
outputs = session.run(None, {"input": input_array})
predictions = outputs[0]

# Get top prediction
top_idx = np.argmax(predictions[0])
confidence = predictions[0][top_idx]
print(f"Prediction: {top_idx}, Confidence: {confidence:.2%}")
```