# Train & Monitor
Start training your model and monitor its progress in real time.
## Start Training
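Starting a run is a single `create_job` call. A minimal sketch, reusing the `client`, `dataset`, and `version` objects from the earlier steps of this guide (the config values here are illustrative, not recommendations):

```python
# Submit a fine-tuning job (config values are illustrative)
job = client.create_job(
    dataset_id=dataset.id,
    version_id=version.id,
    name="Product Classifier v1",
    job_type="finetune",
    config={
        "base_model": "efficientnet_b0",
        "epochs": 20,
        "batch_size": 32,
        "learning_rate": 0.001,
    },
)
print(f"Started job {job.id} (status: {job.status})")
```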
## Monitor Training Progress
### Real-Time Monitoring
```python
import time

def monitor_training(client, job_id, poll_interval=30):
    """Monitor a training job with live updates."""
    job = client.get_job(job_id)
    last_epoch = 0

    print(f"Training job: {job.name}")
    print(f"Status: {job.status}")
    print("-" * 60)
    print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12} {'Val Acc':<12}")
    print("-" * 60)

    while job.status in ["pending", "queued", "running"]:
        time.sleep(poll_interval)
        job = client.get_job(job_id)

        # Print a row whenever a new epoch has finished
        if job.metrics and job.metrics.get('epoch', 0) > last_epoch:
            last_epoch = job.metrics['epoch']
            print(f"{last_epoch:<8} "
                  f"{job.metrics.get('train_loss', 0):<12.4f} "
                  f"{job.metrics.get('val_loss', 0):<12.4f} "
                  f"{job.metrics.get('val_accuracy', 0):<12.2%}")

    print("-" * 60)
    print(f"Final status: {job.status}")
    if job.status == "completed":
        print(f"\nBest validation accuracy: {job.best_metrics['val_accuracy']:.2%}")
        print(f"Model ID: {job.model_id}")
    return job

# Monitor your job
job = monitor_training(client, job.id)
```

### Training Dashboard
The web platform provides real-time visualizations:
- **Loss curves** - Training and validation loss over epochs
- **Accuracy curves** - Training and validation accuracy
- **Learning rate** - Current learning rate (if using a scheduler)
- **GPU utilization** - Memory and compute usage
- **ETA** - Estimated time to completion
## Understanding Training Metrics
### Loss
Measures how wrong the model’s predictions are. Lower is better.
| Metric | What It Means |
|---|---|
| Training loss | Error on training data |
| Validation loss | Error on held-out data |
**What to watch for:**

- **Good training:** train loss ↓, val loss ↓ (both decreasing)
- **Overfitting:** train loss ↓, val loss ↑ (validation increasing while training decreases)
- **Underfitting:** train loss →, val loss → (both plateau early at high values)

### Accuracy
Percentage of correct predictions.
| Metric | What It Means |
|---|---|
| Training accuracy | Correct predictions on training data |
| Validation accuracy | Correct predictions on held-out data |
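These loss and accuracy patterns can also be checked programmatically. A rough heuristic sketch built on the `get_job_metrics` call shown later in this guide (the window size and thresholds here are arbitrary choices, not platform defaults):

```python
def diagnose_run(history, window=5):
    """Classify the recent val-loss trend (illustrative thresholds)."""
    if len(history) < 2 * window:
        return "too early to tell"
    prev = [m["val_loss"] for m in history[-2 * window:-window]]
    recent = [m["val_loss"] for m in history[-window:]]
    train = [m["train_loss"] for m in history[-window:]]
    # Val loss rising while train loss keeps falling -> overfitting
    if sum(recent) / window > sum(prev) / window and train[-1] < train[0]:
        return "overfitting: val loss rising while train loss falls"
    # Barely any movement over the last 2*window epochs -> plateau
    if abs(recent[-1] - prev[0]) < 0.01:
        return "plateau: consider training longer or unfreezing layers"
    return "healthy: val loss still decreasing"

history = client.get_job_metrics(job_id=job.id)
print(diagnose_run(history))
```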
### Learning Curves
```mermaid
graph LR
    subgraph "Healthy Training"
        A[Loss] --> B[Both curves<br/>decrease together]
    end
    subgraph "Overfitting"
        C[Loss] --> D[Train ↓ Val ↑<br/>Gap widens]
    end
    subgraph "Underfitting"
        E[Loss] --> F[Both plateau<br/>high values]
    end
```
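If you prefer to plot the curves yourself rather than use the dashboard, the metrics history is enough. A sketch using matplotlib (assumed to be installed separately; it is not part of the platform SDK):

```python
import matplotlib.pyplot as plt

# Plot train/val loss per epoch from the metrics history
history = client.get_job_metrics(job_id=job.id)
epochs = [m["epoch"] for m in history]
plt.plot(epochs, [m["train_loss"] for m in history], label="train loss")
plt.plot(epochs, [m["val_loss"] for m in history], label="val loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.title("Learning curves")
plt.show()
```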
## Handling Common Issues

### Overfitting
**Symptoms:** Validation loss increases while training loss decreases.
**Solutions:**

```python
# Add more regularization
config = {
    "dropout": 0.4,                  # Increase dropout
    "weight_decay": 0.05,            # Stronger L2 regularization
    "label_smoothing": 0.2,          # Softer labels
    "early_stopping": True,
    "early_stopping_patience": 3,    # Stop sooner
}

# More data augmentation
config["augmentation"] = {
    "horizontal_flip": True,
    "rotation": 30,
    "brightness": 0.3,
    "cutout": 0.5,
}

# Freeze more layers
config["freeze_layers"] = "most"
```

### Underfitting
**Symptoms:** Both losses plateau at high values early in training.
**Solutions:**

```python
# Train longer, unfreeze more
config = {
    "epochs": 50,
    "freeze_layers": "none",    # Train all layers
    "learning_rate": 0.01,      # Higher learning rate
}

# Use a larger model
config["base_model"] = "efficientnet_b2"    # Instead of b0
```

### Unstable Training
**Symptoms:** Loss jumps around wildly and doesn’t converge.
**Solutions:**

```python
config = {
    "learning_rate": 0.0001,    # Lower learning rate
    "batch_size": 64,           # Larger batch for stable gradients
    "lr_scheduler": "cosine",   # Smooth decay
    "gradient_clipping": 1.0,   # Prevent exploding gradients
}
```

### Out of Memory
**Symptoms:** Training fails with a CUDA out-of-memory error.
**Solutions:**

```python
config = {
    "batch_size": 16,              # Reduce batch size
    "gradient_accumulation": 4,    # Accumulate over 4 steps (effective batch = 64)
    "mixed_precision": True,       # Use FP16 to reduce memory
}
```
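With these settings, gradients are accumulated over 4 micro-batches of 16 before each optimizer step, so the effective batch size stays at 16 × 4 = 64 while only a single micro-batch of 16 has to fit in GPU memory at any time.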
## Checkpoints and Resuming

### Save Checkpoints
```python
config = {
    "save_checkpoints": True,
    "checkpoint_frequency": 5,     # Save every 5 epochs
    "keep_best_checkpoints": 3,    # Keep top 3 by validation metric
}
```

### Resume from Checkpoint
```python
# Resume interrupted training
job = client.create_job(
    dataset_id=dataset.id,
    version_id=version.id,
    name="Product Classifier v1 (resumed)",
    job_type="finetune",
    config={
        **original_config,
        "resume_from_checkpoint": "job_abc123/checkpoint_epoch_15.pt",
    },
)
```

### Load Best Checkpoint
```python
# After training, get the best model
job = client.get_job(job_id)
if job.status == "completed":
    # Best model is automatically selected
    model = client.get_model(job.model_id)
    print(f"Best model: {model.id}")
    print(f"Best epoch: {job.best_epoch}")
    print(f"Best val accuracy: {job.best_metrics['val_accuracy']:.2%}")
```

## View Training Logs
```python
# Get detailed training logs
logs = client.get_job_logs(job_id=job.id)
for log in logs:
    print(f"[{log.timestamp}] {log.message}")

# Get metrics history
metrics = client.get_job_metrics(job_id=job.id)
for epoch_metrics in metrics:
    print(f"Epoch {epoch_metrics['epoch']}: "
          f"train_loss={epoch_metrics['train_loss']:.4f}, "
          f"val_loss={epoch_metrics['val_loss']:.4f}, "
          f"val_accuracy={epoch_metrics['val_accuracy']:.2%}")
```

## Multiple Training Runs
Compare different configurations:
```python
# Run multiple experiments
experiments = [
    {"name": "lr_high", "learning_rate": 0.01},
    {"name": "lr_medium", "learning_rate": 0.001},
    {"name": "lr_low", "learning_rate": 0.0001},
]

jobs = []
for exp in experiments:
    job = client.create_job(
        dataset_id=dataset.id,
        version_id=version.id,
        name=f"Experiment: {exp['name']}",
        config={
            "base_model": "efficientnet_b0",
            "epochs": 20,
            "batch_size": 32,
            **exp,
        },
    )
    jobs.append(job)
    print(f"Started: {exp['name']} (job {job.id})")

# Wait for all to complete
for job in jobs:
    while client.get_job(job.id).status in ["pending", "queued", "running"]:
        time.sleep(60)

# Compare results
print("\nResults:")
print(f"{'Experiment':<20} {'Val Accuracy':<15} {'Val Loss':<15}")
print("-" * 50)
for job in jobs:
    job = client.get_job(job.id)
    print(f"{job.name:<20} "
          f"{job.best_metrics.get('val_accuracy', 0):<15.2%} "
          f"{job.best_metrics.get('val_loss', 0):<15.4f}")
```
## Best Practices

- **Watch validation loss** - It’s your best indicator of generalization
- **Use early stopping** - Don’t waste time overtraining
- **Save checkpoints** - You may want to go back to an earlier epoch
- **Log everything** - You’ll thank yourself later when comparing runs (see the sketch after this list)
- **Start with defaults** - Tune hyperparameters one at a time
- **Don’t trust training accuracy** - Only validation metrics matter
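For the "log everything" habit, a lightweight option is to dump each run's per-epoch metrics to disk. A sketch (the CSV layout is just one convenient choice):

```python
import csv

# Persist a run's metrics history for later comparison
history = client.get_job_metrics(job_id=job.id)
with open(f"metrics_{job.id}.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["epoch", "train_loss", "val_loss", "val_accuracy"]
    )
    writer.writeheader()
    for row in history:
        writer.writerow({k: row.get(k) for k in writer.fieldnames})
```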
## Next Step
Once training completes, proceed to Evaluate Results to thoroughly assess your model’s performance.