Train & Monitor

Start training your model and monitor progress in real time.

Start Training
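
If you have not started a training job yet, the sketch below shows one way to kick one off. It is a minimal example rather than the only valid configuration: it assumes the client, dataset, and version objects from the earlier setup steps and reuses config fields that appear later in this guide.

# Start a fine-tuning job (assumes client, dataset, and version from earlier steps)
job = client.create_job(
    dataset_id=dataset.id,
    version_id=version.id,
    name="Product Classifier v1",
    job_type="finetune",
    config={
        "base_model": "efficientnet_b0",
        "epochs": 20,
        "batch_size": 32,
        "learning_rate": 0.001
    }
)
print(f"Started job {job.id} (status: {job.status})")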

Monitor Training Progress

Real-Time Monitoring

import time

def monitor_training(client, job_id, poll_interval=30):
    """Monitor training job with live updates."""
    job = client.get_job(job_id)
    last_epoch = 0

    print(f"Training job: {job.name}")
    print(f"Status: {job.status}")
    print("-" * 60)
    print(f"{'Epoch':<8} {'Train Loss':<12} {'Val Loss':<12} {'Val Acc':<12}")
    print("-" * 60)

    while job.status in ["pending", "queued", "running"]:
        time.sleep(poll_interval)
        job = client.get_job(job_id)

        if job.metrics and job.metrics.get('epoch', 0) > last_epoch:
            last_epoch = job.metrics['epoch']
            print(f"{last_epoch:<8} "
                  f"{job.metrics.get('train_loss', 0):<12.4f} "
                  f"{job.metrics.get('val_loss', 0):<12.4f} "
                  f"{job.metrics.get('val_accuracy', 0):<12.2%}")

    print("-" * 60)
    print(f"Final status: {job.status}")

    if job.status == "completed":
        print(f"\nBest validation accuracy: {job.best_metrics['val_accuracy']:.2%}")
        print(f"Model ID: {job.model_id}")

    return job

# Monitor your job
job = monitor_training(client, job.id)

Training Dashboard

The web platform provides real-time visualizations:

  • Loss curves - Training and validation loss over epochs
  • Accuracy curves - Training and validation accuracy
  • Learning rate - Current learning rate (if using scheduler)
  • GPU utilization - Memory and compute usage
  • ETA - Estimated time to completion

Understanding Training Metrics

Loss

Measures how wrong the model’s predictions are. Lower is better.

Metric            What It Means
Training loss     Error on training data
Validation loss   Error on held-out data

What to watch for:

Good training:
  Train loss ↓  Val loss ↓  (both decreasing)

Overfitting:
  Train loss ↓  Val loss ↑  (validation increasing while training decreases)

Underfitting:
  Train loss →  Val loss →  (both plateau early, high values)
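
You can also check for these patterns programmatically. The snippet below is a rough heuristic sketch, not part of the platform API: it reads the per-epoch metrics history (see View Training Logs below) and flags a run where validation loss is rising while training loss keeps falling.

# Heuristic overfitting check using the metrics history (thresholds are illustrative)
def looks_overfit(metrics, window=3):
    """Return True if val loss rose over the last `window` epochs while train loss fell."""
    if len(metrics) < window + 1:
        return False                      # not enough epochs to judge yet
    earlier = metrics[-(window + 1)]
    latest = metrics[-1]
    train_falling = latest["train_loss"] < earlier["train_loss"]
    val_rising = latest["val_loss"] > earlier["val_loss"]
    return train_falling and val_rising

metrics = client.get_job_metrics(job_id=job.id)
if looks_overfit(metrics):
    print("Warning: validation loss is rising while training loss falls (possible overfitting).")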

Accuracy

Percentage of correct predictions.

Metric                What It Means
Training accuracy     Correct predictions on training data
Validation accuracy   Correct predictions on held-out data

Learning Curves

graph LR
    subgraph "Healthy Training"
        A[Loss] --> B[Both curves decrease together]
    end
    subgraph "Overfitting"
        C[Loss] --> D[Train ↓ Val ↑ with widening gap]
    end
    subgraph "Underfitting"
        E[Loss] --> F[Both plateau at high values]
    end
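
To inspect these curves outside the dashboard, you can plot the per-epoch metrics history yourself. A minimal sketch, assuming matplotlib is installed and using the get_job_metrics call described under View Training Logs below:

import matplotlib.pyplot as plt

# Plot training and validation loss per epoch from the metrics history
metrics = client.get_job_metrics(job_id=job.id)
epochs = [m["epoch"] for m in metrics]

plt.plot(epochs, [m["train_loss"] for m in metrics], label="train loss")
plt.plot(epochs, [m["val_loss"] for m in metrics], label="val loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.title("Learning curves")
plt.show()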

Handling Common Issues

Overfitting

Symptoms: Validation loss increases while training loss decreases.

Solutions:

# Add more regularization
config = {
    "dropout": 0.4,           # Increase dropout
    "weight_decay": 0.05,     # Stronger L2 regularization
    "label_smoothing": 0.2,   # Softer labels
    "early_stopping": True,
    "early_stopping_patience": 3  # Stop sooner
}

# More data augmentation
config["augmentation"] = {
    "horizontal_flip": True,
    "rotation": 30,
    "brightness": 0.3,
    "cutout": 0.5
}

# Freeze more layers
config["freeze_layers"] = "most"

Underfitting

Symptoms: Both losses plateau at high values early in training.

Solutions:

# Train longer, unfreeze more
config = {
    "epochs": 50,
    "freeze_layers": "none",    # Train all layers
    "learning_rate": 0.01       # Higher learning rate
}

# Use larger model
config["base_model"] = "efficientnet_b2"  # Instead of b0

Unstable Training

Symptoms: Loss jumps around wildly and fails to converge.

Solutions:

config = {
    "learning_rate": 0.0001,    # Lower learning rate
    "batch_size": 64,           # Larger batch for stable gradients
    "lr_scheduler": "cosine",   # Smooth decay
    "gradient_clipping": 1.0    # Prevent exploding gradients
}

Out of Memory

Symptoms: CUDA out of memory error.

Solutions:

config = {
    "batch_size": 16,           # Reduce batch size
    "gradient_accumulation": 4, # Accumulate over 4 steps (effective batch = 64)
    "mixed_precision": True     # Use FP16 to reduce memory
}

Checkpoints and Resuming

Save Checkpoints

config = {
    "save_checkpoints": True,
    "checkpoint_frequency": 5,      # Save every 5 epochs
    "keep_best_checkpoints": 3      # Keep top 3 by validation metric
}

Resume from Checkpoint

# Resume interrupted training
job = client.create_job(
    dataset_id=dataset.id,
    version_id=version.id,
    name="Product Classifier v1 (resumed)",
    job_type="finetune",
    config={
        **original_config,
        "resume_from_checkpoint": "job_abc123/checkpoint_epoch_15.pt"
    }
)

Load Best Checkpoint

# After training, get the best model
job = client.get_job(job_id)

if job.status == "completed":
    # Best model is automatically selected
    model = client.get_model(job.model_id)
    print(f"Best model: {model.id}")
    print(f"Best epoch: {job.best_epoch}")
    print(f"Best val accuracy: {job.best_metrics['val_accuracy']:.2%}")

View Training Logs

# Get detailed training logs
logs = client.get_job_logs(job_id=job.id)

for log in logs:
    print(f"[{log.timestamp}] {log.message}")

# Get metrics history
metrics = client.get_job_metrics(job_id=job.id)

for epoch_metrics in metrics:
    print(f"Epoch {epoch_metrics['epoch']}: "
          f"train_loss={epoch_metrics['train_loss']:.4f}, "
          f"val_loss={epoch_metrics['val_loss']:.4f}, "
          f"val_accuracy={epoch_metrics['val_accuracy']:.2%}")

Multiple Training Runs

Compare different configurations:

# Run multiple experiments
experiments = [
    {"name": "lr_high", "learning_rate": 0.01},
    {"name": "lr_medium", "learning_rate": 0.001},
    {"name": "lr_low", "learning_rate": 0.0001},
]

jobs = []
for exp in experiments:
    job = client.create_job(
        dataset_id=dataset.id,
        version_id=version.id,
        name=f"Experiment: {exp['name']}",
        config={
            "base_model": "efficientnet_b0",
            "epochs": 20,
            "batch_size": 32,
            **exp
        }
    )
    jobs.append(job)
    print(f"Started: {exp['name']} (job {job.id})")

# Wait for all to complete
for job in jobs:
    while client.get_job(job.id).status in ["pending", "queued", "running"]:
        time.sleep(60)

# Compare results
print("\nResults:")
print(f"{'Experiment':<20} {'Val Accuracy':<15} {'Val Loss':<15}")
print("-" * 50)
for job in jobs:
    job = client.get_job(job.id)
    print(f"{job.name:<20} "
          f"{job.best_metrics.get('val_accuracy', 0):<15.2%} "
          f"{job.best_metrics.get('val_loss', 0):<15.4f}")

Best Practices

  1. Watch validation loss - It’s your best indicator of generalization
  2. Use early stopping - Don’t waste time overtraining
  3. Save checkpoints - You may want to go back to an earlier epoch
  4. Log everything - You’ll thank yourself later when comparing runs
  5. Start with defaults - Tune hyperparameters one at a time
  6. Don’t trust training accuracy - Only validation metrics matter

Next Step

Once training completes, proceed to Evaluate Results to thoroughly assess your model’s performance.