Clinic 07

Checkpoint Roulette

Training finished. You have checkpoints from every epoch. The last epoch is not the best epoch, and the best training loss is not the best model. Pick the checkpoint you would actually ship.

Situation

15 Checkpoints, One Choice

Training loss keeps dropping across all 15 epochs. The validation curve does something different. Pick the checkpoint you would deploy and defend the choice with the held-out signal you used.

Your Job

Pick The Epoch

Choose the checkpoint, explain why it is not the last one, and say what signal you used to decide.

Bad Habit To Avoid

When The Headline Curve Lies

One curve says "ship the latest." A different curve disagrees. The disagreement is the clinic.

Situation

You trained a CNN for 15 epochs on a medical image classification task. The training log is in front of you.

The packet says:

  • training loss decreased steadily across all 15 epochs
  • validation loss improved until a point, then started rising
  • validation accuracy tells a slightly different story than validation loss
  • the task requires high sensitivity on the minority class
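As a reminder of what the metric names in the packet usually mean, here is a minimal sketch that assumes a binary task with the minority class treated as the positive class. The counts are hypothetical and chosen only for illustration; they are not taken from the packet.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of positive (here: minority) cases the model catches: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of negative (majority) cases correctly cleared: TN / (TN + FP)."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, NOT from the packet: 100 minority cases, 900 majority cases.
tp, fn = 60, 40
tn, fp = 880, 20

print(sensitivity(tp, fn))       # 0.6
print(accuracy(tp, tn, fp, fn))  # 0.94
```

With these made-up counts, accuracy looks strong while 40 of 100 minority cases are missed, which is why accuracy alone can hide the number the task cares about.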

Artifact Packet

Read this packet before you decide:

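| epoch | train loss | val loss | val accuracy | val sensitivity (minority) | val specificity |
|------:|-----------:|---------:|-------------:|---------------------------:|----------------:|
| 1 | 0.891 | 0.743 | 0.682 | 0.540 | 0.721 |
| 2 | 0.724 | 0.638 | 0.721 | 0.610 | 0.751 |
| 3 | 0.612 | 0.571 | 0.754 | 0.670 | 0.777 |
| 4 | 0.523 | 0.528 | 0.778 | 0.720 | 0.794 |
| 5 | 0.451 | 0.502 | 0.791 | 0.750 | 0.802 |
| 6 | 0.389 | 0.489 | 0.798 | 0.770 | 0.806 |
| 7 | 0.334 | 0.483 | 0.803 | 0.780 | 0.809 |
| 8 | 0.287 | 0.481 | 0.805 | 0.790 | 0.809 |
| 9 | 0.244 | 0.486 | 0.804 | 0.785 | 0.810 |
| 10 | 0.208 | 0.497 | 0.801 | 0.780 | 0.808 |
| 11 | 0.175 | 0.513 | 0.796 | 0.770 | 0.806 |
| 12 | 0.147 | 0.534 | 0.789 | 0.755 | 0.802 |
| 13 | 0.122 | 0.561 | 0.782 | 0.740 | 0.793 |
| 14 | 0.101 | 0.593 | 0.774 | 0.725 | 0.787 |
| 15 | 0.084 | 0.628 | 0.765 | 0.710 | 0.780 |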

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

  1. Which epoch checkpoint would you ship?
  2. What metric did you use to choose it, and why not training loss?
  3. What is the risk of shipping epoch 15?
  4. If sensitivity on the minority class is the priority, does that change your answer?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

  • it picks an epoch in the 7-8 range based on validation metrics, not training loss
  • it explains that training loss cannot guide checkpoint selection: it measures fit to the training data, so it keeps falling even while the model generalizes worse
  • it connects the rising validation loss after epoch 8 to overfitting
  • it notices that sensitivity peaks at epoch 8 (0.790) and uses that as the tiebreaker
  • it names the cost of shipping epoch 15: validation loss of 0.628 versus 0.481 at epoch 8, and minority-class sensitivity of 0.710 versus 0.790
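The reasoning above can be checked mechanically. The sketch below copies the packet into a list and finds the best epoch for each metric, plus the gap between validation and training loss. The field names and the `best_epoch` helper are illustrative choices, not part of the clinic.

```python
# Log copied from the artifact packet:
# (epoch, train_loss, val_loss, val_acc, val_sens, val_spec)
LOG = [
    (1, 0.891, 0.743, 0.682, 0.540, 0.721),
    (2, 0.724, 0.638, 0.721, 0.610, 0.751),
    (3, 0.612, 0.571, 0.754, 0.670, 0.777),
    (4, 0.523, 0.528, 0.778, 0.720, 0.794),
    (5, 0.451, 0.502, 0.791, 0.750, 0.802),
    (6, 0.389, 0.489, 0.798, 0.770, 0.806),
    (7, 0.334, 0.483, 0.803, 0.780, 0.809),
    (8, 0.287, 0.481, 0.805, 0.790, 0.809),
    (9, 0.244, 0.486, 0.804, 0.785, 0.810),
    (10, 0.208, 0.497, 0.801, 0.780, 0.808),
    (11, 0.175, 0.513, 0.796, 0.770, 0.806),
    (12, 0.147, 0.534, 0.789, 0.755, 0.802),
    (13, 0.122, 0.561, 0.782, 0.740, 0.793),
    (14, 0.101, 0.593, 0.774, 0.725, 0.787),
    (15, 0.084, 0.628, 0.765, 0.710, 0.780),
]
FIELDS = ("epoch", "train_loss", "val_loss", "val_acc", "val_sens", "val_spec")
rows = [dict(zip(FIELDS, r)) for r in LOG]


def best_epoch(metric: str, lower_is_better: bool) -> int:
    """Return the epoch with the best value of `metric`."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda r: r[metric])["epoch"]


best = {
    "train_loss": best_epoch("train_loss", lower_is_better=True),
    "val_loss": best_epoch("val_loss", lower_is_better=True),
    "val_acc": best_epoch("val_acc", lower_is_better=False),
    "val_sens": best_epoch("val_sens", lower_is_better=False),
    "val_spec": best_epoch("val_spec", lower_is_better=False),
}
print(best)
# {'train_loss': 15, 'val_loss': 8, 'val_acc': 8, 'val_sens': 8, 'val_spec': 9}

# Generalization gap: validation loss minus training loss, per epoch.
gaps = {r["epoch"]: round(r["val_loss"] - r["train_loss"], 3) for r in rows}
print(gaps[8], gaps[15])  # 0.194 0.544
```

Only training loss points at epoch 15. The gap between validation and training loss widens at every epoch in this log, which is the overfitting signature the note should name.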

Common Wrong Moves

  • picking epoch 15 because training loss is lowest
  • picking epoch 8 based on val loss alone without checking sensitivity
  • picking the highest val accuracy (epoch 8) without explaining why
  • ignoring the minority class sensitivity when the task explicitly requires it
  • picking epoch 5 or 6 as "safe" when epochs 7-8 are strictly better
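One way to make "strictly better" concrete is a dominance check: epoch A dominates epoch B if A is at least as good on every validation metric and better on at least one. The sketch below applies that test to the packet's validation columns. The `dominates` helper is a hypothetical illustration, not something the clinic provides.

```python
# Validation columns from the packet: epoch -> (val_loss, val_acc, val_sens, val_spec)
VAL = {
    1: (0.743, 0.682, 0.540, 0.721),
    2: (0.638, 0.721, 0.610, 0.751),
    3: (0.571, 0.754, 0.670, 0.777),
    4: (0.528, 0.778, 0.720, 0.794),
    5: (0.502, 0.791, 0.750, 0.802),
    6: (0.489, 0.798, 0.770, 0.806),
    7: (0.483, 0.803, 0.780, 0.809),
    8: (0.481, 0.805, 0.790, 0.809),
    9: (0.486, 0.804, 0.785, 0.810),
    10: (0.497, 0.801, 0.780, 0.808),
    11: (0.513, 0.796, 0.770, 0.806),
    12: (0.534, 0.789, 0.755, 0.802),
    13: (0.561, 0.782, 0.740, 0.793),
    14: (0.593, 0.774, 0.725, 0.787),
    15: (0.628, 0.765, 0.710, 0.780),
}


def dominates(a: int, b: int) -> bool:
    """True if epoch a is no worse than b on every validation metric and better on one."""
    loss_a, *higher_a = VAL[a]
    loss_b, *higher_b = VAL[b]
    # Lower loss is better, so negate it to make every entry "higher is better".
    score_a = (-loss_a, *higher_a)
    score_b = (-loss_b, *higher_b)
    pairs = list(zip(score_a, score_b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


print(dominates(7, 5), dominates(8, 6))  # True True

# Epochs that no other epoch dominates.
pareto = [e for e in VAL if not any(dominates(o, e) for o in VAL if o != e)]
print(pareto)  # [8, 9]
```

Epoch 9 survives only because its specificity is 0.001 higher than epoch 8's. With sensitivity as the stated priority, that does not outweigh epoch 8's lead on the other three metrics.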

Reference Reveal

Open only after you write the note.

The reference choice is:

  • `selected_epoch = 8`
  • `selection_metric = val_sensitivity`

Why:

  • epoch 8 has the best validation sensitivity (0.790), the best validation accuracy (0.805), and the lowest validation loss (0.481)
  • after epoch 8, validation loss, accuracy, and sensitivity all degrade while training loss keeps improving, which is classic overfitting; specificity ticks up once at epoch 9 (0.810) and then falls as well
  • for a medical classification task, sensitivity on the minority class is the metric that matters most
  • epoch 15 has the best training loss but is worse than epoch 8 on every validation metric, with sensitivity back down to 0.710

The practical lesson: checkpoint selection is a validation-metric decision, never a training-loss decision. The metric you select on should match the deployment priority.
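In a training loop, the same rule is usually applied online: after each epoch, compare the monitored validation metric to the best value so far and keep the checkpoint only on improvement. The following is a minimal sketch in plain Python. `BestCheckpointTracker` and its field names are hypothetical, and the actual save call is left as a comment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestCheckpointTracker:
    """Remembers which epoch had the best value of one monitored metric."""
    monitor: str
    mode: str = "max"  # "max" for sensitivity or accuracy, "min" for a loss
    best_value: Optional[float] = None
    best_epoch: Optional[int] = None

    def __post_init__(self) -> None:
        if self.mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")

    def update(self, epoch: int, metrics: dict) -> bool:
        """Record this epoch if the monitored metric improved; return True if it did."""
        value = metrics[self.monitor]
        improved = (
            self.best_value is None
            or (self.mode == "max" and value > self.best_value)
            or (self.mode == "min" and value < self.best_value)
        )
        if improved:
            self.best_value, self.best_epoch = value, epoch
            # In a real loop the checkpoint would be written here,
            # e.g. torch.save(model.state_dict(), path) in PyTorch.
        return improved


# Two columns copied from the packet, epochs 1 through 15.
val_sens = [0.540, 0.610, 0.670, 0.720, 0.750, 0.770, 0.780, 0.790,
            0.785, 0.780, 0.770, 0.755, 0.740, 0.725, 0.710]
train_loss = [0.891, 0.724, 0.612, 0.523, 0.451, 0.389, 0.334, 0.287,
              0.244, 0.208, 0.175, 0.147, 0.122, 0.101, 0.084]

by_sens = BestCheckpointTracker(monitor="val_sensitivity", mode="max")
by_train = BestCheckpointTracker(monitor="train_loss", mode="min")

for epoch, (sens, loss) in enumerate(zip(val_sens, train_loss), start=1):
    metrics = {"val_sensitivity": sens, "train_loss": loss}
    by_sens.update(epoch, metrics)
    by_train.update(epoch, metrics)

print(by_sens.best_epoch, by_train.best_epoch)  # 8 15
```

Monitoring training loss would overwrite the saved checkpoint every epoch and end on epoch 15. Monitoring validation sensitivity stops updating after epoch 8.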

What To Do Next

After this clinic:

  1. open PyTorch Training Loops
  2. run the matching training loop example
  3. use PyTorch Training Recipes for the full training-to-checkpoint workflow