Clinic 07
Checkpoint Roulette
Training finished. You have checkpoints from every epoch. The last epoch is not the best epoch, and the best training loss is not the best model. Pick the checkpoint you would actually ship.
Situation
15 Checkpoints, One Choice
Training loss keeps dropping across all 15 epochs. The validation curve does something different. Pick the checkpoint you would deploy and defend it with the held-out signal you used.
Your Job
Pick The Epoch
Choose the checkpoint, explain why it is not the last one, and say what signal you used to decide.
Bad Habit To Avoid
When The Headline Curve Lies
One curve says "ship the latest." A different curve disagrees. The disagreement is the clinic.
Situation
You trained a CNN for 15 epochs on a medical image classification task. The training log is in front of you.
The packet says:
- training loss decreased steadily across all 15 epochs
- validation loss improved until a point, then started rising
- validation accuracy tells a slightly different story than validation loss
- the task requires high sensitivity on the minority class
Artifact Packet
Read this packet before you decide:
| epoch | train loss | val loss | val accuracy | val sensitivity (minority) | val specificity |
|---|---|---|---|---|---|
| 1 | 0.891 | 0.743 | 0.682 | 0.540 | 0.721 |
| 2 | 0.724 | 0.638 | 0.721 | 0.610 | 0.751 |
| 3 | 0.612 | 0.571 | 0.754 | 0.670 | 0.777 |
| 4 | 0.523 | 0.528 | 0.778 | 0.720 | 0.794 |
| 5 | 0.451 | 0.502 | 0.791 | 0.750 | 0.802 |
| 6 | 0.389 | 0.489 | 0.798 | 0.770 | 0.806 |
| 7 | 0.334 | 0.483 | 0.803 | 0.780 | 0.809 |
| 8 | 0.287 | 0.481 | 0.805 | 0.790 | 0.809 |
| 9 | 0.244 | 0.486 | 0.804 | 0.785 | 0.810 |
| 10 | 0.208 | 0.497 | 0.801 | 0.780 | 0.808 |
| 11 | 0.175 | 0.513 | 0.796 | 0.770 | 0.806 |
| 12 | 0.147 | 0.534 | 0.789 | 0.755 | 0.802 |
| 13 | 0.122 | 0.561 | 0.782 | 0.740 | 0.793 |
| 14 | 0.101 | 0.593 | 0.774 | 0.725 | 0.787 |
| 15 | 0.084 | 0.628 | 0.765 | 0.710 | 0.780 |
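Reading the packet column by column can be sketched in a few lines of Python. This is a minimal illustration, not part of the clinic tooling: the names `ROWS`, `COLUMNS`, `LOWER_IS_BETTER`, and `best_epoch` are assumptions made for this example, and the numbers are copied from the table above.

```python
# Rows copied from the packet:
# (epoch, train_loss, val_loss, val_acc, val_sens, val_spec)
ROWS = [
    (1, 0.891, 0.743, 0.682, 0.540, 0.721),
    (2, 0.724, 0.638, 0.721, 0.610, 0.751),
    (3, 0.612, 0.571, 0.754, 0.670, 0.777),
    (4, 0.523, 0.528, 0.778, 0.720, 0.794),
    (5, 0.451, 0.502, 0.791, 0.750, 0.802),
    (6, 0.389, 0.489, 0.798, 0.770, 0.806),
    (7, 0.334, 0.483, 0.803, 0.780, 0.809),
    (8, 0.287, 0.481, 0.805, 0.790, 0.809),
    (9, 0.244, 0.486, 0.804, 0.785, 0.810),
    (10, 0.208, 0.497, 0.801, 0.780, 0.808),
    (11, 0.175, 0.513, 0.796, 0.770, 0.806),
    (12, 0.147, 0.534, 0.789, 0.755, 0.802),
    (13, 0.122, 0.561, 0.782, 0.740, 0.793),
    (14, 0.101, 0.593, 0.774, 0.725, 0.787),
    (15, 0.084, 0.628, 0.765, 0.710, 0.780),
]
COLUMNS = ("epoch", "train_loss", "val_loss", "val_acc", "val_sens", "val_spec")
LOWER_IS_BETTER = {"train_loss", "val_loss"}


def best_epoch(metric):
    """Return the epoch whose value for `metric` is best (lowest loss, highest score)."""
    idx = COLUMNS.index(metric)
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(ROWS, key=lambda row: row[idx])[0]


# Best epoch according to each column taken on its own.
best = {metric: best_epoch(metric) for metric in COLUMNS[1:]}
for metric, epoch in best.items():
    print(f"{metric:>10}: epoch {epoch}")
```

Only `train_loss` points at epoch 15. Validation loss, accuracy, and sensitivity all point at epoch 8, and specificity points at epoch 9 by a margin of 0.001.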
Decision Prompt
Write the note before you open the reveal.
Your note should answer:
- Which epoch checkpoint would you ship?
- What metric did you use to choose it, and why not training loss?
- What is the risk of shipping epoch 15?
- If sensitivity on the minority class is the priority, does that change your answer?
Keep the note short. Four to six sentences is enough.
Strong Reasoning Looks Like
- it picks an epoch in the 7-8 range based on validation metrics, not training loss
- it explains that training loss cannot select a checkpoint: it falls at every epoch here and says nothing about performance on unseen data
- it connects the rising validation loss after epoch 8 to overfitting
- it notices that sensitivity peaks at epoch 8 (0.790) and uses that as the tiebreaker
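- it names the cost of shipping epoch 15: validation loss has climbed back to 0.628 and minority-class sensitivity has fallen to 0.710, both worse than any checkpoint from epoch 4 onward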
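The link between rising validation loss and overfitting can be made concrete with a patience-based early-stopping rule. The sketch below is an assumption-laden illustration: `early_stop`, `patience=3`, and the gap calculation are choices made for this example, not something the clinic prescribes. The loss values are copied from the packet.

```python
# Per-epoch losses copied from the packet (index 0 is epoch 1).
TRAIN_LOSS = [0.891, 0.724, 0.612, 0.523, 0.451, 0.389, 0.334, 0.287,
              0.244, 0.208, 0.175, 0.147, 0.122, 0.101, 0.084]
VAL_LOSS = [0.743, 0.638, 0.571, 0.528, 0.502, 0.489, 0.483, 0.481,
            0.486, 0.497, 0.513, 0.534, 0.561, 0.593, 0.628]


def early_stop(val_losses, patience):
    """Return (best_epoch, stop_epoch).

    best_epoch is the epoch with the lowest validation loss seen so far;
    stop_epoch is the first epoch at which `patience` epochs have passed
    without improvement, or None if that never happens.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Generalization gap per epoch: validation loss minus training loss.
gaps = [round(v - t, 3) for t, v in zip(TRAIN_LOSS, VAL_LOSS)]

best_epoch, stop_epoch = early_stop(VAL_LOSS, patience=3)
print(f"keep epoch {best_epoch}, stop at epoch {stop_epoch}")
print(f"gap at epoch {best_epoch}: {gaps[best_epoch - 1]}, gap at epoch 15: {gaps[-1]}")
```

With this patience the rule keeps epoch 8 and would have stopped training at epoch 11. The gap widens at every epoch, including the ones before epoch 8, so the gap alone does not pick a checkpoint; the turn in validation loss does.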
Common Wrong Moves
- picking epoch 15 because training loss is lowest
- picking epoch 8 based on val loss alone without checking sensitivity
- picking the highest val accuracy (epoch 8) without explaining why accuracy is the right signal for a task that prioritizes minority-class sensitivity
- ignoring the minority class sensitivity when the task explicitly requires it
- picking epoch 5 or 6 as "safe" when epochs 7-8 are strictly better
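The claim that epochs 7-8 are strictly better than 5-6 can be checked with a simple dominance test. This is a sketch for illustration only: `VAL` and `dominates` are hypothetical names, and the values are the validation columns for epochs 5 through 8 from the packet.

```python
# (val_loss, val_acc, val_sens, val_spec) for epochs 5-8, copied from the packet.
VAL = {
    5: (0.502, 0.791, 0.750, 0.802),
    6: (0.489, 0.798, 0.770, 0.806),
    7: (0.483, 0.803, 0.780, 0.809),
    8: (0.481, 0.805, 0.790, 0.809),
}


def dominates(a, b):
    """True if epoch a is at least as good as epoch b on every validation
    metric and strictly better on at least one."""
    loss_a, *scores_a = VAL[a]
    loss_b, *scores_b = VAL[b]
    # Negate the loss so that "higher is better" holds for every pair.
    pairs = [(-loss_a, -loss_b)] + list(zip(scores_a, scores_b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


# For each epoch, list the epochs that dominate it.
dominated_by = {b: [a for a in VAL if dominates(a, b)] for b in VAL}
for epoch, better in dominated_by.items():
    print(f"epoch {epoch} is dominated by {better}")
```

Within this range, every later epoch dominates every earlier one, and nothing dominates epoch 8. An early checkpoint chosen as "safe" therefore gives up performance on every validation metric without buying anything back.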
Reference Reveal
Open only after you write the note
The reference choice is:
- `selected_epoch = 8`
- `selection_metric = val_sensitivity`
Why:
- epoch 8 has the best validation sensitivity (0.790), the best validation accuracy (0.805), and the best validation loss (0.481)
- after epoch 8, validation loss, accuracy, and sensitivity all degrade while training loss keeps improving, which is classic overfitting; specificity ticks up once at epoch 9 (0.810) and then declines as well
- for a medical classification task, sensitivity on the minority class is the metric that matters most
- epoch 15 has the best training loss, yet it is worse on every validation metric than any checkpoint from epoch 4 onward
The practical lesson: checkpoint selection is a validation-metric decision, never a training-loss decision. The metric you select on should match the deployment priority.
What To Do Next
After this clinic:
- open PyTorch Training Loops
- run the matching training loop example
- use PyTorch Training Recipes for the full training-to-checkpoint workflow