Clinic 07

Checkpoint Roulette

Training finished. You have checkpoints from every epoch. The last epoch is not the best epoch, and the best training loss is not the best model. Pick the checkpoint you would actually ship.


Situation

15 Checkpoints, One Choice

Training loss keeps dropping across all 15 epochs. The validation curve does something different. Pick the checkpoint you would deploy and defend it with the held-out signal you used.

Your Job

Pick The Epoch

Choose the checkpoint, explain why it is not the last one, and say what signal you used to decide.

Bad Habit To Avoid

When The Headline Curve Lies

One curve says "ship the latest." A different curve disagrees. The disagreement is the clinic.

Situation

You trained a CNN for 15 epochs on a medical image classification task. The training log is in front of you.

The packet says:

training loss decreased steadily across all 15 epochs
validation loss improved until a point, then started rising
validation accuracy tells a slightly different story than validation loss
the task requires high sensitivity on the minority class
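For reference, the sensitivity and specificity columns in the packet follow the usual confusion-matrix definitions. The sketch below assumes the minority class is treated as the positive class; the counts are made up for illustration, since the packet does not include raw counts.

```python
def sensitivity(tp, fn):
    """Share of actual positives the model catches: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual negatives the model correctly clears: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, not taken from the packet.
print(f"sensitivity = {sensitivity(tp=45, fn=15):.3f}")
print(f"specificity = {specificity(tn=160, fp=40):.3f}")
```

A missed minority-class case is a false negative, so it lowers sensitivity and leaves specificity untouched. That is why the two columns can move differently from accuracy.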

Artifact Packet

Read this packet before you decide:

epoch	train loss	val loss	val accuracy	val sensitivity (minority)	val specificity
1	0.891	0.743	0.682	0.540	0.721
2	0.724	0.638	0.721	0.610	0.751
3	0.612	0.571	0.754	0.670	0.777
4	0.523	0.528	0.778	0.720	0.794
5	0.451	0.502	0.791	0.750	0.802
6	0.389	0.489	0.798	0.770	0.806
7	0.334	0.483	0.803	0.780	0.809
8	0.287	0.481	0.805	0.790	0.809
9	0.244	0.486	0.804	0.785	0.810
10	0.208	0.497	0.801	0.780	0.808
11	0.175	0.513	0.796	0.770	0.806
12	0.147	0.534	0.789	0.755	0.802
13	0.122	0.561	0.782	0.740	0.793
14	0.101	0.593	0.774	0.725	0.787
15	0.084	0.628	0.765	0.710	0.780
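One way to read the packet without eyeballing it is to ask, for each column, which epoch is best on that column alone. The sketch below copies the table into Python; the names `ROWS`, `LOWER_IS_BETTER`, and `best_epoch` are illustrative, not part of any library. Run it after you have written your note.

```python
# Columns and rows copied from the packet above.
COLUMNS = ("epoch", "train_loss", "val_loss", "val_acc", "val_sens", "val_spec")
ROWS = [
    (1, 0.891, 0.743, 0.682, 0.540, 0.721),
    (2, 0.724, 0.638, 0.721, 0.610, 0.751),
    (3, 0.612, 0.571, 0.754, 0.670, 0.777),
    (4, 0.523, 0.528, 0.778, 0.720, 0.794),
    (5, 0.451, 0.502, 0.791, 0.750, 0.802),
    (6, 0.389, 0.489, 0.798, 0.770, 0.806),
    (7, 0.334, 0.483, 0.803, 0.780, 0.809),
    (8, 0.287, 0.481, 0.805, 0.790, 0.809),
    (9, 0.244, 0.486, 0.804, 0.785, 0.810),
    (10, 0.208, 0.497, 0.801, 0.780, 0.808),
    (11, 0.175, 0.513, 0.796, 0.770, 0.806),
    (12, 0.147, 0.534, 0.789, 0.755, 0.802),
    (13, 0.122, 0.561, 0.782, 0.740, 0.793),
    (14, 0.101, 0.593, 0.774, 0.725, 0.787),
    (15, 0.084, 0.628, 0.765, 0.710, 0.780),
]

# Losses are minimized; accuracy, sensitivity, and specificity are maximized.
LOWER_IS_BETTER = {
    "train_loss": True,
    "val_loss": True,
    "val_acc": False,
    "val_sens": False,
    "val_spec": False,
}


def best_epoch(metric):
    """Return the epoch number that is best on a single metric."""
    idx = COLUMNS.index(metric)
    pick = min if LOWER_IS_BETTER[metric] else max
    return pick(ROWS, key=lambda row: row[idx])[0]


best_by_metric = {metric: best_epoch(metric) for metric in LOWER_IS_BETTER}
for metric, epoch in best_by_metric.items():
    print(f"{metric:>10}: best at epoch {epoch}")
```

If the columns disagree about the best epoch, that disagreement is what your note has to resolve.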

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

Which epoch checkpoint would you ship?
What metric did you use to choose it, and why not training loss?
What is the risk of shipping epoch 15?
If sensitivity on the minority class is the priority, does that change your answer?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

it picks an epoch in the 7-8 range based on validation metrics, not training loss
it explains that training loss is not a usable selection signal here because it keeps falling even after held-out performance starts to degrade
it connects the rising validation loss after epoch 8 to overfitting
it notices that sensitivity peaks at epoch 8 (0.790) and uses that as the tiebreaker
it names the cost of shipping epoch 15: validation loss back up to 0.628 (from 0.481 at epoch 8) and minority-class sensitivity down to 0.710 (from 0.790)
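The overfitting argument can be checked with a few lines of arithmetic. This sketch computes the gap between validation and training loss per epoch and finds the first epoch where validation loss goes up; the variable names are illustrative and the values are copied from the packet.

```python
# Loss columns from the packet; index 0 is epoch 1.
train_loss = [0.891, 0.724, 0.612, 0.523, 0.451, 0.389, 0.334, 0.287,
              0.244, 0.208, 0.175, 0.147, 0.122, 0.101, 0.084]
val_loss = [0.743, 0.638, 0.571, 0.528, 0.502, 0.489, 0.483, 0.481,
            0.486, 0.497, 0.513, 0.534, 0.561, 0.593, 0.628]

# Generalization gap per epoch: validation loss minus training loss.
gap = [round(v - t, 3) for t, v in zip(train_loss, val_loss)]

# First epoch whose validation loss is higher than the previous epoch's.
first_rise = next(
    epoch
    for epoch in range(2, len(val_loss) + 1)
    if val_loss[epoch - 1] > val_loss[epoch - 2]
)

print(f"validation loss first rises at epoch {first_rise}")
print(f"gap at epoch 8:  {gap[7]:.3f}")
print(f"gap at epoch 15: {gap[14]:.3f}")
```

The gap grows in every epoch, but a growing gap alone does not prove harm. The stronger signal is that validation loss itself turns upward after epoch 8 while training loss keeps falling.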

Common Wrong Moves

picking epoch 15 because training loss is lowest
picking epoch 8 based on val loss alone without checking sensitivity
picking the highest val accuracy (epoch 8) without explaining why
ignoring the minority class sensitivity when the task explicitly requires it
picking epoch 5 or 6 as "safe" when epochs 7-8 are strictly better
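The claim that an earlier "safe" epoch is strictly worse can be tested as a dominance check: one checkpoint dominates another if it is at least as good on every validation metric and strictly better on at least one. The sketch below is one way to write that; `VAL`, `dominates`, and `pareto` are illustrative names.

```python
# epoch -> (val_loss, val_acc, val_sens, val_spec), copied from the packet.
VAL = {
    1: (0.743, 0.682, 0.540, 0.721),
    2: (0.638, 0.721, 0.610, 0.751),
    3: (0.571, 0.754, 0.670, 0.777),
    4: (0.528, 0.778, 0.720, 0.794),
    5: (0.502, 0.791, 0.750, 0.802),
    6: (0.489, 0.798, 0.770, 0.806),
    7: (0.483, 0.803, 0.780, 0.809),
    8: (0.481, 0.805, 0.790, 0.809),
    9: (0.486, 0.804, 0.785, 0.810),
    10: (0.497, 0.801, 0.780, 0.808),
    11: (0.513, 0.796, 0.770, 0.806),
    12: (0.534, 0.789, 0.755, 0.802),
    13: (0.561, 0.782, 0.740, 0.793),
    14: (0.593, 0.774, 0.725, 0.787),
    15: (0.628, 0.765, 0.710, 0.780),
}


def scores(epoch):
    """Return the metrics with loss negated, so higher is always better."""
    loss, acc, sens, spec = VAL[epoch]
    return (-loss, acc, sens, spec)


def dominates(a, b):
    """True if epoch a is no worse than b everywhere and better somewhere."""
    sa, sb = scores(a), scores(b)
    return all(x >= y for x, y in zip(sa, sb)) and any(x > y for x, y in zip(sa, sb))


# Epochs that no other epoch dominates.
pareto = [e for e in VAL if not any(dominates(o, e) for o in VAL if o != e)]

print("7 dominates 5 and 6:", dominates(7, 5) and dominates(7, 6))
print("8 dominates 5 and 6:", dominates(8, 5) and dominates(8, 6))
print("non-dominated epochs:", pareto)
```

On this packet the non-dominated set is small, so the final choice comes down to which metric the deployment cares about most.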

Reference Reveal

Open only after you write the note

The reference choice is:

- `selected_epoch = 8`
- `selection_metric = val_sensitivity`

Why:

- epoch 8 has the best validation sensitivity (0.790), the best validation accuracy (0.805), and the lowest validation loss (0.481)
- after epoch 8, validation loss, accuracy, and sensitivity all degrade while training loss keeps improving, which is classic overfitting; only specificity edges up once more (0.809 to 0.810 at epoch 9) before it falls too
- for this medical classification task, sensitivity on the minority class is the metric that matters most
- epoch 15 has the worst validation numbers of any checkpoint from epoch 4 onward, despite the best training loss

The practical lesson: checkpoint selection is a validation-metric decision, never a training-loss decision. The metric you select on should match the deployment priority.
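In a training loop, the same decision is usually made on the fly by remembering the best value of the chosen validation metric and saving a checkpoint only when it improves. The sketch below is a minimal, framework-free version; `BestCheckpointTracker` is a hypothetical helper, not a PyTorch class, and the actual save call is left as a comment.

```python
class BestCheckpointTracker:
    """Remember which epoch scored best on one chosen validation metric."""

    def __init__(self, mode="max"):
        if mode not in ("max", "min"):
            raise ValueError("mode must be 'max' or 'min'")
        self.mode = mode
        self.best_epoch = None
        self.best_value = None

    def update(self, epoch, value):
        """Record the value; return True if it beats the best seen so far."""
        improved = (
            self.best_value is None
            or (self.mode == "max" and value > self.best_value)
            or (self.mode == "min" and value < self.best_value)
        )
        if improved:
            self.best_epoch, self.best_value = epoch, value
            # In a real loop the checkpoint would be written here, for
            # example with torch.save(model.state_dict(), path).
        return improved


# Minority-class validation sensitivity per epoch, copied from the packet.
val_sens = [0.540, 0.610, 0.670, 0.720, 0.750, 0.770, 0.780, 0.790,
            0.785, 0.780, 0.770, 0.755, 0.740, 0.725, 0.710]

tracker = BestCheckpointTracker(mode="max")
for epoch, sens in enumerate(val_sens, start=1):
    tracker.update(epoch, sens)

print(f"ship epoch {tracker.best_epoch} (val sensitivity {tracker.best_value:.3f})")
```

The comparison is strict, so a later epoch that only ties the best value does not replace the earlier checkpoint. Switching the tracker to `mode="min"` and feeding it validation loss gives the loss-based selection instead.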

What To Do Next

After this clinic:

open PyTorch Training Loops
run the matching training loop example
use PyTorch Training Recipes for the full training-to-checkpoint workflow