Deep Learning and Checkpoints

This pack covers training dynamics, overfitting signals, transfer-learning choices, and checkpoint discipline. The questions are adapted from publicly available official course materials and rewritten in academy form.

Use this cold. Read each question, write your own answer in a notebook or scratch file, then expand the reveal and compare.

QD01. Learning Rate Too High

Question: During training, loss jumps up and down violently and sometimes becomes nan after a few updates. What is the first optimizer diagnosis?

Reveal Academy Answer
  • The learning rate is probably too high.
  • Large steps can overshoot repeatedly and destabilize training.
  • The first corrective move is to reduce the learning rate and confirm whether the loss curve becomes smoother.
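
A minimal sketch of the overshoot effect, assuming a toy one-dimensional loss f(w) = w**2 that is not part of the original pack (the function name and step count are illustrative). On this loss, plain gradient descent multiplies w by (1 - 2*lr) at every step, so any learning rate above 1.0 makes |w| grow instead of shrink.

```python
def gradient_descent(lr, steps=20, w=1.0):
    """Run plain gradient descent on f(w) = w**2 and return the loss after each step."""
    losses = []
    for _ in range(steps):
        grad = 2.0 * w       # derivative of w**2
        w = w - lr * grad    # standard gradient step
        losses.append(w * w)
    return losses

# Toy settings (assumptions, chosen only to show the two regimes).
stable = gradient_descent(lr=0.1)    # w is scaled by 0.8 each step: loss shrinks
unstable = gradient_descent(lr=1.1)  # w is scaled by -1.2 each step: sign flips, loss grows
```

In a real network the same runaway growth can overflow to inf or nan, which is why lowering the learning rate and re-checking the loss curve is a cheap first test.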

Common wrong answer: "the data must be bad; clean it first." Bad data does not usually produce nan within a few updates; a learning rate that is too high often does.

Why it matters: Many "mysterious" training failures are just step-size failures.

Source family: Stanford CS231n schedule and Stanford CS229 optimization and regularization themes

QD02. Low Train Loss, High Validation Loss

Question: Training loss keeps improving, validation loss stays much worse, and the gap widens. What is the first diagnosis?

Reveal Academy Answer
  • High variance or overfitting.
  • The model is fitting the training distribution more aggressively than the held-out evidence supports.
  • First responses include stronger regularization, earlier stopping, more data, or less model capacity.
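
One way to turn the "gap widens" readout into a check is sketched below. It assumes you log one training loss and one validation loss per epoch; the function name and the window of 3 epochs are illustrative choices, not a standard rule.

```python
def gap_is_widening(train_losses, val_losses, window=3):
    """Return True if the (validation - training) loss gap grew on each of the
    last `window` epoch-to-epoch steps while training loss kept falling."""
    if len(train_losses) != len(val_losses):
        raise ValueError("need one training and one validation loss per epoch")
    if len(train_losses) < window + 1:
        return False  # not enough history to judge
    gaps = [v - t for t, v in zip(train_losses, val_losses)]
    recent_gaps = gaps[-(window + 1):]
    recent_train = train_losses[-(window + 1):]
    gap_grows = all(b > a for a, b in zip(recent_gaps, recent_gaps[1:]))
    train_falls = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    return gap_grows and train_falls

# Made-up curves for illustration only.
overfit = gap_is_widening([1.0, 0.8, 0.6, 0.4, 0.3], [1.1, 1.0, 1.0, 1.05, 1.1])
healthy = gap_is_widening([1.0, 0.8, 0.6, 0.5], [1.2, 0.9, 0.65, 0.5])
```

A True result is a prompt to try the responses listed above, not proof on its own; noisy validation curves may need a longer window.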

Common wrong answer: "add more capacity." More capacity usually makes the gap worse when the problem is already variance.

Why it matters: The train/validation gap is one of the core deep-learning readouts.

Source family: Stanford CS229 regularization/model-selection notes and UC Berkeley CS189 bias-variance themes

QD03. Which Checkpoint Do You Keep?

Question: Validation accuracy peaks at epoch 7, then drifts downward, but training accuracy keeps climbing until epoch 25. Which checkpoint should you keep?

Reveal Academy Answer
  • Keep the best-validation checkpoint.
  • Model selection should follow the held-out metric tied to deployment, not the final epoch by default.
  • The later epochs are evidence of additional fit to the training set, not better generalization.
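
A minimal sketch of the selection rule, assuming you have recorded one validation score per epoch. The accuracy values below are invented to mirror the question (peak at epoch 7); the function name is illustrative.

```python
def best_checkpoint(val_scores, higher_is_better=True):
    """Return (epoch, score) of the best validation checkpoint.
    Epochs are 1-indexed; ties go to the earliest epoch."""
    if not val_scores:
        raise ValueError("val_scores is empty")
    pick = max if higher_is_better else min
    best_score = pick(val_scores)
    return val_scores.index(best_score) + 1, best_score

# Hypothetical validation accuracy per epoch, for illustration only.
val_acc = [0.61, 0.68, 0.72, 0.75, 0.77, 0.78, 0.79, 0.785, 0.78, 0.77]
epoch, score = best_checkpoint(val_acc)  # epoch 7, score 0.79
```

In practice this means saving a checkpoint whenever the validation metric improves, so the epoch-7 weights still exist when training ends at epoch 25.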

Common wrong answer: "the final epoch — it has the lowest training loss." Training loss is not the selection signal; held-out metric is.

Why it matters: Checkpoint choice is part of the evaluation protocol.

Source family: Stanford CS231n assignments and Stanford CS229 regularization/model-selection notes

QD04. Frozen Head, Partial Unfreeze, Or Full Fine-Tune?

Question: You have 2,000 labeled images and a good pretrained backbone from a related visual domain. What is the safest first sequence: full fine-tune immediately, or start smaller?

Reveal Academy Answer
  • Start smaller.
  • The safest first sequence is usually: train a new head on frozen features, then unfreeze selectively if validation evidence supports it.
  • Full fine-tuning can help, but it raises optimization and overfitting risk.
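
The "escalate only when earned" sequence can be sketched as a small loop. Everything named here is hypothetical: the stage labels, the min_gain threshold, and the train_and_validate callable, which stands in for whatever training code you actually run at each stage.

```python
# Hypothetical stage labels, ordered from most to least conservative.
STAGES = ["frozen_backbone_new_head", "unfreeze_last_block", "full_fine_tune"]

def choose_stage(train_and_validate, min_gain=0.005):
    """Walk through STAGES and keep a stage only if validation improves by min_gain.

    train_and_validate(stage) is a caller-supplied function that trains under
    that stage and returns a validation score (higher is better).
    """
    best_stage = STAGES[0]
    best_score = train_and_validate(best_stage)
    for stage in STAGES[1:]:
        score = train_and_validate(stage)
        if score < best_score + min_gain:
            break  # escalation was not earned; stop here
        best_stage, best_score = stage, score
    return best_stage, best_score

# Stand-in scores instead of real training runs (illustration only).
fake_scores = {
    "frozen_backbone_new_head": 0.80,
    "unfreeze_last_block": 0.83,
    "full_fine_tune": 0.825,
}
stage, score = choose_stage(fake_scores.get)  # ("unfreeze_last_block", 0.83)
```

With these stand-in numbers the loop stops at the partial unfreeze, because full fine-tuning did not beat it on validation.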

Common wrong answer: "always full fine-tune — it has the highest ceiling." It also has the highest variance and the highest forget-the-pretraining risk on small data.

Why it matters: Transfer learning works best when escalation is earned, not assumed.

Source family: Stanford CS231n assignments and Stanford CS229 model-selection notes

QD05. Weight Decay Or More Capacity?

Question: An MLP already fits the training set extremely well but generalizes poorly. Should the next move usually be a larger network or stronger regularization?

Reveal Academy Answer
  • Usually stronger regularization.
  • If training fit is already strong, more capacity often makes variance worse.
  • Weight decay, dropout, earlier stopping, or data augmentation are more sensible first responses.

Common wrong answer: "add a layer — bigger is better." Adding capacity to a model that is already overfitting widens the train/val gap.

Verify before regularizing: confirm the failure is variance (training accuracy high, validation accuracy clearly lower) rather than bias (training and validation accuracy both low). Only then is regularization the right move.
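
The check above can be written as a small rule of thumb. The target_acc and max_gap thresholds are assumptions you would set per task; nothing in the source materials fixes their values.

```python
def diagnose(train_acc, val_acc, target_acc, max_gap=0.05):
    """Classify a run from training and validation accuracy (all values in [0, 1])."""
    if train_acc < target_acc:
        return "bias"      # the model does not fit even the training set well
    if train_acc - val_acc > max_gap:
        return "variance"  # fits the training set, but the fit does not carry over
    return "ok"

# Illustrative numbers only.
print(diagnose(train_acc=0.99, val_acc=0.80, target_acc=0.90))  # variance
print(diagnose(train_acc=0.70, val_acc=0.68, target_acc=0.90))  # bias
```

Only the "variance" outcome points toward weight decay, dropout, earlier stopping, or augmentation; a "bias" outcome calls for a different fix.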

Why it matters: The next move should respond to the failure mode, not just add power.

Source family: Stanford CS231n schedule and Stanford CS229 regularization/model-selection notes

QD06. Small-Batch Fine-Tuning And BatchNorm

Question: You are fine-tuning a pretrained vision network with very small batches. Is it safer to aggressively relearn every BatchNorm statistic immediately, or to be conservative?

Reveal Academy Answer
  • Be conservative first.
  • Very small batches produce noisy BatchNorm estimates that can poison the running stats inherited from pretraining.
  • Concrete safe options: keep BN modules in eval() mode so they use the pretrained running stats; optionally also freeze the BN affine parameters with requires_grad=False; or swap BN for GroupNorm or LayerNorm, which do not depend on batch size. Do not use track_running_stats=False as a freeze switch, because that makes BN normalize with batch statistics. Then test the rest of the adaptation plan under a fixed validation rule.
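
To see why small batches are risky, the sketch below simulates only the running-mean update that BatchNorm-style layers apply in train mode: running = (1 - momentum) * running + momentum * batch_mean. It is plain Python rather than a framework call, and the momentum of 0.1, the unit-Gaussian activations, and the function name are all assumptions made for illustration.

```python
import random
import statistics

def running_mean_spread(batch_size, steps=2000, momentum=0.1, seed=0):
    """Simulate a BatchNorm-style running mean and return how much it wobbles
    (population standard deviation of the running value over all steps)."""
    rng = random.Random(seed)
    running = 0.0  # assume the inherited running mean starts exactly right (true mean is 0)
    trace = []
    for _ in range(steps):
        batch = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        batch_mean = sum(batch) / batch_size
        running = (1.0 - momentum) * running + momentum * batch_mean
        trace.append(running)
    return statistics.pstdev(trace)

small = running_mean_spread(batch_size=4)
large = running_mean_spread(batch_size=256)
print(f"wobble with batch size 4:   {small:.4f}")
print(f"wobble with batch size 256: {large:.4f}")
```

Even though the data distribution never changes in this simulation, the batch-size-4 running mean fluctuates several times more than the batch-size-256 one. Keeping BN in eval mode avoids this update entirely.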

Common wrong answer: "leave BN in train mode and let it adapt." On batches of size 4–8 the running statistics can drift unstably, and the model may degrade faster than it adapts.

Why it matters: Some fine-tuning failures come from unstable normalization rather than bad representation.

Source family: Stanford CS231n assignments

QD07. Better Final Train Loss, Worse Validation Curve

Question: Run A ends with lower training loss than Run B, but Run B's best checkpoint has the highest validation metric of any checkpoint in either run. Which run is better?

Reveal Academy Answer
  • Run B.
  • The best deployment candidate is the run that wins on the held-out metric you actually care about.
  • Final training loss is secondary if it does not translate into held-out performance.
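
A short sketch of comparing runs by their best held-out checkpoint. The record layout (a "val_acc" list plus a "final_train_loss" field per run) and all numbers are hypothetical; the point is that the training-loss field is never consulted.

```python
def pick_run(runs):
    """Return (run_name, best_epoch, best_val) for the run whose best validation
    checkpoint is highest. Epochs are 1-indexed; final_train_loss is ignored."""
    best = None
    for name, record in runs.items():
        scores = record["val_acc"]
        score = max(scores)
        epoch = scores.index(score) + 1
        if best is None or score > best[2]:
            best = (name, epoch, score)
    return best

# Invented logs for illustration only.
runs = {
    "A": {"final_train_loss": 0.05, "val_acc": [0.70, 0.74, 0.73]},
    "B": {"final_train_loss": 0.12, "val_acc": [0.71, 0.78, 0.76]},
}
winner = pick_run(runs)  # ("B", 2, 0.78)
```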

Common wrong answer: "Run A converged better." Convergence to a lower training loss is not generalization; only the held-out metric is.

Why it matters: This is the same discipline as classical model selection, just inside a deeper training loop.

Source family: Stanford CS229 model-selection notes and Stanford CS231n assignments

QD08. Augmentation Helped Training, Hurt Deployment Slice

Question: A stronger augmentation recipe improves average validation accuracy slightly but hurts performance on a critical real deployment slice. Should it stay by default?

Reveal Academy Answer
  • No.
  • The average gain is relevant, but the deployment slice can still dominate the choice.
  • Keep the split and slice definition fixed, then decide based on the real operating objective rather than average performance alone.
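
One way to encode "the slice can dominate the choice" is a gate that checks the critical slice before the average. This is a sketch under assumptions: predictions are stored as (slice_name, is_correct) pairs, the slice names "night" and "day" are made up, and max_slice_drop is a tolerance you would set from the real operating objective.

```python
def accuracy(records):
    """records: list of (slice_name, is_correct) pairs."""
    return sum(ok for _, ok in records) / len(records)

def slice_accuracy(records, slice_name):
    """Accuracy restricted to one named slice."""
    return accuracy([(s, ok) for s, ok in records if s == slice_name])

def keep_new_recipe(baseline, candidate, critical_slice, max_slice_drop=0.0):
    """Accept the candidate only if it does not regress on the critical slice
    (beyond max_slice_drop) and does not lower the overall average."""
    slice_ok = (slice_accuracy(candidate, critical_slice)
                >= slice_accuracy(baseline, critical_slice) - max_slice_drop)
    average_ok = accuracy(candidate) >= accuracy(baseline)
    return slice_ok and average_ok

# Invented results on the same fixed validation split (illustration only).
baseline = ([("night", True)] * 8 + [("night", False)] * 2
            + [("day", True)] * 80 + [("day", False)] * 10)   # avg 0.88, night 0.80
candidate = ([("night", True)] * 6 + [("night", False)] * 4
             + [("day", True)] * 84 + [("day", False)] * 6)   # avg 0.90, night 0.60
decision = keep_new_recipe(baseline, candidate, critical_slice="night")  # False
```

Here the candidate wins on average but is rejected because it drops from 0.80 to 0.60 on the critical slice.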

Common wrong answer: "average accuracy went up, ship it." The deployment slice is the actual cost function; the average is a proxy.

Why it matters: A training trick is only useful if it helps the deployment problem you actually have.

Source family: Stanford CS231n schedule and Stanford CS229 model-selection notes

What To Do After This Pack

If this pack exposed a gap, route back into: