Clinic 11
Augmentation Choice
Four augmentation recipes, four validation curves, one weak slice telling a different story. Pick the augmentation you would ship and defend why.
Situation
Four Recipes, One Choice
You tried four augmentation stacks for an image classifier. Validation scores are close. The weak slice is not.
Your Job
Pick For The Weak Slice
Pick the augmentation recipe you would ship, decide whether to iterate, and name what evidence would change your mind.
Bad Habit To Avoid
More Augmentation = Better
If the reasoning is "stronger augmentation always wins," the clinic failed.
Situation
You are training an image classifier on a medical imaging task. The dataset has:
- ~8,000 training images across 5 classes
- a rare 4th class ("late-stage pathology") with ~400 training images
- a validation split that preserves class ratios
- a known weak slice: images acquired on one of the two scanner types, where the pathology appears dimmer
You have four augmentation recipes and matching curves:
Artifact Packet
Read the packet before you decide:
| recipe | validation macro-F1 | class-4 recall | dim-scanner slice recall | training stability | notes |
|---|---|---|---|---|---|
| `no_augmentation` | 0.782 | 0.61 | 0.44 | smooth | overfits after epoch 15 |
| `standard_flips_crops` | 0.798 | 0.68 | 0.52 | smooth | standard ImageNet-style recipe |
| `heavy_color_and_cutout` | 0.810 | 0.66 | 0.49 | slightly noisy | strong color jitter + CutMix |
| `shift_simulating_rotations_blur` | 0.803 | 0.72 | 0.74 | smooth | rotations + blur match the acquisition shift |
The tempting move is obvious: heavy_color_and_cutout wins the average macro-F1.
The harder question is which recipe makes the model behave better on the slice that actually matters: the dim-scanner images. The blur-and-rotation recipe trails the macro-F1 leader by 0.7 points (0.803 vs 0.810) but is 22 points ahead of the next-best recipe on the failing slice (0.74 vs 0.52).
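The comparison above is plain arithmetic on the packet, and it can be checked in a few lines. The sketch below copies the table values into a dictionary and reports how far the leader is ahead of the runner-up on each metric. The names `PACKET` and `gap_to_runner_up` are illustrative and not part of the clinic.

```python
# Values copied from the artifact packet; the structure and names are illustrative.
PACKET = {
    "no_augmentation": {"macro_f1": 0.782, "class4_recall": 0.61, "dim_slice_recall": 0.44},
    "standard_flips_crops": {"macro_f1": 0.798, "class4_recall": 0.68, "dim_slice_recall": 0.52},
    "heavy_color_and_cutout": {"macro_f1": 0.810, "class4_recall": 0.66, "dim_slice_recall": 0.49},
    "shift_simulating_rotations_blur": {"macro_f1": 0.803, "class4_recall": 0.72, "dim_slice_recall": 0.74},
}


def gap_to_runner_up(packet, metric):
    """Return (leader, runner_up, gap in points) for one metric column."""
    ranked = sorted(packet, key=lambda name: packet[name][metric], reverse=True)
    leader, runner_up = ranked[0], ranked[1]
    gap_points = round(100 * (packet[leader][metric] - packet[runner_up][metric]), 1)
    return leader, runner_up, gap_points


for metric in ("macro_f1", "class4_recall", "dim_slice_recall"):
    leader, runner_up, gap = gap_to_runner_up(PACKET, metric)
    print(f"{metric}: {leader} leads {runner_up} by {gap} points")
```

Comparing the gap sizes side by side makes the asymmetry concrete: the macro-F1 lead is under one point, while the slice lead is 22 points.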
Decision Prompt
Write the note before you open the reveal.
Your note should answer:
- Which augmentation recipe would you ship right now?
- Would you stop or iterate, and if iterating, what would you target?
- Which single number in the packet drove the decision?
- What evidence would change your mind?
Keep the note short. Four to six sentences is enough.
Strong Reasoning Looks Like
- it names the weak slice as the real constraint, not the average macro-F1
- it recognizes that the augmentations in `shift_simulating_rotations_blur` are shaped like the acquisition shift, not like generic regularization
- it treats the 22-point gap on the slice as much stronger evidence than the 1-point gap on the overall metric
- it does not reflexively prefer "more aggressive" augmentation — strength is not the axis that matters
- it says what one more probe (scanner-stratified cross-validation, or a held-out scanner) would target
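A scanner-stratified probe amounts to computing recall separately per scanner instead of once over the whole validation set. Here is a minimal sketch, assuming per-image labels, predictions, and a scanner tag are available as parallel lists. The function name `recall_by_group`, the scanner tags, and the toy data are hypothetical and not taken from the clinic. The packet does not say which class its slice recall refers to, so using class 4 here is also an assumption.

```python
from collections import defaultdict


def recall_by_group(y_true, y_pred, groups, positive_class):
    """Recall for one class, computed separately within each group (e.g. scanner type)."""
    hits = defaultdict(int)    # true positives per group
    totals = defaultdict(int)  # actual positives per group
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == positive_class:
            totals[group] += 1
            if pred == positive_class:
                hits[group] += 1
    # Groups with no positive examples are omitted rather than reported as 0.
    return {group: hits[group] / totals[group] for group in totals}


# Toy data, made up for illustration: class 4 stands in for the rare pathology class.
y_true = [4, 4, 4, 4, 1, 4, 4, 4, 4, 2]
y_pred = [4, 4, 4, 1, 1, 4, 1, 1, 4, 2]
scanner = ["bright"] * 5 + ["dim"] * 5

print(recall_by_group(y_true, y_pred, scanner, positive_class=4))
```

Reporting the per-scanner numbers next to the pooled metric keeps a weak slice from being averaged away. A held-out scanner would use the same function on a validation set drawn only from that scanner.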
Common Wrong Moves
- picking `heavy_color_and_cutout` because it has the best overall score
- picking `no_augmentation` because it is "simpler" when the overfitting after epoch 15 is visible
- adding all four recipes on top of each other to hedge
- saying "augmentation is task-specific" without naming the shift
- ignoring the slice column because it looks smaller than the headline metric
Run The Clinic In Browser
Use the browser runner as a scratchpad while you write your note.
Reference Reveal
Open only after you write the note
The reference choice is:
- `selected_recipe = shift_simulating_rotations_blur`
- `decision = ship, then iterate on scanner-stratified evaluation`

Why:
- the average macro-F1 gap between recipes is small (1 to 3 points)
- the slice-recall gap between recipes is very large (22 points)
- the slice is not a fairness footnote; it is where the model will fail at deployment
- rotations and blur simulate the shift from the dim scanner, so the augmentation carries the right inductive bias for this task
- `heavy_color_and_cutout` is the wrong shape of augmentation for an acquisition shift; it regularizes, but not along the axis that matters

The practical lesson is not "always pick the slice winner". The lesson is: pick the augmentation whose *shape* matches the shift the model is failing on. When no augmentation matches, invest in data collection before stronger recipes.

What To Do Next
After this clinic:
- open Data Augmentation
- open Vision Augmentation and Shift Robustness
- use Vision and Audio Workflows for the full track where the augmentation decision is wired into a defended workflow