Clinic 11

Augmentation Choice

Four augmentation recipes, four validation curves, one weak slice telling a different story. Pick the augmentation you would ship and defend why.


Situation

Four Recipes, One Choice

You tried four augmentation stacks for an image classifier. Validation scores are close. The weak slice is not.

Your Job

Pick For The Weak Slice

Pick the augmentation recipe you would ship, decide whether to iterate, and name what evidence would change your mind.

Bad Habit To Avoid

More Augmentation = Better

If the reasoning is "stronger augmentation always wins," the clinic failed.

Situation

You are training an image classifier on a medical imaging task. The dataset has:

~8,000 training images across 5 classes
a rare 4th class ("late-stage pathology") with ~400 training images
a validation split that preserves class ratios
a known weak slice: images acquired on one of the two scanner types, where the pathology appears dimmer

You have four augmentation recipes and matching curves:

Artifact Packet

Read the packet before you decide:

| recipe | validation macro-F1 | class-4 recall | dim-scanner slice recall | training stability | notes |
| --- | --- | --- | --- | --- | --- |
| `no_augmentation` | 0.782 | 0.61 | 0.44 | smooth | overfits after epoch 15 |
| `standard_flips_crops` | 0.798 | 0.68 | 0.52 | smooth | standard ImageNet-style recipe |
| `heavy_color_and_cutout` | 0.810 | 0.66 | 0.49 | slightly noisy | strong color jitter + CutMix |
| `shift_simulating_rotations_blur` | 0.803 | 0.72 | 0.74 | smooth | rotations + blur match the acquisition shift |

The tempting move is obvious: `heavy_color_and_cutout` has the highest validation macro-F1 (0.810).

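The harder question is which recipe makes the model behave better on the slice that actually matters: the dim-scanner images. The blur-and-rotation recipe trails the macro-F1 leader by 0.7 points (0.803 vs 0.810) but is 22 points ahead of the next-best recipe on the failing slice (0.74 vs 0.52). It also has the best class-4 recall in the packet (0.72).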
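The margins quoted above can be checked directly from the packet. The following is a minimal sketch: the numbers are copied from the table, while the dictionary layout, the key names, and the helper `leader_and_margin` are illustrative assumptions rather than part of the clinic's tooling.

```python
# Packet values copied from the table; key names are hypothetical.
PACKET = {
    "no_augmentation": {
        "macro_f1": 0.782, "class4_recall": 0.61, "dim_slice_recall": 0.44},
    "standard_flips_crops": {
        "macro_f1": 0.798, "class4_recall": 0.68, "dim_slice_recall": 0.52},
    "heavy_color_and_cutout": {
        "macro_f1": 0.810, "class4_recall": 0.66, "dim_slice_recall": 0.49},
    "shift_simulating_rotations_blur": {
        "macro_f1": 0.803, "class4_recall": 0.72, "dim_slice_recall": 0.74},
}


def leader_and_margin(packet, metric):
    """Return (best recipe, its lead over the runner-up in percentage points)."""
    ranked = sorted(packet, key=lambda name: packet[name][metric], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    margin = (packet[best][metric] - packet[runner_up][metric]) * 100
    return best, round(margin, 1)


for metric in ("macro_f1", "class4_recall", "dim_slice_recall"):
    best, margin = leader_and_margin(PACKET, metric)
    print(f"{metric}: {best} leads by {margin} points")
```

Putting the three margins side by side makes the asymmetry explicit: the headline metric is decided by less than a point, while the slice metric is decided by more than twenty.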

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

Which augmentation recipe would you ship right now?
Would you stop or iterate, and if iterating, what would you target?
Which single number in the packet drove the decision?
What evidence would change your mind?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

it names the weak slice as the real constraint, not the average macro-F1
it recognizes that the augmentations in `shift_simulating_rotations_blur` are shaped like the acquisition shift, not like generic regularization
it treats the 22-point lead on the slice as much stronger evidence than the roughly 1-point deficit on the overall metric
it does not reflexively prefer "more aggressive" augmentation, because strength is not the axis that matters
it says what one more probe (scanner-stratified cross-validation, or a held-out scanner) would target
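The scanner-stratified probe mentioned in the last point could look roughly like the sketch below. It assumes that "slice recall" means recall for one chosen class restricted to images from one scanner, which the packet does not spell out. The function names and the toy labels are invented for illustration.

```python
from collections import defaultdict


def recall_by_slice(y_true, y_pred, slices, positive_class):
    """Recall for `positive_class`, computed separately within each slice."""
    hits = defaultdict(int)    # true positives per slice
    totals = defaultdict(int)  # actual positives per slice
    for truth, pred, tag in zip(y_true, y_pred, slices):
        if truth == positive_class:
            totals[tag] += 1
            hits[tag] += int(pred == positive_class)
    return {tag: hits[tag] / totals[tag] for tag in totals}


def leave_one_scanner_out(scanners):
    """Yield (held_out, train_idx, test_idx), holding out one scanner at a time."""
    for held_out in sorted(set(scanners)):
        train_idx = [i for i, s in enumerate(scanners) if s != held_out]
        test_idx = [i for i, s in enumerate(scanners) if s == held_out]
        yield held_out, train_idx, test_idx


# Toy labels and scanner tags, invented for illustration only.
y_true = [4, 4, 4, 4, 1, 4, 4, 2]
y_pred = [4, 4, 1, 4, 1, 1, 4, 2]
scanner = ["bright", "bright", "dim", "bright", "dim", "dim", "dim", "bright"]

print(recall_by_slice(y_true, y_pred, scanner, positive_class=4))
for held_out, train_idx, test_idx in leave_one_scanner_out(scanner):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Holding out a whole scanner is a harsher test than stratifying folds by scanner, since the model never sees the held-out acquisition conditions during training. Which of the two is appropriate depends on whether the deployment site will use a scanner type present in the training data.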

Common Wrong Moves

picking `heavy_color_and_cutout` because it has the best overall score
picking `no_augmentation` because it is "simpler" when the overfitting after epoch 15 is visible
adding all four recipes on top of each other to hedge
saying "augmentation is task-specific" without naming the shift
ignoring the slice column because it looks smaller than the headline metric

Run The Clinic In Browser

Use the browser runner as a scratchpad while you write your note.

Reference Reveal

Open only after you write the note

The reference choice is:

- `selected_recipe = shift_simulating_rotations_blur`
- `decision = ship, then iterate on scanner-stratified evaluation`

Why:

- the validation macro-F1 gap between recipes is small (roughly 1 to 3 points)
- the slice-recall gap is very large (22 points over the next-best recipe, 30 over no augmentation)
- the slice is not a fairness footnote; it is where the model will fail at deployment
- rotations and blur simulate the shift from the dim scanner, so the augmentation carries the right inductive bias for this task
- `heavy_color_and_cutout` is the wrong shape of augmentation for an acquisition shift; it regularizes, but not along the axis that matters

The practical lesson is not "always pick the slice winner". The lesson is: pick the augmentation whose *shape* matches the shift the model is failing on. When no augmentation matches, invest in data collection before stronger recipes.
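The numeric half of that reasoning can be written down as a guarded selection rule. This is one possible sketch, not the clinic's method: it keeps only recipes whose macro-F1 is within a tolerance of the best, then ranks the survivors by slice recall. The function `pick_recipe`, the key names, and the 0.02 tolerance are all assumptions. The rule says nothing about whether an augmentation's shape matches the shift, which still has to be argued separately.

```python
# Packet values copied from the table; key names are hypothetical.
PACKET = {
    "no_augmentation": {"macro_f1": 0.782, "dim_slice_recall": 0.44},
    "standard_flips_crops": {"macro_f1": 0.798, "dim_slice_recall": 0.52},
    "heavy_color_and_cutout": {"macro_f1": 0.810, "dim_slice_recall": 0.49},
    "shift_simulating_rotations_blur": {"macro_f1": 0.803, "dim_slice_recall": 0.74},
}


def pick_recipe(packet, headline="macro_f1",
                slice_metric="dim_slice_recall", tolerance=0.02):
    """Among recipes close to the best headline score, pick the best on the slice.

    `tolerance` is an assumed budget: how much headline score we are willing
    to give up in exchange for better behavior on the weak slice.
    """
    best_headline = max(row[headline] for row in packet.values())
    eligible = {
        name: row for name, row in packet.items()
        if best_headline - row[headline] <= tolerance
    }
    return max(eligible, key=lambda name: eligible[name][slice_metric])


print(pick_recipe(PACKET))                 # slice-aware choice
print(pick_recipe(PACKET, tolerance=0.0))  # headline-only choice
```

Setting the tolerance to zero reproduces the tempting headline-only pick, which makes the tolerance the place where the trade-off is stated explicitly and can be challenged.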

What To Do Next

After this clinic:

open Data Augmentation
open Vision Augmentation and Shift Robustness
use Vision and Audio Workflows for the full track where the augmentation decision is wired into a defended workflow