
Clinic 11

Augmentation Choice

Four augmentation recipes, four validation curves, one weak slice telling a different story. Pick the augmentation you would ship and defend why.

Situation

Four Recipes, One Choice

You tried four augmentation stacks for an image classifier. Validation scores are close. The weak slice is not.

Your Job

Pick For The Weak Slice

Pick the augmentation recipe you would ship, decide whether to iterate, and name what evidence would change your mind.

Bad Habit To Avoid

More Augmentation = Better

If the reasoning is "stronger augmentation always wins," the clinic failed.

Situation

You are training an image classifier on a medical imaging task. The dataset has:

  • ~8,000 training images across 5 classes
  • a rare 4th class ("late-stage pathology") with ~400 training images
  • a validation split that preserves class ratios
  • a known weak slice: images acquired on one of the two scanner types, where the pathology appears dimmer

You have four augmentation recipes and matching curves:

Artifact Packet

Read the packet before you decide:

| recipe | validation macro-F1 | class-4 recall | dim-scanner slice recall | training stability | notes |
| --- | --- | --- | --- | --- | --- |
| no_augmentation | 0.782 | 0.61 | 0.44 | smooth | overfits after epoch 15 |
| standard_flips_crops | 0.798 | 0.68 | 0.52 | smooth | standard ImageNet-style recipe |
| heavy_color_and_cutout | 0.810 | 0.66 | 0.49 | slightly noisy | strong color jitter + CutMix |
| shift_simulating_rotations_blur | 0.803 | 0.72 | 0.74 | smooth | rotations + blur match the acquisition shift |

The tempting move is obvious: heavy_color_and_cutout has the highest validation macro-F1 (0.810).

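The harder question is which recipe makes the model behave better on the slice that actually matters: the dim-scanner images. The blur-and-rotation recipe trails the leader by 0.7 points of macro-F1 but is 22 points ahead of the next-best recipe on the failing slice (0.74 vs. 0.52), and 25 points ahead of heavy_color_and_cutout (0.74 vs. 0.49).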
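The comparison can be checked with a few lines of arithmetic. The sketch below copies the numbers from the packet and reports, for each metric, how far the leader is ahead of the runner-up. The dictionary keys and function names are illustrative choices, not part of the clinic.

```python
# Values copied from the artifact packet; key names are illustrative.
PACKET = {
    "no_augmentation": {
        "macro_f1": 0.782, "class4_recall": 0.61, "dim_slice_recall": 0.44},
    "standard_flips_crops": {
        "macro_f1": 0.798, "class4_recall": 0.68, "dim_slice_recall": 0.52},
    "heavy_color_and_cutout": {
        "macro_f1": 0.810, "class4_recall": 0.66, "dim_slice_recall": 0.49},
    "shift_simulating_rotations_blur": {
        "macro_f1": 0.803, "class4_recall": 0.72, "dim_slice_recall": 0.74},
}


def ranked(metric):
    """Recipe names sorted from best to worst on one metric."""
    return sorted(PACKET, key=lambda name: PACKET[name][metric], reverse=True)


def gap_in_points(metric, recipe_a, recipe_b):
    """Difference (a - b) on a metric, in points, where 0.01 is one point."""
    diff = PACKET[recipe_a][metric] - PACKET[recipe_b][metric]
    return round(diff * 100, 1)


def lead_over_runner_up(metric):
    """Return (leader, runner_up, gap in points) for one metric."""
    best, runner_up = ranked(metric)[:2]
    return best, runner_up, gap_in_points(metric, best, runner_up)


for metric in ("macro_f1", "class4_recall", "dim_slice_recall"):
    best, runner_up, gap = lead_over_runner_up(metric)
    print(f"{metric}: {best} leads {runner_up} by {gap} points")
```

On these numbers the macro-F1 leader is ahead by 0.7 points, while the dim-scanner leader is ahead by 22 points. The two columns disagree about which recipe is best, and only one of the leads is large.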

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

  1. Which augmentation recipe would you ship right now?
  2. Would you stop or iterate, and if iterating, what would you target?
  3. Which single number in the packet drove the decision?
  4. What evidence would change your mind?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

  • it names the weak slice as the real constraint, not the average macro-F1
  • it recognizes that the augmentations in shift_simulating_rotations_blur are shaped like the acquisition shift, not like generic regularization
  • it treats the 22-point gap on the slice as much stronger evidence than the gap of under 1 point on the overall metric
  • it does not reflexively prefer "more aggressive" augmentation — strength is not the axis that matters
  • it says what one more probe (scanner-stratified cross-validation, or a held-out scanner) would target
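The scanner-stratified probe in the last bullet could be set up roughly as follows. This is a minimal, dependency-free sketch, not the clinic's tooling: the record layouts, the function names, and the `"bright"` / `"dim"` scanner labels are all assumptions made for illustration. It does not shuffle samples, which a real split would normally do first.

```python
from collections import defaultdict


def recall_by_scanner(records, positive_class):
    """Recall for one class, split by scanner.

    records: iterable of (true_label, predicted_label, scanner_id).
    Scanners with no positive examples are omitted from the result.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for true_label, predicted_label, scanner in records:
        if true_label != positive_class:
            continue
        totals[scanner] += 1
        hits[scanner] += int(predicted_label == positive_class)
    return {scanner: hits[scanner] / totals[scanner] for scanner in totals}


def scanner_stratified_folds(samples, n_folds):
    """Assign sample ids to folds so each fold mixes classes and scanners.

    samples: list of (sample_id, label, scanner_id).
    Samples are dealt round-robin within each (label, scanner) stratum,
    and the dealing position carries over between strata so that fold
    sizes stay balanced.
    """
    strata = defaultdict(list)
    for sample_id, label, scanner in samples:
        strata[(label, scanner)].append(sample_id)

    folds = [[] for _ in range(n_folds)]
    next_fold = 0
    for key in sorted(strata):
        for sample_id in strata[key]:
            folds[next_fold].append(sample_id)
            next_fold = (next_fold + 1) % n_folds
    return folds


# Hypothetical toy predictions for class 4 on two scanners.
toy_records = [
    (4, 4, "bright"), (4, 4, "bright"),
    (4, 4, "dim"), (4, 0, "dim"),
    (0, 0, "dim"),
]
print(recall_by_scanner(toy_records, positive_class=4))
```

Reporting recall per scanner in every fold shows whether the dim-scanner advantage holds across splits or depends on one lucky validation set. The held-out-scanner variant would instead train on one scanner only and evaluate on the other.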

Common Wrong Moves

  • picking heavy_color_and_cutout because it has the best overall score
  • picking no_augmentation because it is "simpler" when the overfitting after epoch 15 is visible
  • adding all four recipes on top of each other to hedge
  • saying "augmentation is task-specific" without naming the shift
  • ignoring the slice column because it looks smaller than the headline metric

Run The Clinic In Browser

Use the browser runner as a scratchpad while you write your note.

Reference Reveal

Open only after you write the note

The reference choice is:

  • `selected_recipe = shift_simulating_rotations_blur`
  • `decision = ship, then iterate on scanner-stratified evaluation`

Why:

  • the average macro-F1 gap between recipes is small (1 to 3 points)
  • the slice-recall gap between recipes is very large (22 points)
  • the slice is not a fairness footnote; it is where the model will fail at deployment
  • rotations and blur simulate the shift from the dim scanner, so the augmentation carries the right inductive bias for this task
  • `heavy_color_and_cutout` is the wrong shape of augmentation for an acquisition shift; it regularizes, but not along the axis that matters

The practical lesson is not "always pick the slice winner". The lesson is: pick the augmentation whose *shape* matches the shift the model is failing on. When no augmentation matches, invest in data collection before stronger recipes.
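The clinic does not say how the rotations-and-blur recipe is implemented. As a toy illustration of what "shape" means here, the sketch below applies random right-angle rotations and a 3x3 box blur to a grayscale image stored as nested lists. The function names, the choice of 90-degree steps, the blur kernel, and the `blur_prob` parameter are all assumptions; a real pipeline would use an image library and parameters tuned to the scanner.

```python
import random


def rotate90(image):
    """Rotate a 2-D grid of pixel values 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def box_blur(image):
    """3x3 mean filter; border pixels are replicated past the edge."""
    height, width = len(image), len(image[0])
    blurred = []
    for r in range(height):
        row = []
        for c in range(width):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    total += image[rr][cc]
            row.append(total / 9)
        blurred.append(row)
    return blurred


def augment(image, rng, blur_prob=0.5):
    """Apply 0 to 3 quarter turns, then blur with probability blur_prob."""
    for _ in range(rng.randrange(4)):
        image = rotate90(image)
    if rng.random() < blur_prob:
        image = box_blur(image)
    return image


# Hypothetical 3x3 "image" with one bright pixel in the centre.
toy_image = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
rng = random.Random(0)  # seeded so the example is repeatable
print(augment(toy_image, rng))
```

The blur spreads the single bright pixel over its neighbourhood and lowers its peak value, which is loosely how a feature looks when it appears dimmer and softer. Color jitter or patch mixing would change the image along different axes.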

What To Do Next

After this clinic:

  1. open Data Augmentation
  2. open Vision Augmentation and Shift Robustness
  3. use Vision and Audio Workflows for the full track where the augmentation decision is wired into a defended workflow