Clinic 13

First Model Defense

You just trained your first real model. It got 97.2% accuracy. Someone asks "is it good?" and you have five minutes to say yes, no, or "we don't know yet" without waving your hands.

Situation

An Excited First Result

A logistic regression on digits hit 97.2% test accuracy. A classmate asks "is that good?" You want to say yes. Pause.

Your Job

Defend Three Things

Defend the metric, the baseline, and the split. If any of the three is missing, the 0.972 means nothing.

Bad Habit To Avoid

Reading Only The Number

A metric without its context is decoration. The number is the least informative part of a good answer.

Situation

You are halfway through First Steps. You ran:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
print(model.score(X_test, y_test))   # 0.972

A classmate asks: "is 0.972 good?"

You want to say yes. Do not. Not yet.

Artifact Packet

Four facts about the run, and three you do not know:

| fact | value |
| --- | --- |
| dataset | sklearn.datasets.load_digits: 1797 small (8×8) handwritten digits, 10 classes |
| split | random 80/20 via train_test_split |
| model | default LogisticRegression, one setting changed (max_iter=5000) |
| test accuracy | 0.972 |
| baseline | not shown (the "always-predict-most-common-class" score) |
| per-class behavior | not shown (confusion matrix) |
| variance | not shown (score across seeds or CV folds) |

If the classmate's "is it good?" meant "should I believe this number?", the answer depends entirely on the three things you did not check.

Decision Prompt

Write a four-to-six-sentence defense answering:

  1. What baseline would you compare this 0.972 against, and what is that baseline's score?
  2. Is the split honest for this dataset (does it mix random examples, or must it preserve some structure)?
  3. If you reran with a different random_state, how much would you expect the score to move?
  4. What would change your confidence — up or down — after seeing the confusion matrix?
  5. Is the metric itself the right one, or would a different metric expose a problem this one hides?

No re-running the code yet. The defense is written first, then checked.

Strong Reasoning Looks Like

  • it reports a named baseline (constant-class on the roughly balanced 10-class digits → ~10% accuracy) and notes 0.972 is far above it
  • it acknowledges that on this balanced 10-class dataset, accuracy is a defensible primary metric — but says so explicitly
  • it predicts a score range across seeds (typically 0.95–0.98 on this tiny dataset) rather than treating 0.972 as a single truth
  • it names the missing inspection: a confusion matrix might reveal a systematic confusion between, for example, 8 and 3 that a single accuracy number hides
  • it is willing to say "it's good on this dataset and split — I don't yet know if it generalizes to new handwriting styles"

Common Wrong Moves

  • saying "yes, 97% is good" without comparing to a baseline
  • treating a single seed as the ground truth — one split, one number, one story
  • reporting accuracy on an imbalanced dataset without mentioning class prevalence (does not apply here, but always ask first)
  • skipping the confusion matrix because the scalar is high
  • forgetting that load_digits is a scrubbed, famous dataset — the number tells you almost nothing about real handwritten-digit performance in the wild

Run The Clinic In Browser

Use the runner to sanity-check the defense — compute the baseline, run the score across seeds, print the confusion matrix.
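The baseline part of that sanity check could look like the sketch below. It assumes scikit-learn's `DummyClassifier` with `strategy="most_frequent"` as a stand-in for the "always-predict-most-common-class" baseline and reuses the split from the original run; the names `class_counts` and `baseline_accuracy` are illustrative, not part of the clinic's runner.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Same data and split as the original run.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Class prevalence: confirms whether accuracy is a defensible primary metric.
class_counts = np.bincount(y)
print("examples per class:", class_counts)

# Constant-class baseline: always predict the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)
print(f"constant-class baseline accuracy: {baseline_accuracy:.3f}")
```

Compare the printed baseline against your written prediction before looking at anything else; the point is to check the defense, not to replace it.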

Reference Reveal

Open only after you write the defense.

The reference defense is six sentences:

  1. The constant-class baseline on 10-class, roughly balanced `load_digits` scores ~10%; 0.972 is nearly ten times that. The model is clearly learning something.
  2. The split is random and the dataset has no time or group structure worth preserving, so a random split is honest here (unlike with time-series or patient data, where random splits leak).
  3. Across 20 different `random_state` values, the test accuracy spans roughly 0.950–0.985. 0.972 is near the middle of that band; it is a plausible sample, not a measurement.
  4. The confusion matrix on this dataset typically shows a small cluster of 8↔3 or 5↔8 confusions. If my matrix showed a single class with 50% recall, I would lower confidence sharply.
  5. Accuracy is a reasonable primary metric on this balanced 10-class task. If the task were "flag a 7 in a flood of non-7s," I would switch to precision/recall at a chosen threshold.
  6. I believe the number *on this dataset and this split*. I do not yet know whether the model generalizes to messier handwriting; `load_digits` is small, pre-centered, and noise-free.

Why:

  • the first sentence converts an isolated metric into a comparison. A score with no reference is decoration.
  • the second says the quiet part: most split failures are about the *kind* of split. For `load_digits`, random is fine; that deserves a sentence.
  • the third forces the student to confront variance: a score with no uncertainty interval is over-confident by default
  • the fourth and fifth interrogate the *shape* of the mistakes, not just the count
  • the sixth protects against over-generalization, the place where beginner-grade wins most often fail when they hit real data

The practical lesson: every model result is three claims in a trench coat (*this metric*, *on this split*, *beating this baseline*). A defense without all three is a wish.
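The per-class inspection from the fourth sentence of the reference defense might be checked with something like the following. This is a sketch, assuming `sklearn.metrics.confusion_matrix` and the same split as the original run; `per_class_recall` and `off_diagonal` are illustrative names, and which digit pairs get confused on your run is something to read off the output rather than assume.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Refit the model from the original run.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LogisticRegression(max_iter=5000).fit(X_train, y_train)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, model.predict(X_test))
print(cm)

# Recall per class: correct predictions divided by true examples of that class.
per_class_recall = cm.diagonal() / cm.sum(axis=1)
for digit, recall in enumerate(per_class_recall):
    print(f"digit {digit}: recall {recall:.3f}")

# Largest off-diagonal cell: the single most frequent confusion.
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
true_digit, predicted_digit = np.unravel_index(
    off_diagonal.argmax(), off_diagonal.shape
)
print(
    f"most frequent confusion: true {true_digit} predicted as "
    f"{predicted_digit} ({off_diagonal.max()} times)"
)
```

A class whose recall sits far below the others is the signal to lower confidence, even when the overall accuracy stays high.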

What To Do Next

After this clinic:

  1. open Honest Splits and Baselines — the systematic treatment of what you just defended
  2. open Public/Private Restraint — the grown-up version of this clinic for competition settings
  3. open Evaluation Metrics Deep Dive — for when accuracy is not the right primary
  4. rerun the code with random_state 0–19 and watch the score span; the banded view is the real answer to "is 0.972 good?"
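The seed sweep in step 4 can be sketched as below, assuming the only thing that changes between runs is the `random_state` passed to `train_test_split`. The `scores` array and the summary statistics are illustrative choices; the band you see is whatever your run prints.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# One fresh 80/20 split and one fresh fit per seed, 0 through 19.
scores = []
for seed in range(20):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

scores = np.array(scores)
print(f"min  {scores.min():.3f}")
print(f"max  {scores.max():.3f}")
print(f"mean {scores.mean():.3f}")
print(f"std  {scores.std():.3f}")
```

Reporting the min, max, and spread alongside the single 0.972 turns one number into a band, which is the form the defense asks for.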