Honest Splits and Baselines

What This Is

This page is about one rule:

  • the split is part of the method — if the split is weak, every later comparison becomes suspect

A real model has to beat a trivial baseline under a split that matches the deployment story, not one chosen for flattering numbers.

When You Use It

  • starting any supervised tabular task
  • comparing a first learned model against a dummy baseline
  • deciding whether a stronger model is worth the extra complexity
  • checking whether a metric or feature idea is actually better than the floor

The Evaluation Contract

The honest contract for a first serious workflow is:

  1. define the split rule from the deployment story
  2. reserve a validation set or cross-validation scheme for selection
  3. keep one locked test or holdout for the final check only
  4. compare the dummy baseline and the first learned model under exactly the same split

If the dataset is small, cross-validation can replace a single validation split for selection, but the locked test set keeps its separate role: it is scored once at the end and is never another knob in the tuning loop.
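A minimal sketch of the contract above, assuming independent rows and a binary label. The synthetic data and the names X_dev, X_test are illustrative assumptions, not part of the original recipe:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, roughly 20% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.2).astype(int)

# Carve off the locked test set first and leave it alone until the final check.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0
)

# Split the remaining rows into train and validation for model selection.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0
)

print(len(X_train), len(X_valid), len(X_test))
```

Splitting off the test rows before anything else makes it harder for selection decisions to touch them by accident.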

Two baseline numbers are worth knowing cold:

  • majority-class accuracy floor: max_c p(y=c)
  • random-ranking average precision floor: approximately the positive prevalence

Those numbers stop a weak metric choice from looking like progress.
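Both floors can be checked numerically. The sketch below uses synthetic labels with about 10% positives (an assumption) and estimates the random-ranking floor by scoring rows with pure noise:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Synthetic labels (assumption): about 10% positives.
rng = np.random.default_rng(0)
y = (rng.random(5000) < 0.1).astype(int)

# Majority-class accuracy floor: the share of the most common label.
majority_floor = np.bincount(y).max() / len(y)

# Random-ranking AP floor: rank rows by noise, average over a few repeats.
prevalence = y.mean()
random_ap = np.mean(
    [average_precision_score(y, rng.random(len(y))) for _ in range(20)]
)

print(f"majority accuracy floor: {majority_floor:.3f}")
print(f"positive prevalence:     {prevalence:.3f}")
print(f"random-ranking AP:       {random_ap:.3f}")
```

An accuracy near the first number, or an average precision near the second, means the model has not yet cleared the floor.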

Split Recipe

Use one split rule and keep it fixed while you compare models:

from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)

  • stratify=y keeps the class proportions consistent between train and validation
  • random_state makes the comparison reproducible
  • the same rows are used to compare every first-pass model

If the task has groups, time order, or repeated entities, this recipe is not enough: use a grouped or time-ordered split that respects that structure instead.
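As a hedged sketch of what such splits can look like, scikit-learn provides GroupShuffleSplit for repeated entities and TimeSeriesSplit for ordered rows. The groups array here is a hypothetical entity id per row, and the rows are assumed to be already sorted by time for the ordered case:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, TimeSeriesSplit

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = rng.integers(0, 2, size=n)
groups = rng.integers(0, 40, size=n)  # hypothetical entity id for each row

# Grouped split: every entity lands entirely in train or entirely in validation.
gss = GroupShuffleSplit(n_splits=1, test_size=0.30, random_state=0)
train_idx, valid_idx = next(gss.split(X, y, groups=groups))
shared_groups = set(groups[train_idx]) & set(groups[valid_idx])

# Ordered split: each validation block comes strictly after its training rows.
tscv = TimeSeriesSplit(n_splits=4)
ordered_ok = all(tr.max() < va.min() for tr, va in tscv.split(X))

print("entities on both sides:", len(shared_groups))
print("validation always after train:", ordered_ok)
```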

For steadier estimates on smaller datasets, wrap the comparison in folds:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
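One way to use the fold object is to pass the same cv to every estimator, so the dummy and the learned model are scored on identical folds. This is a sketch on synthetic data from make_classification; the dataset parameters and the choice of average precision as the metric are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data (assumption): about 20% positives.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_clusters_per_class=1,
    weights=[0.8, 0.2], class_sep=1.5, random_state=0,
)

# A fixed random_state means both estimators see exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

dummy_scores = cross_val_score(
    DummyClassifier(strategy="prior"), X, y, cv=cv, scoring="average_precision"
)
model_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="average_precision"
)

# Report the mean and the fold spread, not just one number.
print(f"dummy AP: {dummy_scores.mean():.3f} +/- {dummy_scores.std():.3f}")
print(f"model AP: {model_scores.mean():.3f} +/- {model_scores.std():.3f}")
```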

Baseline Ladder

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Fit both on the same training rows so the comparison is like-for-like.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

The ladder:

  • DummyClassifier(strategy="prior") — probability-aware floor, usually the cleanest first reference
  • DummyClassifier(strategy="most_frequent") — simplest hard-label rule
  • one honest linear model on the same split
  • one stronger family, only if the baseline is stable

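prior and most_frequent return the same hard labels under predict (both always predict the most frequent class) but differ under predict_proba: prior returns the empirical class frequencies, while most_frequent puts all probability on the majority class. That matters when the metric uses probabilities rather than hard labels.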
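The difference is easy to see on a toy label vector. The 80/20 class mix below is an illustrative assumption; features are zeros because dummy strategies ignore them:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Toy data (assumption): 80 negatives, 20 positives.
y_train = np.array([0] * 80 + [1] * 20)
X_train = np.zeros((100, 1))

prior = DummyClassifier(strategy="prior").fit(X_train, y_train)
most_frequent = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Hard labels: both strategies always predict the majority class.
same_labels = np.array_equal(
    prior.predict(X_train), most_frequent.predict(X_train)
)

# Probabilities: class frequencies versus a one-hot vote for the majority.
prior_proba = prior.predict_proba(X_train[:1])[0]
mf_proba = most_frequent.predict_proba(X_train[:1])[0]

# A probability metric such as log loss separates the two floors.
prior_loss = log_loss(y_train, prior.predict_proba(X_train))
mf_loss = log_loss(y_train, most_frequent.predict_proba(X_train))

print(same_labels, prior_proba, mf_proba)
print(prior_loss < mf_loss)
```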

Which Split Should Win

Do not ask only whether the model wins. Ask whether it still wins under the split that matches the real task.

  • random split — useful only when rows are genuinely independent
  • grouped split — required when repeated entities can leak identity across train and validation
  • ordered split — required when features are meant to predict the future
  • leaky split — any split that allows future or duplicate information to cross the boundary; treat as invalid evidence

If a model looks strong on a random split but weak on grouped or ordered validation, the random-split number was the flattering one. The grouped or ordered split is not being unfair; it is measuring the task the model will actually face.
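The gap can be reproduced on synthetic data. In this sketch (all names and parameters are assumptions), each entity has its own feature signature and a label unrelated to it, so the only way to score well is to recognise an entity already seen in training:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic entities (assumption): 50 groups, 10 near-duplicate rows each.
rng = np.random.default_rng(0)
n_groups, rows_per_group = 50, 10
centers = rng.normal(size=(n_groups, 5))
group_labels = rng.integers(0, 2, size=n_groups)  # label is random per entity
groups = np.repeat(np.arange(n_groups), rows_per_group)
X = centers[groups] + 0.05 * rng.normal(size=(len(groups), 5))
y = group_labels[groups]

model = KNeighborsClassifier(n_neighbors=1)

# Random split: rows from the same entity sit on both sides of the boundary.
random_scores = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0)
)

# Grouped split: each entity is held out as a whole.
grouped_scores = cross_val_score(
    model, X, y, cv=GroupKFold(n_splits=5), groups=groups
)

print(f"random split accuracy:  {random_scores.mean():.3f}")
print(f"grouped split accuracy: {grouped_scores.mean():.3f}")
```

The random-split score here rewards memorising entities, which is exactly the identity leak a grouped split removes.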

What To Inspect

  • class balance in train and validation before comparing models
  • whether every model sees the same split and the same preprocessing boundary
  • baseline versus learned model under the same assumptions
  • whether the baseline is strong because the task is easy or because the split is leaking
  • a hard-label baseline and a probability-aware baseline when the metric uses probabilities

If the baseline already looks suspiciously good, inspect the split before celebrating the model.
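The first inspection item is a one-line check per split. A small sketch, assuming a binary label with 10% positives; positive_rate is a hypothetical helper:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption): exactly 10% positives.
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)


def positive_rate(labels):
    """Share of positive labels in a split (hypothetical helper)."""
    return float(np.mean(labels))


# Stratified split: class mix is matched by construction.
_, _, y_tr_s, y_va_s = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)
# Plain random split: class mix can drift between the two sides.
_, _, y_tr_p, y_va_p = train_test_split(X, y, test_size=0.30, random_state=0)

stratified_gap = abs(positive_rate(y_tr_s) - positive_rate(y_va_s))
plain_gap = abs(positive_rate(y_tr_p) - positive_rate(y_va_p))

print(f"stratified gap: {stratified_gap:.4f}")
print(f"plain gap:      {plain_gap:.4f}")
```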

Failure Pattern

The most common failure is building features or choosing the model before the split is fixed. Once the split boundary moves casually, the comparison stops being trustworthy.

Other patterns:

  • letting a good dummy baseline talk you out of a real model — the baseline is a floor, not the goal
  • using stratify as a cure-all: it preserves class ratios, but it does nothing about group overlap or leaky feature definitions
  • treating one lucky split as a final answer
  • trusting accuracy on imbalanced data when a rare-class metric such as recall or average precision tells a more honest story
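The last pattern is easy to illustrate. Assuming 5% positives (an illustrative figure), a rule that never predicts the rare class still posts high accuracy:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels (assumption): 50 positives out of 1000 rows.
y_true = np.array([1] * 50 + [0] * 950)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)            # share of correct labels
recall = recall_score(y_true, y_pred)                # share of positives found
balanced = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recall

print(accuracy, recall, balanced)
```

Accuracy matches the majority-class floor exactly, while recall on the rare class shows that nothing was learned.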

Common Mistakes

  • fitting preprocessing on the full dataset before splitting
  • changing the split and the model at the same time
  • claiming a tuning gain without showing baseline and fold spread
  • comparing metrics that do not match the task cost
  • reaching for tuning before the baseline is stable
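For the first mistake, one common remedy is to put preprocessing inside a scikit-learn Pipeline so it is fitted on training rows only. This is a sketch on synthetic data; StandardScaler stands in for whatever preprocessing the task needs:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): the label depends noisily on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(400, 4))
y = (X[:, 0] + rng.normal(size=400) > 5.0).astype(int)

# Split first, then fit everything else on the training side only.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# The scaler's statistics come from the training rows, not the full dataset.
scaler = pipe.named_steps["standardscaler"]
uses_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
valid_accuracy = pipe.score(X_valid, y_valid)

print(uses_train_only, round(valid_accuracy, 3))
```

The same pipeline object can be passed to cross-validation, where it is refitted inside each fold.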

Practice

  1. Train a dummy baseline and report its validation accuracy.
  2. Train logistic regression on the same split and compare honestly.
  3. State one class-imbalance situation where accuracy would mislead you.
  4. Explain why prior and most_frequent can agree on hard labels but differ on probabilities.
  5. Describe one change that should be postponed until the baseline is stable.
  6. Name one reason a grouped split can cut the score back toward the dummy floor.

Runnable Example

Inspect the dummy-versus-logistic comparison first, then compare it with the leaky variant.

Longer Connection

Continue with Leakage Patterns for the boundary-crossing failures this page alludes to, then scikit-learn Validation and Tuning for the full split-selection-tuning workflow.