Honest Splits and Baselines¶

What This Is¶

This page is about one rule:

the split is part of the method — if the split is weak, every later comparison becomes suspect

A real model has to beat a trivial baseline under a split that matches the deployment story, not one chosen for flattering numbers.

When You Use It¶

starting any supervised tabular task
comparing a first learned model against a dummy baseline
deciding whether a stronger model is worth the extra complexity
checking whether a metric or feature idea is actually better than the floor

The Evaluation Contract¶

The honest contract for a first serious workflow is:

define the split rule from the deployment story
reserve a validation set or cross-validation scheme for selection
keep one locked test or holdout for the final check only
compare the dummy baseline and the first learned model under exactly the same split

If the dataset is small, cross-validation can replace a single validation split for selection, but the locked test set keeps a separate role: it is the final check, not another knob in the tuning loop.
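One way to sketch the contract above is to carve off the locked test set first, then split what remains for selection. The synthetic data from make_classification, the 60/20/20 proportions, and the X_dev / y_dev names are assumptions for illustration, not something this page prescribes.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); replace with the real X and y.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0,
)

# Step 1: reserve the locked test set and leave it alone during selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=0,
)

# Step 2: split the remaining rows into train and validation for selection.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_dev, y_dev, test_size=0.25, stratify=y_dev, random_state=0,
)

print(len(X_train), len(X_valid), len(X_test))
```

Every candidate, including the dummy baseline, is then fitted on X_train and compared on X_valid; X_test is scored once, at the end.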

Two baseline numbers are worth knowing cold:

majority-class accuracy floor: max_c p(y=c)
random-ranking average precision floor: approximately the positive prevalence

Those numbers stop a weak metric choice from looking like progress: 95% accuracy is no achievement when 95% of the rows belong to one class.
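Both floors can be checked numerically. The sketch below uses made-up labels with roughly 10% positives (an assumption) and scores them with uniformly random numbers to show that average precision lands near the prevalence.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Made-up binary labels with roughly 10% positives (illustrative only).
y = (rng.random(20000) < 0.10).astype(int)

# Majority-class accuracy floor: the share of the most common class.
counts = np.bincount(y)
majority_floor = counts.max() / counts.sum()

# Random-ranking floor: score every row with noise, then compute AP.
prevalence = y.mean()
random_scores = rng.random(y.size)
random_ap = average_precision_score(y, random_scores)

print(f"majority-class accuracy floor: {majority_floor:.3f}")
print(f"positive prevalence:           {prevalence:.3f}")
print(f"AP of a random ranking:        {random_ap:.3f}")
```

A learned model that reports accuracy near the first number, or average precision near the second, has not yet cleared the floor.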

Split Recipe¶

Use one split rule and keep it fixed while you compare models:

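from sklearn.model_selection import train_test_split

# X is the feature matrix and y the label vector, loaded beforehand.
# Hold out 30% of the rows for validation, stratified by class.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)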

stratify=y keeps the class mix believable across splits
random_state makes the comparison reproducible
the same rows are used to compare every first-pass model

If the task has groups, time order, or repeated entities, this recipe is not enough — use a split that respects the data shape instead.

For steadier estimates on smaller datasets, wrap the comparison in folds:

from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
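A minimal sketch of how the fold object might be used: pass the same cv to cross_val_score for the dummy and for the learned model, then report the mean and the spread. The synthetic data and the candidates dictionary are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption); replace with the real X and y.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=4, n_clusters_per_class=1,
    class_sep=1.5, weights=[0.7, 0.3], random_state=0,
)

# A fixed random_state means both candidates see exactly the same folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "dummy_prior": DummyClassifier(strategy="prior"),
    "logistic": LogisticRegression(max_iter=1000),
}

scores = {}
for name, estimator in candidates.items():
    scores[name] = cross_val_score(estimator, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean={scores[name].mean():.3f} std={scores[name].std():.3f}")
```

Reporting the standard deviation next to the mean makes it harder to mistake fold-to-fold noise for a real gain.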

Baseline Ladder¶

from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

# Fit both rungs on the same training rows so the comparison is like for like.
dummy = DummyClassifier(strategy="prior").fit(X_train, y_train)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

The ladder:

DummyClassifier(strategy="prior") — probability-aware floor, usually the cleanest first reference
DummyClassifier(strategy="most_frequent") — simplest hard-label rule
one honest linear model on the same split
one stronger family, only if the baseline is stable

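prior and most_frequent return the same hard labels under predict (both always predict the most frequent training class) but differ under predict_proba: prior returns the empirical class distribution, while most_frequent puts all probability on the majority class. That matters when the metric uses probabilities rather than hard labels.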
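The difference is easy to see on a toy label vector. This sketch assumes an 80/20 class mix and an all-zeros feature column, since dummy classifiers ignore the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Toy labels: 80 negatives, 20 positives (illustrative only).
y = np.array([0] * 80 + [1] * 20)
X = np.zeros((100, 1))  # placeholder features; dummies never read them

prior = DummyClassifier(strategy="prior").fit(X, y)
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y)

# Hard labels agree: both predict the majority class for every row.
same_labels = np.array_equal(prior.predict(X), most_frequent.predict(X))

# Probabilities differ: class frequencies versus a one-hot vector.
prior_proba = prior.predict_proba(X[:1])          # [[0.8, 0.2]]
most_frequent_proba = most_frequent.predict_proba(X[:1])  # [[1.0, 0.0]]

# A probability metric such as log loss separates the two floors.
prior_loss = log_loss(y, prior.predict_proba(X))
most_frequent_loss = log_loss(y, most_frequent.predict_proba(X))

print(same_labels, prior_proba, most_frequent_proba)
print(f"log loss: prior={prior_loss:.3f} most_frequent={most_frequent_loss:.3f}")
```

Under log loss the prior strategy is the fairer floor, because most_frequent is penalized heavily for assigning zero probability to the minority class.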

Which Split Should Win¶

Do not ask only whether the model wins. Ask whether it still wins under the split that matches the real task.

random split — useful only when rows are genuinely independent
grouped split — required when repeated entities can leak identity across train and validation
ordered split — required when features are meant to predict the future
leaky split — any split that allows future or duplicate information to cross the boundary; treat as invalid evidence

If a model looks strong on a random split but weak on grouped or ordered validation, the earlier comparison was flattering. The grouped split is not being unfair.
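As a rough sketch of the grouped and ordered cases, scikit-learn's GroupKFold and TimeSeriesSplit can stand in for the random split. The groups array of hypothetical entity ids and the assumption that rows are already sorted by time are illustrative; real data needs its own group key and ordering.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

rng = np.random.default_rng(0)
n_rows = 120
X = rng.normal(size=(n_rows, 3))
y = rng.integers(0, 2, size=n_rows)

# Hypothetical entity ids: 30 entities with 4 rows each.
groups = np.repeat(np.arange(30), 4)

# Grouped split: no entity may appear on both sides of the boundary.
group_overlaps = []
for train_idx, valid_idx in GroupKFold(n_splits=5).split(X, y, groups):
    shared = np.intersect1d(groups[train_idx], groups[valid_idx])
    group_overlaps.append(len(shared))

# Ordered split: assumes rows are sorted by time, so every training row
# comes before every validation row in each fold.
ordered_ok = []
for train_idx, valid_idx in TimeSeriesSplit(n_splits=5).split(X):
    ordered_ok.append(train_idx.max() < valid_idx.min())

print("shared groups per fold:", group_overlaps)
print("train precedes validation in every fold:", all(ordered_ok))
```

Either splitter can be passed as cv to cross_val_score (GroupKFold also needs the groups argument), so the dummy and the learned model can be re-scored under the stricter rule without other changes.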

What To Inspect¶

class balance in train and validation before comparing models
whether every model sees the same split and the same preprocessing boundary
baseline versus learned model under the same assumptions
whether the baseline is strong because the task is easy or because the split is leaking
a hard-label baseline and a probability-aware baseline when the metric uses probabilities

If the baseline already looks suspiciously good, inspect the split before celebrating the model.
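The first inspection item can be sketched in a few lines. The class_shares helper is a hypothetical name introduced here, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a 90/10 class mix (assumption).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)

def class_shares(labels):
    """Return {class: share of rows} for a label vector (hypothetical helper)."""
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))

train_shares = class_shares(y_train)
valid_shares = class_shares(y_valid)

print("train:", train_shares)
print("valid:", valid_shares)
```

If the two dictionaries disagree noticeably, the split itself is the first thing to fix.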

Failure Pattern¶

The most common failure is building features or choosing the model before the split is fixed. Once the split boundary moves casually, the comparison stops being trustworthy.

Other patterns:

letting a good dummy baseline talk you out of a real model — the baseline is a floor, not the goal
using stratify as a cure-all — it preserves class ratios, not group overlap or feature definition
treating one lucky split as a final answer
trusting accuracy on imbalanced data when a rare-class metric (recall, average precision) would tell a more honest story
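The imbalance pattern is easy to reproduce with a toy example. The 95/5 class mix below is an assumption for illustration: a majority-class dummy scores high on accuracy while finding none of the rare class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Toy labels: 950 negatives and 50 positives (illustrative only).
y = np.array([0] * 950 + [1] * 50)
X = np.zeros((1000, 1))  # placeholder features; the dummy ignores them

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = dummy.predict(X)

accuracy = accuracy_score(y, pred)            # 0.95
rare_recall = recall_score(y, pred)           # 0.0, no positive is found
balanced = balanced_accuracy_score(y, pred)   # 0.5, mean of per-class recall

print(f"accuracy={accuracy:.2f} rare-class recall={rare_recall:.2f} "
      f"balanced accuracy={balanced:.2f}")
```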

Common Mistakes¶

fitting preprocessing on the full dataset before splitting
changing the split and the model at the same time
claiming a tuning gain without showing baseline and fold spread
comparing metrics that do not match the task cost
reaching for tuning before the baseline is stable
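One common way to avoid the first mistake in the list above is to put preprocessing inside a Pipeline, so it is refitted on the training portion of every fold. The sketch assumes StandardScaler as the preprocessing step and synthetic data; the same pattern applies to imputers and encoders.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real X and y.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# The scaler is fitted only on each fold's training rows, so no
# validation statistics cross the split boundary.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
fold_scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")

print(f"mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```

Calling StandardScaler().fit(X) on the full dataset before splitting would instead let validation rows influence the scaling parameters.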

Practice¶

Train a dummy baseline and report its validation accuracy.
Train logistic regression on the same split and compare honestly.
State one class-imbalance situation where accuracy would mislead you.
Explain why prior and most_frequent can agree on hard labels but differ on probabilities.
Describe one change that should be postponed until the baseline is stable.
Name one reason a grouped split can cut the score back toward the dummy floor.

Runnable Example¶

Inspect the dummy-versus-logistic comparison first, then compare it with the leaky variant.

Longer Connection¶

Continue with Leakage Patterns for the boundary-crossing failures this page alludes to, then scikit-learn Validation and Tuning for the full split-selection-tuning workflow.