Clinic 23
Feature Selection Or Regularize
You have 2,000 features and 8,000 rows. A teammate wants mutual-information feature selection down to 200. You want L1 regularization on all 2,000. Both feel defensible. Only one is — and sometimes it is neither.
Situation
2k Features, 8k Rows
Wide-and-shallow tabular data. Two defensible regularizers-in-the-broad-sense on the table. The trap is doing selection wrong.
Your Job
Pick One, Defend The Split
Choose feature selection, L1 regularization, or both. The real question is where the cross-validation boundary sits.
Bad Habit To Avoid
When Two Pipelines Beat And Lose
One pipeline pre-selects, then trains. One regularizes inside the fold. Decide which AUC reads as evidence and which reads as artifact.
Situation
You have a tabular classification problem. 8,000 rows, 2,000 numeric features (some engineered, some raw). Binary label, roughly 25% positive.
A teammate has already run mutual-information feature selection and picked the top 200 features. They trained a logistic regression on those 200 features and got 0.87 AUC on a held-out 20% split. PR title: "MI-filtered 200 features, +4 AUC over the naive 2000-feature baseline."
You have a different plan. Train L1-regularized logistic regression on all 2,000 features; let the regularizer zero out the useless ones. Your first try: 0.84 AUC with the same hold-out split. That is 3 points behind the teammate's number.
Before merging the PR, one of you notices: the teammate ran the MI feature selection on the full dataset (all 8,000 rows, including the 20% held-out split) before splitting. The top-200 features were chosen with test-set signal included.
Artifact Packet
Feature selection methods, compared:
| method | mechanism | inside-fold-safe? | typical trap |
|---|---|---|---|
| Mutual information filter | per-feature MI with label | yes — if run inside each fold | looks cheap; easy to accidentally run on full data |
| L1 (LASSO) | joint coefficient shrinkage; some go to zero | yes — fit per fold | tuning C/alpha is the whole game |
| L2 (Ridge) | joint coefficient shrinkage; none exactly zero | yes | keeps all features; model is dense |
| Stepwise / RFE | iterative add/remove | yes — but slow | can be unstable at high feature count |
| Embedded (tree importance) | trained model's feature importance | yes | permutation importance is the honest version |
Key principle: any feature selection method is part of the model. It has to be inside the cross-validation loop or the evaluation leaks.
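One way to keep the selector inside the loop is to make it a `Pipeline` step, so `cross_val_score` refits it on the training part of every split. The sketch below is a minimal illustration, not the clinic's actual pipeline: the data is a small synthetic stand-in from `make_classification`, and `k=10`, the fold count, and the seeds are arbitrary assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): far smaller than 8,000 x 2,000 so it runs quickly.
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=0,
)

# The MI filter is a pipeline step, so every CV split refits it
# on that split's training rows only. The validation rows never vote.
mi_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(
        score_func=partial(mutual_info_classif, random_state=0), k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_auc = cross_val_score(mi_pipe, X, y, cv=cv, scoring="roc_auc")
print(f"per-fold AUC mean={fold_auc.mean():.3f} std={fold_auc.std():.3f}")
```

The same pattern applies to any row of the table above: if the step learns anything from the label or the feature distribution, it belongs inside the pipeline.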
Reported numbers as the team currently sees them:
| pipeline | where MI was fitted | reported test AUC |
|---|---|---|
| teammate: MI on full data → top 200 → logreg | on all 8,000 rows | 0.87 |
| honest: MI inside each fold → top 200 → logreg | per-fold only | (not yet run) |
| yours: L1 logreg on all 2000 features | per-fold only | 0.84 |
Before reading on, ask which of these three numbers is comparable to which. The PR's claimed 4-point gap over the baseline (3 points over your L1 run) on 8k rows is the part of the packet you have to argue about: is it a real signal advantage, a fold-construction artifact, or a leak you can prove?
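Separately from the leak, it helps to know how much sampling noise a single hold-out AUC carries. The sketch below uses the Hanley-McNeil approximation for the standard error of an AUC. The row counts are assumptions derived from the packet (20% of 8,000 rows is 1,600 held-out rows, of which roughly 25%, about 400, are positive), and the difference calculation treats the two AUCs as independent, which they are not, since both pipelines are scored on the same rows.

```python
import math


def auc_standard_error(auc: float, n_pos: int, n_neg: int) -> float:
    """Hanley-McNeil approximation to the standard error of an AUC."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


# Assumed hold-out composition: 1,600 rows at roughly 25% positive.
n_pos, n_neg = 400, 1200

se_teammate = auc_standard_error(0.87, n_pos, n_neg)
se_l1 = auc_standard_error(0.84, n_pos, n_neg)

# Rough standard error of the difference, ignoring that both models
# are scored on the same rows (a simplifying assumption).
se_diff = math.hypot(se_teammate, se_l1)

print(f"SE(0.87)={se_teammate:.4f}  SE(0.84)={se_l1:.4f}  SE(diff)={se_diff:.4f}")
```

Under these assumptions each AUC carries a standard error of about 0.012 to 0.013, so a 3-point gap is not overwhelming even before the leak is considered. A paired method (bootstrap over the held-out rows, or DeLong's test) would give a tighter answer.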
Decision Prompt
Write a six-sentence defense that answers:
- Is the teammate's PR ready to merge? If no, what single change does it need?
- Once both methods are run correctly inside CV, which do you expect to win and by how much?
- What is the tuning surface for each method — and which one is easier to defend with a small grid?
- What would make you pick feature selection and L1 together?
- When does the naive 2,000-feature baseline with no selection and no regularization beat both?
- If AUC is within 0.5 points between MI-in-fold and L1, which ships? Why?
Strong Reasoning Looks Like
- blocks the merge because MI was fitted on full data; requires the MI to be done inside a `Pipeline` + `GridSearchCV` so it sees only the training fold of each split
- predicts the honest rerun: with MI moved inside CV, the AUC drops to ~0.82–0.83, roughly tied with L1 at 0.84
- chooses L1 logistic regression as the default when the honest numbers are close: it has one tuning knob (`C`), it is interpretable (coefficients tell you what survived), and it is inside-fold-safe by construction
- adds MI as a companion inspection, not a competitor: after L1 fits, compare which features L1 kept vs. which MI would have ranked; if they agree, the choice is robust
- defends keeping all 2,000 features when rows are scarce: on 8k rows, L1 on 2k features with cross-validated `C` is usually competitive with any handcrafted filter, and the filter adds a leakage risk
- names the failure scenario for "no selection, no regularization": on 8k rows × 2k features, plain logistic regression without any regularization is numerically unstable and will overfit severely; neither option wins when you remove both controls
Common Wrong Moves
- merging the 0.87 AUC PR — any number that comes from feature selection run on full data is evaluation pollution
- running MI inside a `Pipeline` but then grid-searching `k` (the number of features) using the same held-out set for final reporting: grid searching is still fitting, and the outer split must be disjoint
- picking L1 and then standardizing features once after the train/test split instead of inside each CV fold: the scaler has to be in the pipeline too
- claiming L1 is "better" on a 0.3-point AUC edge — within single-fold noise
- using MI as a pre-filter to 500 features then running L1 on the survivors — the MI step still has to be per-fold, even as a pre-step
- trusting tree-feature-importance from a boosting model that was trained on all features without inside-fold selection; permutation importance at evaluation time is honest, tree importance at training time is not
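The `k`-tuning mistake above has a standard fix: nest the search. The sketch below is one possible arrangement, assuming synthetic data and an arbitrary three-value grid for `k`; the inner `GridSearchCV` picks `k`, and the outer `cross_val_score` reports on rows the search never touched.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption), kept small so the nested loop is quick.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=6,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(
        score_func=partial(mutual_info_classif, random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner loop chooses k; outer loop scores the whole "search + fit" procedure.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

search = GridSearchCV(
    pipe,
    param_grid={"select__k": [5, 10, 20]},  # hypothetical grid
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold reruns the full grid search on its own training rows.
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"nested-CV AUC mean={outer_auc.mean():.3f}")
```

The number to report is the outer mean, not `best_score_` from any single inner search.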
Run The Clinic In Browser
Use the runner to simulate the leakage — MI on full data vs. MI inside Pipeline+cross_val_score — and watch the AUC drop when the leak is closed.
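For readers without the runner, here is a minimal offline sketch of the same experiment. It is not the runner's code. The assumptions: the data is pure noise (labels drawn independently of the features, so the true AUC is 0.5), the sizes are far smaller than the clinic's, and the helper name `leak_vs_honest` is made up for this example. The scoring function is pluggable so a faster filter such as `f_classif` can be swapped in for MI.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def leak_vs_honest(score_func=mutual_info_classif, n_rows=200,
                   n_features=500, k=20, seed=0):
    """Return (leaked_auc, honest_auc) on noise data whose true AUC is 0.5."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_features))
    y = rng.integers(0, 2, size=n_rows)  # labels carry no signal about X
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

    # Leaked: choose the top-k features once, using every row, then cross-validate.
    X_leaked = SelectKBest(score_func=score_func, k=k).fit_transform(X, y)
    leaked = cross_val_score(
        LogisticRegression(max_iter=1000), X_leaked, y,
        cv=cv, scoring="roc_auc",
    ).mean()

    # Honest: the selector is a pipeline step, refit on each training fold.
    honest_pipe = Pipeline([
        ("select", SelectKBest(score_func=score_func, k=k)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    honest = cross_val_score(
        honest_pipe, X, y, cv=cv, scoring="roc_auc",
    ).mean()
    return leaked, honest


leaked_auc, honest_auc = leak_vs_honest()
print(f"MI on full data: {leaked_auc:.3f}   MI inside the fold: {honest_auc:.3f}")
```

Because there is no signal to find, the honest estimate should hover near 0.5, and any margin the leaked pipeline shows over it is the leak itself.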
Reference Reveal
Open only after you write the defense
The reference call is **L1 logistic regression with `C` tuned via `GridSearchCV` inside a `Pipeline`; no separate feature-selection step**.

Why:

- on 8,000 rows × 2,000 features, L1 is the right tool. It does the selection *jointly* with fitting, inside each fold automatically. No leakage surface to police.
- MI is fine in principle but its leakage surface (the temptation to run it once on full data) is large. Teams ship MI-based pipelines with leaked selections more often than they ship clean ones.
- once the MI pipeline is corrected (selection inside the fold), the AUC almost always comes back to within 1 point of L1; the 4-point "gap" in the teammate's PR was the leak talking.

The right pipeline:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Scaling lives inside the pipeline so it is refit on each training fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=2000)),
])

# One knob: C (inverse regularization strength; smaller C zeroes more coefficients).
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    n_jobs=-1,
)

# X_train, y_train: the training portion only; the final hold-out stays outside the search.
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```
What this gets you:
- honest per-fold CV; no leakage
- single tuning knob (`C`); a 6-point grid is enough
- interpretable coefficients; the non-zero features are the selected ones
- a baseline honest AUC you can trust
Inspection after the fit:
- count the non-zero coefficients; L1 typically keeps 100–400 features at the best `C` on this shape of data
- compare L1's kept features vs. MI's top 200: if 60–80% overlap, the selection is robust (same signal, two methods)
- check **permutation importance** on a held-out slice for the top 20 L1 features; permutation importance is the honest-at-evaluation-time check
- look at the held-out AUC vs. the CV AUC; if held-out is 1 point above CV, suspect a leak even in the "clean" pipeline
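The first three inspection steps can be sketched as follows. This is an illustration under assumptions, not the reference solution: the data is synthetic, `C=0.1` is fixed instead of tuned, `k=20` stands in for the clinic's top 200, and "overlap" is defined here as the fraction of MI's top-`k` features that L1 also kept, which is only one reasonable definition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for the clinic's 8,000 x 2,000 table.
X, y = make_classification(
    n_samples=600, n_features=80, n_informative=10,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=0,
)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                               C=0.1, max_iter=2000)),
])
pipe.fit(X_train, y_train)

# 1. Which features survived the L1 penalty?
coef = pipe.named_steps["clf"].coef_.ravel()
l1_kept = set(np.flatnonzero(coef).tolist())
print("non-zero coefficients:", len(l1_kept))

# 2. MI ranking computed on training rows only, then compared with L1's survivors.
k = 20
mi_scores = mutual_info_classif(X_train, y_train, random_state=0)
mi_top = set(np.argsort(mi_scores)[::-1][:k].tolist())
overlap = len(l1_kept & mi_top) / k
print(f"share of MI top-{k} also kept by L1: {overlap:.2f}")

# 3. Permutation importance on the held-out slice, scored by AUC.
perm = permutation_importance(
    pipe, X_hold, y_hold, scoring="roc_auc", n_repeats=5, random_state=0,
)
top_by_perm = np.argsort(perm.importances_mean)[::-1][:5]
print("top held-out permutation importances:", top_by_perm.tolist())
```

The MI ranking here is an inspection aid only. It never feeds back into the fitted model, so it adds no leakage surface.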
When to deviate:
- if row count grows to 50k+: a small tree-based model (gradient boosting, 500 features after embedded selection) often wins; at 8k rows, linear + L1 is the safer frontier
- if the feature count grows to 20k+: add a cheap **`VarianceThreshold`** filter as the first pipeline step (it is safe to run inside the pipeline and, at its default threshold, drops only constant features, so nothing informative is lost)
- if interpretability is a hard requirement (regulated industry): stepwise or RFE with stability selection may be worth the extra cost; document the stability runs
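For the `VarianceThreshold` deviation, a minimal sketch of where the step sits. The toy data (ten columns, two of them deliberately constant) and the step name `drop_constant` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
X[:, 3] = 1.0  # constant column
X[:, 7] = 0.0  # another constant column
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

pipe = Pipeline([
    # Default threshold=0.0 removes only zero-variance columns.
    # It runs before scaling, which would otherwise map them to all zeros.
    ("drop_constant", VarianceThreshold()),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=2000)),
])
pipe.fit(X, y)

kept = pipe.named_steps["drop_constant"].get_support()
print("columns kept by VarianceThreshold:", int(kept.sum()))
```

The filter never looks at the label, which is why it is a low-risk first step; it still belongs in the pipeline so the fitted column mask travels with the model.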
The practical lesson: **feature selection, done honestly, is a special case of regularization**. L1 does both steps at once, with one knob, and the knob sits on the right side of the CV boundary by construction. The only time you need a separate selection step is when the selection is itself interpretable — at which point, stability is the thing to measure, not AUC.
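If stability is the thing to measure, one simple way to measure it is to refit the L1 model on repeated random subsamples and record how often each feature survives. The clinic does not prescribe a procedure, so everything in this sketch is an assumption: the helper name `selection_frequency`, the half-size subsamples, the 20 rounds, the fixed `C`, and the synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def selection_frequency(X, y, C=0.1, n_rounds=20, frac=0.5, seed=0):
    """Share of subsample fits in which each feature has a non-zero L1 coefficient."""
    rng = np.random.default_rng(seed)
    n_rows, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Draw a random subsample without replacement and refit from scratch.
        idx = rng.choice(n_rows, size=int(frac * n_rows), replace=False)
        pipe = Pipeline([
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(penalty="l1", solver="liblinear",
                                       C=C, max_iter=2000)),
        ])
        pipe.fit(X[idx], y[idx])
        counts += pipe.named_steps["clf"].coef_.ravel() != 0
    return counts / n_rounds


# Synthetic stand-in (assumption) for the clinic's data.
X, y = make_classification(
    n_samples=400, n_features=50, n_informative=6,
    n_clusters_per_class=1, weights=[0.75, 0.25], random_state=0,
)
freq = selection_frequency(X, y)
stable = np.flatnonzero(freq >= 0.8)  # 0.8 is an arbitrary cutoff
print("features kept in at least 80% of rounds:", stable.tolist())
```

Features that survive most rounds are the ones worth naming in a report; features that flicker in and out are the ones an AUC comparison would never reveal as fragile.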
What To Do Next
- open Feature Selection for filter/wrapper/embedded distinctions
- open Honest Splits and Baselines for pipeline leakage patterns
- open Data Cleaning Choice — the upstream clinic on column handling
- rerun your leaderboard with all feature-selection steps moved inside the CV loop; a ranking that survives is the honest ranking