Clinic 23

Feature Selection Or Regularize

You have 2,000 features and 8,000 rows. A teammate wants mutual-information feature selection down to 200. You want L1 regularization on all 2,000. Both feel defensible. Only one is — and sometimes it is neither.

Situation

2k Features, 8k Rows

Wide-and-shallow tabular data. Two defensible regularizers-in-the-broad-sense on the table. The trap is doing selection wrong.

Your Job

Pick One, Defend The Split

Choose feature selection, L1 regularization, or both. The real question is where the cross-validation boundary sits.

Bad Habit To Avoid

When Two Pipelines Beat And Lose

One pipeline pre-selects, then trains. One regularizes inside the fold. Decide which AUC reads as evidence and which reads as artifact.

Situation

You have a tabular classification problem. 8,000 rows, 2,000 numeric features (some engineered, some raw). Binary label, roughly 25% positive.

A teammate has already run mutual-information feature selection and picked the top 200 features. They trained a logistic regression on those 200 features and got 0.87 AUC on a held-out 20% split. PR title: "MI-filtered 200 features, +4 AUC over the naive 2000-feature baseline."

You have a different plan. Train L1-regularized logistic regression on all 2,000 features; let the regularizer zero out the useless ones. Your first try: 0.84 AUC with the same hold-out split. That is 3 points behind the teammate's number.

Before merging the PR, one of you notices: the teammate ran the MI feature selection on the full dataset (all 8,000 rows, including the 20% held-out split) before splitting. The top-200 features were chosen with test-set signal included.

Artifact Packet

Feature selection methods, compared:

| method | mechanism | inside-fold-safe? | typical trap |
| --- | --- | --- | --- |
| Mutual information filter | per-feature MI with label | yes, if run inside each fold | looks cheap; easy to accidentally run on full data |
| L1 (LASSO) | joint coefficient shrinkage; some go to zero | yes, fit per fold | tuning C/alpha is the whole game |
| L2 (Ridge) | joint coefficient shrinkage; none exactly zero | yes | keeps all features; model is dense |
| Stepwise / RFE | iterative add/remove | yes, but slow | can be unstable at high feature count |
| Embedded (tree importance) | trained model's feature importance | yes | permutation importance is the honest version |

Key principle: any feature selection method is part of the model. It has to be inside the cross-validation loop or the evaluation leaks.
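
A minimal sketch of what "inside the cross-validation loop" can look like with scikit-learn. The synthetic data, its size, and `k=20` are assumptions chosen so the example runs in seconds; they are not the clinic's 8,000 × 2,000 dataset. The point is only that the MI ranking is a `Pipeline` step, so `cross_val_score` refits it on each training fold.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): small and roughly 25% positive.
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10,
    weights=[0.75, 0.25], random_state=0,
)

# Fix the MI estimator's seed so repeated runs rank features the same way.
mi_score = partial(mutual_info_classif, random_state=0)

# Every fitted step is a Pipeline step, so each CV split refits the scaler,
# the MI ranking, and the classifier on that split's training fold only.
mi_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=mi_score, k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_auc = cross_val_score(mi_pipe, X, y, cv=cv, scoring="roc_auc")
print("per-fold AUC:", fold_auc.round(3), "mean:", round(fold_auc.mean(), 3))
```

The leaky version differs by one move: calling `SelectKBest(...).fit_transform(X, y)` on all rows first and cross-validating only the classifier.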

Reported numbers as the team currently sees them:

| pipeline | where MI was fitted | reported test AUC |
| --- | --- | --- |
| teammate: MI on full data → top 200 → logreg | on all 8,000 rows | 0.87 |
| honest: MI inside each fold → top 200 → logreg | per-fold only | (not yet run) |
| yours: L1 logreg on all 2000 features | per-fold only | 0.84 |

Before reading on, ask which of these three numbers is comparable to which. The gap is the part of the packet you have to argue about: 3 points between 0.87 and 0.84 on the same hold-out split, or the 4 points the PR title claims over the naive baseline. Is it a real signal advantage, a fold-construction artifact, or a leak you can prove?

Decision Prompt

Write a six-sentence defense that answers:

  1. Is the teammate's PR ready to merge? If no, what single change does it need?
  2. Once both methods are run correctly inside CV, which do you expect to win and by how much?
  3. What is the tuning surface for each method — and which one is easier to defend with a small grid?
  4. What would make you pick feature selection and L1 together?
  5. When does the naive 2,000-feature baseline with no selection and no regularization beat both?
  6. If AUC is within 0.5 points between MI-in-fold and L1, which ships? Why?

Strong Reasoning Looks Like

  • blocks the merge because MI was fitted on full data; requires the MI to be done inside a Pipeline + GridSearchCV so it sees only the training fold of each split
  • predicts the honest rerun: with MI moved inside CV, the AUC drops to ~0.82–0.83 — roughly tied with L1 at 0.84
  • chooses L1 logistic regression as the default when the honest numbers are close: it has one tuning knob (C), it is interpretable (coefficients tell you what survived), and it is inside-fold-safe by construction
  • adds MI as a companion inspection, not a competitor: after L1 fits, compare which features L1 kept vs. which MI would have ranked; if they agree, the choice is robust
  • defends keeping all 2,000 features when rows are scarce: on 8k rows, L1 on 2k features with cross-validated C is usually competitive with any handcrafted filter, and the filter adds a leakage risk
  • names the failure scenario for "no selection, no regularization": on 8k rows × 2k features, plain logistic regression without any regularization is numerically unstable and will overfit severely — neither option wins when you remove both controls
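
One way to judge whether two honest numbers are "close" is to score both pipelines on the same repeated folds and look at the per-fold difference rather than two means. The sketch below assumes synthetic data, a fixed `C=0.1`, and `k=20`; all three are placeholders, not tuned values from the clinic.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption), small enough to run quickly.
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10,
    weights=[0.75, 0.25], random_state=0,
)

mi_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0), k=20)),
    ("clf", LogisticRegression(max_iter=1000)),
])
l1_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
])

# An integer random_state makes both calls see identical splits,
# so the fold-by-fold differences are paired.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=0)
mi_auc = cross_val_score(mi_pipe, X, y, cv=cv, scoring="roc_auc")
l1_auc = cross_val_score(l1_pipe, X, y, cv=cv, scoring="roc_auc")

diff = l1_auc - mi_auc
print("mean L1 - MI AUC difference:", round(diff.mean(), 4))
print("std of per-fold differences:", round(diff.std(ddof=1), 4))
```

If the mean difference is small relative to the fold-to-fold spread, the AUC gap is not a reason to prefer either pipeline, and the tie-break falls to tuning surface and leakage risk.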

Common Wrong Moves

  • merging the 0.87 AUC PR — any number that comes from feature selection run on full data is evaluation pollution
  • running MI inside a Pipeline but then grid-searching k (the number of features) using the same held-out set for final reporting — grid searching is still fitting, and the outer split must be disjoint
  • picking L1 and then standardizing features after splitting — the scaler has to be in the pipeline too
  • claiming L1 is "better" on a 0.3-point AUC edge — within single-fold noise
  • using MI as a pre-filter to 500 features then running L1 on the survivors — the MI step still has to be per-fold, even as a pre-step
  • trusting tree-feature-importance from a boosting model that was trained on all features without inside-fold selection; permutation importance at evaluation time is honest, tree importance at training time is not
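
The second and third wrong moves have the same fix: put the scaler and the selector in the pipeline, tune `k` on an inner split, and report on an outer split the search never saw. A hedged sketch using nested cross-validation follows; the data, the `k` grid, and the fold counts are illustrative assumptions.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption).
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=8,
    weights=[0.75, 0.25], random_state=0,
)

pipe = Pipeline([
    ("scale", StandardScaler()),  # the scaler is fitted per training fold too
    ("select", SelectKBest(partial(mutual_info_classif, random_state=0))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner loop: chooses k using training data only.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(
    pipe,
    param_grid={"select__k": [5, 10, 20]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Outer loop: each outer test fold is disjoint from the data the search saw.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_auc = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print("outer-fold AUC:", outer_auc.round(3), "mean:", round(outer_auc.mean(), 3))
```

The number to report is the outer mean. The `best_score_` from a single `GridSearchCV` fit is the score that picked `k`, so it is optimistic by construction.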

Run The Clinic In Browser

Use the runner to simulate the leakage — MI on full data vs. MI inside Pipeline+cross_val_score — and watch the AUC drop when the leak is closed.
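
If the runner is not available, the same effect can be reproduced offline. The sketch below is an assumption-laden stand-in, not the runner's code: it uses pure-noise features and random labels, so the true AUC is 0.5, and it swaps in `f_classif` as the filter because it is much faster than `mutual_info_classif`. The leak mechanism is the same for any univariate filter.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Pure noise (assumption): no feature carries any real signal about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))
y = rng.integers(0, 2, size=200)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
K = 20  # number of features kept; illustrative

# Leaky: the filter sees every row, including rows later used for scoring.
X_leaky = SelectKBest(f_classif, k=K).fit_transform(X, y)
leaky_auc = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=cv, scoring="roc_auc"
).mean()

# Honest: the filter is a pipeline step, refit on each training fold.
honest_pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=K)),
    ("clf", LogisticRegression(max_iter=1000)),
])
honest_auc = cross_val_score(honest_pipe, X, y, cv=cv, scoring="roc_auc").mean()

print(f"leaky AUC:  {leaky_auc:.3f}")
print(f"honest AUC: {honest_auc:.3f}")
```

On noise, any AUC the leaky path reports above chance is produced by the selection step alone. The honest path should land near 0.5, give or take fold noise.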

Reference Reveal

Open only after you write the defense

The reference call is **L1 logistic regression with `C` tuned via `GridSearchCV` inside a `Pipeline`; no separate feature-selection step**.

Why:

- on 8,000 rows × 2,000 features, L1 is the right tool. It does the selection *jointly* with fitting, inside each fold automatically. No leakage surface to police.
- MI is fine in principle but its leakage surface (the temptation to run it once on full data) is large. Teams ship MI-based pipelines with leaked selections more often than they ship clean ones.
- once the MI pipeline is corrected (selection inside the fold), the AUC almost always comes back to within 1 point of L1; the 4-point "gap" in the teammate's PR was the leak talking.

The right pipeline:

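```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaler and classifier are both pipeline steps, so each CV split
# refits them on its training fold only.
pipe = Pipeline([
    ("scale", StandardScaler()),
    # liblinear is one of the solvers that supports the L1 penalty.
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", max_iter=2000)),
])

# C is the single tuning knob; smaller C means stronger shrinkage.
grid = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.03, 0.1, 0.3, 1.0, 3.0]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="roc_auc",
    n_jobs=-1,
)
# X_train, y_train: the training portion only; the final hold-out stays untouched.
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```
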
What this gets you:

- honest per-fold CV; no leakage
- single tuning knob (`C`); a 6-point grid is enough
- interpretable coefficients; the non-zero features are the selected ones
- a baseline honest AUC you can trust

Inspection after the fit:

- count the non-zero coefficients; L1 typically keeps 100–400 features at the best `C` on this shape of data
- compare L1's kept features vs. MI's top 200: if 60–80% overlap, the selection is robust (same signal, two methods)
- check **permutation importance** on a held-out slice for the top 20 L1 features; permutation importance is the honest-at-evaluation-time check
- look at the held-out AUC vs. the CV AUC; if held-out is 1 point above CV, suspect a leak even in the "clean" pipeline

When to deviate:

- if row count grows to 50k+: a small tree-based model (gradient boosting, 500 features after embedded selection) often wins; at 8k rows, linear + L1 is the safer frontier
- if the feature count grows to 20k+: add a cheap **`VarianceThreshold`** filter as the first pipeline step (it never looks at the label, it is safe to run inside the pipeline, and at its default threshold it drops only constant features)
- if interpretability is a hard requirement (regulated industry): stepwise or RFE with stability selection may be worth the extra cost; document the stability runs

The practical lesson: **feature selection, done honestly, is a special case of regularization**. L1 does both steps at once, with one knob, and the knob sits on the right side of the CV boundary by construction. The only time you need a separate selection step is when the selection is itself interpretable. At that point, stability is the thing to measure, not AUC.
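
The inspection steps can be sketched as follows. This is an illustration under assumptions: synthetic data, a fixed `C=0.1` standing in for the tuned value (in practice the fitted pipeline would come from `grid.best_estimator_`), and `k=20` in place of the clinic's 200 because the toy data has only 100 features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption).
X, y = make_classification(
    n_samples=600, n_features=100, n_informative=10,
    weights=[0.75, 0.25], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(penalty="l1", solver="liblinear", C=0.1)),
])
pipe.fit(X_train, y_train)

# 1. Which features did L1 keep (non-zero coefficients)?
coef = pipe.named_steps["clf"].coef_.ravel()
l1_kept = set(np.flatnonzero(coef).tolist())
print("L1 kept", len(l1_kept), "of", coef.size, "features")

# 2. Overlap with MI's top-k, ranked on the training rows only.
k = 20
mi = mutual_info_classif(X_train, y_train, random_state=0)
mi_top = set(np.argsort(mi)[::-1][:k].tolist())
overlap = len(l1_kept & mi_top) / k
print(f"share of MI top-{k} also kept by L1: {overlap:.2f}")

# 3. Permutation importance on the held-out slice for L1's strongest features.
perm = permutation_importance(
    pipe, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
top_l1 = np.argsort(np.abs(coef))[::-1][:k]
print("held-out permutation importance (top L1 features):")
print(perm.importances_mean[top_l1].round(4))
```

A feature with a large coefficient but near-zero held-out permutation importance is a candidate for closer inspection before anyone reads meaning into it.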

What To Do Next

  1. open Feature Selection for filter/wrapper/embedded distinctions
  2. open Honest Splits and Baselines for pipeline leakage patterns
  3. open Data Cleaning Choice — the upstream clinic on column handling
  4. rerun your leaderboard with all feature-selection steps moved inside the CV loop; a ranking that survives is the honest ranking