Skip to content

Clinic 22

Data Cleaning Choice

One messy column. 27% missing values. Four plausible fixes: drop the column, drop the rows, impute with the mean, or keep the missingness as a signal. Only one is defensible for this task.

Situation

27% Missing On A Critical Column

You can drop, impute, model the missingness, or treat it as a feature. The right move depends on a property of the data you have to argue from.

Your Job

Pick One Handling

Drop, row-filter, impute, or treat missingness as a feature. Defend it under the task constraints.

Bad Habit To Avoid

Four Recipes, One Mechanism

Drop, mean-impute, model-impute, or keep. Argue from the missingness mechanism, not from the validation metric.

Situation

You are building a churn classifier for a subscription business. The dataset has 80,000 rows and 45 features. Most columns are fine; one column — last_support_ticket_days (days since the customer's most recent support interaction) — is 27% missing.

You check why it is missing. The product data team confirms:

  • the field is populated only for customers who have ever opened a support ticket
  • customers who never opened a ticket show NULL
  • churn rate for never-opened customers is ~4%; churn rate for ever-opened customers is ~15%

The missingness is MNAR (Missing Not At Random) — the reason a value is missing is itself informative about the label. 27% of rows are missing, and that 27% is disproportionately the never-churners.

Four fixes have been proposed in review:

  • A. Drop the column — simplest; remove last_support_ticket_days entirely
  • B. Drop the rows — keep the column; drop the 27% of rows where it is missing
  • C. Impute with the mean — fill NULL with the mean of non-missing values; keep all rows
  • D. Keep missingness as a signal — add a boolean has_ever_ticketed; fill NULL with a sentinel (e.g., −1 or 9999); keep the original column

The business wants the best model by Thursday. The data scientist on the team has voted C. You have five minutes to agree or disagree before the PR closes.

Artifact Packet

Mechanism check — what each fix assumes about the missingness:

fix assumes about missingness typical failure mode
A. Drop column the column is not worth the hassle discards useful information
B. Drop rows missingness is MCAR; rows are interchangeable training prior shifts if the missing rows are not random
C. Mean impute missingness is MAR; the conditional mean fills the gap reasonably blurs any signal carried by the missingness itself
D. Missingness as feature missingness is itself informative adds little if the missingness is genuinely random

Decide which of MCAR / MAR / MNAR best describes this dataset before reading the next block. Then read the data-side facts:

  • the 27% of users with days_since_last_ticket = NaN are users who have never opened a ticket
  • among never-ticketers, the empirical churn rate is ~3% (well below the 9% global rate)
  • among ticketers, the conditional churn rate sits around 11% and rises with the days-since value
  • a bare has_ever_ticketed indicator on its own already moves AUC by +0.02

The packet does not tell you which option is right. Argue from the mechanism, then check whether the data-side facts above support your argument.

Cross-reference:

  • mean imputation on an MNAR column is a known pattern for silently moving the prevalence of the label in your training set, which propagates into miscalibrated predictions
  • prior-shift from row-dropping is documented in Leakage Patterns under "prior shift from filtering"

Decision Prompt

Write a six-sentence defense that answers:

  1. Which fix do you pick, and what does it assume about the missingness mechanism?
  2. How do you prove the mechanism is MNAR, MAR, or MCAR from the data in hand?
  3. How will the chosen fix affect your training-set class prior, and how do you correct for that at deployment?
  4. What single inspection shows whether mean imputation is silently degrading the model?
  5. What would change your pick if the missingness were only 2%?
  6. What would change your pick if the column were fully random missingness (MCAR)?

Strong Reasoning Looks Like

  • picks D (missingness as signal) and calls the mechanism MNAR — the never-ticketers are a structurally different population
  • rejects B (drop rows) explicitly: losing 27% of the training set, skewed toward non-churners, moves the class prior and makes calibration a mess
  • rejects C (mean imputation): on MNAR data, the mean is the wrong number; the missingness itself is the predictive feature and mean imputation destroys it
  • rejects A (drop column) because the churn rate difference (4% vs 15%) is large enough that discarding the signal loses real AUC
  • names the inspection: compare held-out AUC and calibration between C and D on the same split; if D beats C by ≥ 2 AUC points, the mechanism argument is confirmed empirically
  • adjusts for the 2% case: at 2% missingness, C and D are near-identical in AUC; simpler is fine. At 27%, the mechanism matters.

Common Wrong Moves

  • merging option C because "imputation is standard" — standard for MCAR/MAR, not MNAR
  • dropping the 27% of rows to "keep the column clean" — you just dropped the low-churn segment and broke the prior
  • dropping the column to avoid the conversation — 4% vs 15% churn split between ticketers/never-ticketers is signal, not noise
  • coding the missingness as 0 (or -1) without the companion boolean has_ever_ticketed — now the model has to untangle "small value" from "never had one"
  • using SimpleImputer(strategy="mean") inside a pipeline and calling it solved — the pipeline is the right structure; the strategy is the wrong one here
  • skipping the sensitivity analysis at the boundary — at 2% missing, D and C are interchangeable; at 27%, they are not

Run The Clinic In Browser

Use the runner to generate a small MNAR dataset and compare AUC and calibration across the four options; option D wins when missingness is informative.

Reference Reveal

Open only after you write the defense The reference call is **D — missingness as signal, plus the original value kept as a feature for non-missing rows**. Mechanical implementation:
df["has_ever_ticketed"] = df["last_support_ticket_days"].notna().astype(int)
df["last_support_ticket_days"] = df["last_support_ticket_days"].fillna(-1)
# a tree-based model handles the -1 sentinel fine; for linear models use missingness
# flag + mean-imputed value with the flag
Why each rejected option fails: - **A (drop column)**: you lose the ~11-point churn-rate gap between ticketers and never-ticketers — a major feature for any realistic churn classifier. Expected cost: 2–4 AUC points. - **B (drop rows)**: class prior moves from 9% churn → ~12% churn; every probability output is miscalibrated; recall at fixed precision moves unpredictably; on deployment the prior mismatch becomes visible the first week. - **C (mean impute)**: mean days-since-ticket is some middling number like 40 days; you are telling the model the never-ticketers *have* ticketed, 40 days ago. The model cannot recover the MNAR signal from that. Validation plan: - split first (with `StratifiedKFold` to preserve churn prevalence), then impute *inside the pipeline* per fold — never fit the imputer on the full data before splitting - measure AUC, Brier score, and calibration error on the held-out folds; D should beat C by ≥ 2 AUC points and have 2–3× better calibration - slice the eval: recall@20%-budget on churners-who-never-ticketed; if this slice collapses under option C, that is the smoking gun for the MNAR argument Mechanism diagnostic (what you would do if the mechanism were *not* known): - compare the label distribution on rows with and without the value: if `P(y|missing) ≈ P(y)`, missingness is MCAR/MAR; if they differ by > 1 percentage point, it is MNAR - a 4% vs 15% split (from the artifact packet) is MNAR by a wide margin Edge cases: - if missingness were only 2%, all four options are within 0.5 AUC of each other; pick the one the pipeline already supports and move on - if the column were MCAR (missingness truly random, e.g., a race-condition in logging) and ≤ 10%, mean imputation is fine and the extra boolean feature adds noise - if the column were MCAR but > 30% missing, dropping the column is often cleaner than imputing over half-fabricated values The practical lesson: **the right cleaning fix depends on the missingness mechanism, not the percentage missing**. `fillna(mean)` is a default that is right by accident when missingness is MCAR and wrong by design when missingness is MNAR. The diagnostic — compare `P(y|missing)` to `P(y)` — takes one line of pandas and decides the fix.

What To Do Next

  1. open Tabular Feature Engineering for the missingness-as-feature pattern in depth
  2. open Leakage Patterns for prior shift and filter-based leakage
  3. open Feature Selection Or Regularize — the adjacent clinic on what to do after cleaning is handled
  4. run the P(y|missing) diagnostic on every column in your dataset with ≥ 5% missing; most cleaning decisions come straight out of that table