Clinic 06
Ensemble Temptation
The stacked ensemble is the best thing on the board. But inference is five times slower, the pipeline is three times more complex, and the gain over the single model is small. Is it worth it?
Situation
Tiny Gain, Heavy Cost
The ensemble wins every metric by a small margin. But the operational cost is not small.
Your Job
Ship Simple Or Ship Complex
Decide whether the gain justifies the cost. Name the threshold where the ensemble becomes worth it.
Bad Habit To Avoid
Best Score = Best Model
If the decision ignores inference cost, maintenance burden, and failure modes, the comparison is incomplete.
Situation¶
You are building a customer churn predictor for a weekly batch scoring job. Two candidates survived the validation phase.
The packet says:
- the tuned gradient boosting model is strong and simple
- the stacked ensemble (gradient boosting + random forest + logistic regression + meta-learner) is marginally better
- the deployment pipeline must be maintainable by a two-person team
Artifact Packet¶
Read this packet before you decide:
| candidate | val ROC AUC | val avg precision | inference time (1k rows) | pipeline components | weekly retrain time |
|---|---|---|---|---|---|
tuned_gbm |
0.847 | 0.612 | 0.3s | 1 model + preprocessor | 4 min |
stacked_ensemble |
0.861 | 0.638 | 1.7s | 4 models + meta-learner + preprocessor | 22 min |
Stability across 5 CV folds:
| candidate | fold 1 | fold 2 | fold 3 | fold 4 | fold 5 | std |
|---|---|---|---|---|---|---|
tuned_gbm |
0.841 | 0.852 | 0.839 | 0.849 | 0.854 | 0.006 |
stacked_ensemble |
0.867 | 0.858 | 0.843 | 0.871 | 0.866 | 0.010 |
Weak slice (new_customers, first 30 days):
| candidate | ROC AUC | avg precision |
|---|---|---|
tuned_gbm |
0.791 | 0.489 |
stacked_ensemble |
0.802 | 0.507 |
Decision Prompt¶
Write the note before you open the reveal.
Your note should answer:
- Which candidate would you deploy?
- Is the ensemble gain large enough to justify the operational cost?
- What gain threshold would change your mind?
- Does the weak slice difference matter enough to override the simplicity argument?
Keep the note short. Four to six sentences is enough.
Strong Reasoning Looks Like¶
- it acknowledges the ensemble wins on every metric but questions whether the margin justifies the cost
- it considers the maintenance burden for a two-person team
- it notes the ensemble has higher fold variance (0.010 vs 0.006)
- it names a concrete scenario where the ensemble becomes worth it (e.g., if the weak slice gap were 5+ points instead of 1.1)
- it is willing to ship the simpler model when the gain is marginal
Common Wrong Moves¶
- choosing the ensemble because it is best on every metric without mentioning cost
- rejecting the ensemble by saying "simple is always better" without engaging with the numbers
- ignoring the fold stability difference
- treating the weak slice delta of 1.1 points as decisive when it is within noise
Run The Clinic In Browser¶
Validate Your Decision In Browser¶
Reference Reveal¶
Open only after you write the note
The reference choice is: - `selected_candidate = tuned_gbm` - `decision = deploy the simpler model` Why: - the ROC AUC gap is 0.014 and the average precision gap is 0.026, both marginal - the ensemble is 5.7x slower at inference and 5.5x slower to retrain - the ensemble has higher fold variance, meaning its advantage is less stable - the weak slice improvement (1.1 points ROC AUC) is within the fold noise - a two-person team maintaining a 4-model stacked pipeline creates operational risk that outweighs a marginal accuracy gain The ensemble becomes worth it when: the gain on the critical slice is large and stable (not 1.1 points), or the inference budget is not a constraint, or the team grows enough to absorb the complexity. The practical lesson: a marginal gain that costs real operational complexity is usually not worth shipping.What To Do Next¶
After this clinic:
- open Ensemble Methods
- run the matching ensemble comparison example
- use scikit-learn Validation and Tuning for the full model selection workflow