Clinic 06

Ensemble Temptation

The stacked ensemble is the best thing on the board. But inference is five times slower, the pipeline is three times more complex, and the gain over the single model is small. Is it worth it?

Back To Clinics Open Ensemble Topic Open The Full Track

Situation

Tiny Gain, Heavy Cost

The ensemble wins every metric by a small margin. But the operational cost is not small.

Your Job

Ship Simple Or Ship Complex

Decide whether the gain justifies the cost. Name the threshold where the ensemble becomes worth it.

Bad Habit To Avoid

Best Score = Best Model

If the decision ignores inference cost, maintenance burden, and failure modes, the comparison is incomplete.

Situation¶

You are building a customer churn predictor for a weekly batch scoring job. Two candidates survived the validation phase.

The packet says:

the tuned gradient boosting model is strong and simple
the stacked ensemble (gradient boosting + random forest + logistic regression + meta-learner) is marginally better
the deployment pipeline must be maintainable by a two-person team

Artifact Packet¶

Read this packet before you decide:

candidate	val ROC AUC	val avg precision	inference time (1k rows)	pipeline components	weekly retrain time
`tuned_gbm`	0.847	0.612	0.3s	1 model + preprocessor	4 min
`stacked_ensemble`	0.861	0.638	1.7s	4 models + meta-learner + preprocessor	22 min

Stability across 5 CV folds:

candidate	fold 1	fold 2	fold 3	fold 4	fold 5	std
`tuned_gbm`	0.841	0.852	0.839	0.849	0.854	0.006
`stacked_ensemble`	0.867	0.858	0.843	0.871	0.866	0.010

Weak slice (new_customers, first 30 days):

candidate	ROC AUC	avg precision
`tuned_gbm`	0.791	0.489
`stacked_ensemble`	0.802	0.507

Decision Prompt¶

Write the note before you open the reveal.

Your note should answer:

Which candidate would you deploy?
Is the ensemble gain large enough to justify the operational cost?
What gain threshold would change your mind?
Does the weak slice difference matter enough to override the simplicity argument?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like¶

it acknowledges the ensemble wins on every metric but questions whether the margin justifies the cost
it considers the maintenance burden for a two-person team
it notes the ensemble has higher fold variance (0.010 vs 0.006)
it names a concrete scenario where the ensemble becomes worth it (e.g., if the weak slice gap were 5+ points instead of 1.1)
it is willing to ship the simpler model when the gain is marginal

Common Wrong Moves¶

choosing the ensemble because it is best on every metric without mentioning cost
rejecting the ensemble by saying "simple is always better" without engaging with the numbers
ignoring the fold stability difference
treating the weak slice delta of 1.1 points as decisive when it is within noise

Run The Clinic In Browser¶

Validate Your Decision In Browser¶

Reference Reveal¶

Open only after you write the note

The reference choice is: - `selected_candidate = tuned_gbm` - `decision = deploy the simpler model` Why: - the ROC AUC gap is 0.014 and the average precision gap is 0.026, both marginal - the ensemble is 5.7x slower at inference and 5.5x slower to retrain - the ensemble has higher fold variance, meaning its advantage is less stable - the weak slice improvement (1.1 points ROC AUC) is within the fold noise - a two-person team maintaining a 4-model stacked pipeline creates operational risk that outweighs a marginal accuracy gain The ensemble becomes worth it when: the gain on the critical slice is large and stable (not 1.1 points), or the inference budget is not a constraint, or the team grows enough to absorb the complexity. The practical lesson: a marginal gain that costs real operational complexity is usually not worth shipping.

What To Do Next¶

After this clinic:

open Ensemble Methods
run the matching ensemble comparison example
use scikit-learn Validation and Tuning for the full model selection workflow