Skip to content

Clinic 06

Ensemble Temptation

The stacked ensemble is the best thing on the board. But inference is five times slower, the pipeline is three times more complex, and the gain over the single model is small. Is it worth it?

Situation

Tiny Gain, Heavy Cost

The ensemble wins every metric by a small margin. But the operational cost is not small.

Your Job

Ship Simple Or Ship Complex

Decide whether the gain justifies the cost. Name the threshold where the ensemble becomes worth it.

Bad Habit To Avoid

Best Score = Best Model

If the decision ignores inference cost, maintenance burden, and failure modes, the comparison is incomplete.

Situation

You are building a customer churn predictor for a weekly batch scoring job. Two candidates survived the validation phase.

The packet says:

  • the tuned gradient boosting model is strong and simple
  • the stacked ensemble (gradient boosting + random forest + logistic regression + meta-learner) is marginally better
  • the deployment pipeline must be maintainable by a two-person team

Artifact Packet

Read this packet before you decide:

candidate val ROC AUC val avg precision inference time (1k rows) pipeline components weekly retrain time
tuned_gbm 0.847 0.612 0.3s 1 model + preprocessor 4 min
stacked_ensemble 0.861 0.638 1.7s 4 models + meta-learner + preprocessor 22 min

Stability across 5 CV folds:

candidate fold 1 fold 2 fold 3 fold 4 fold 5 std
tuned_gbm 0.841 0.852 0.839 0.849 0.854 0.006
stacked_ensemble 0.867 0.858 0.843 0.871 0.866 0.010

Weak slice (new_customers, first 30 days):

candidate ROC AUC avg precision
tuned_gbm 0.791 0.489
stacked_ensemble 0.802 0.507

Decision Prompt

Write the note before you open the reveal.

Your note should answer:

  1. Which candidate would you deploy?
  2. Is the ensemble gain large enough to justify the operational cost?
  3. What gain threshold would change your mind?
  4. Does the weak slice difference matter enough to override the simplicity argument?

Keep the note short. Four to six sentences is enough.

Strong Reasoning Looks Like

  • it acknowledges the ensemble wins on every metric but questions whether the margin justifies the cost
  • it considers the maintenance burden for a two-person team
  • it notes the ensemble has higher fold variance (0.010 vs 0.006)
  • it names a concrete scenario where the ensemble becomes worth it (e.g., if the weak slice gap were 5+ points instead of 1.1)
  • it is willing to ship the simpler model when the gain is marginal

Common Wrong Moves

  • choosing the ensemble because it is best on every metric without mentioning cost
  • rejecting the ensemble by saying "simple is always better" without engaging with the numbers
  • ignoring the fold stability difference
  • treating the weak slice delta of 1.1 points as decisive when it is within noise

Run The Clinic In Browser

Validate Your Decision In Browser

Reference Reveal

Open only after you write the note The reference choice is: - `selected_candidate = tuned_gbm` - `decision = deploy the simpler model` Why: - the ROC AUC gap is 0.014 and the average precision gap is 0.026, both marginal - the ensemble is 5.7x slower at inference and 5.5x slower to retrain - the ensemble has higher fold variance, meaning its advantage is less stable - the weak slice improvement (1.1 points ROC AUC) is within the fold noise - a two-person team maintaining a 4-model stacked pipeline creates operational risk that outweighs a marginal accuracy gain The ensemble becomes worth it when: the gain on the critical slice is large and stable (not 1.1 points), or the inference budget is not a constraint, or the team grows enough to absorb the complexity. The practical lesson: a marginal gain that costs real operational complexity is usually not worth shipping.

What To Do Next

After this clinic:

  1. open Ensemble Methods
  2. run the matching ensemble comparison example
  3. use scikit-learn Validation and Tuning for the full model selection workflow