
Clinic 19

Cluster Stability

KMeans on a marketing dataset gave you k=5 with a satisfying elbow. A teammate reran it with a different seed and got k=4 with a better silhouette. Before you pick, decide whether either clustering is real.

Situation

Two Seeds, Two Answers

Seed 0 says k=5. Seed 1 says k=4. Both have "okay" silhouettes. The business team wants one answer by Friday.

Your Job

Measure Stability

Stop eyeballing single-seed plots. Pick the clustering the data supports — or report that the data does not support one.

Bad Habit To Avoid

Committing To One Run

A cluster count from one seed is a draw from a distribution. Report the distribution, not the draw.

Situation

A marketing team wants a segmentation of 40,000 customers. The feature set is 22 numeric columns (tenure, spend, engagement signals) after scaling.

You and a teammate ran the same pipeline:

  • you: KMeans(n_clusters=5, random_state=0) → silhouette 0.31
  • teammate: KMeans(n_clusters=4, random_state=7) → silhouette 0.34

The teammate wants to ship k=4. You want to ship k=5. Both answers are from a single seed, a single k, and a single silhouette number. Neither clustering has been rerun with different seeds. Neither has been compared against a random baseline.

The deadline is Friday. The business team wants one segmentation to build campaigns on — and they are not going to rerun it next quarter unless something is visibly broken.

Artifact Packet

Standard stability diagnostics:

| diagnostic | what it measures | trustworthy threshold (rough) |
| --- | --- | --- |
| silhouette score | how well-separated clusters are, per point | ≥ 0.30 for "visible structure" |
| Davies-Bouldin | within-cluster scatter relative to between-cluster separation | lower is better; < 1.0 is good |
| Calinski-Harabasz | between/within variance ratio | higher is better; no absolute threshold, compare across k on the same data |
| ARI across seeds | how similar two clusterings from different seeds are | ≥ 0.7 for "stable" |
| gap statistic | log within-cluster dispersion vs. its expectation on uniform-random reference data | positive gap at the chosen k |
| consensus matrix | co-cluster frequency across many seeds | values near 1 inside the diagonal blocks (rows sorted by cluster), near 0 elsewhere |
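
A minimal sketch of the three internal scores from the table, assuming scikit-learn is installed. The `make_blobs` data and the `internal_scores` helper are illustrative stand-ins, not the clinic's dataset or runner.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for the scaled customer matrix (assumption, not real data).
X, _ = make_blobs(n_samples=600, n_features=5, centers=3, random_state=0)


def internal_scores(X, k, seed):
    """Fit KMeans once and return the three internal validity scores."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    return {
        "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # >= 0, lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


scores = {k: internal_scores(X, k, seed=0) for k in (2, 3, 4, 5)}
for k, s in scores.items():
    print(k, {name: round(float(v), 3) for name, v in s.items()})
```

All three scores describe a single fitted labeling, so on their own they say nothing about whether a different seed would produce the same segments.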

Quick facts about the current pipeline:

  • neither teammate swept k — you picked 5, they picked 4, each because "the elbow looked right"
  • neither measured the ARI between the two clusterings (i.e., how much do the two answers even agree?)
  • the gap statistic was not computed, so neither of you knows whether either clustering is tighter than what KMeans would find in structureless random data
  • KMeans was run with n_init=10 in both cases (the default before scikit-learn 1.4; newer versions default to n_init='auto', a single k-means++ initialization, unless you set it), so the "seed" question is about which 10 initializations each run drew, not about a single bad init
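
The gap statistic that is missing from the pipeline can be sketched as follows. This is a simplified version under stated assumptions: the reference data is uniform over the bounding box of the features, KMeans `inertia_` stands in for the within-cluster dispersion, and `gap_statistic` and the toy blobs are illustrative names, not part of the clinic runner.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def log_dispersion(X, k, seed):
    """Log of the within-cluster sum of squares for one KMeans fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(km.inertia_)


def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = mean log-dispersion on uniform reference data minus log-dispersion on X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_logs = np.array([
        log_dispersion(rng.uniform(lo, hi, size=X.shape), k, seed)
        for _ in range(n_refs)
    ])
    gap = ref_logs.mean() - log_dispersion(X, k, seed)
    s_k = ref_logs.std() * np.sqrt(1 + 1 / n_refs)  # spread of the reference estimate
    return gap, s_k


# Toy data with three well-separated groups (assumption for illustration only).
X, _ = make_blobs(
    n_samples=300, centers=[(-6, -6), (0, 6), (6, -6)], cluster_std=0.8, random_state=1
)
gaps = {k: gap_statistic(X, k) for k in range(2, 6)}
for k, (gap, s_k) in gaps.items():
    print(f"k={k}  gap={gap:.2f}  s_k={s_k:.2f}")
```

A gap near zero or negative at every k is the signal that KMeans is carving up data that looks no more clustered than uniform noise.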

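Rough ARI between the two existing clusterings, computed after the fact: 0.52. ARI is chance-corrected: 1.0 means identical partitions, and values near 0.0 mean agreement no better than random labeling. It is not a percentage of customers, and two partitions with different k can never reach 1.0, but 0.52 still says the two segmentations disagree about where a large share of customers belong. That is a weak answer.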
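
A short illustration of how ARI behaves, using scikit-learn's `adjusted_rand_score`. The label arrays are made up for the example.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Same partition, different cluster ids: ARI ignores how clusters are numbered.
labels_a = np.array([0, 0, 1, 1, 2, 2])
labels_b = np.array([2, 2, 0, 0, 1, 1])
same_partition = adjusted_rand_score(labels_a, labels_b)

# Merging two clusters (3 groups vs. 2) lowers ARI even though nothing was "shuffled".
labels_merged = np.array([0, 0, 1, 1, 1, 1])
merged = adjusted_rand_score(labels_a, labels_merged)

# Independent random labelings with k=5 and k=4 land near zero.
rng = np.random.default_rng(0)
random_5 = rng.integers(0, 5, size=10_000)
random_4 = rng.integers(0, 4, size=10_000)
chance = adjusted_rand_score(random_5, random_4)

print(same_partition, round(merged, 3), round(chance, 4))
```

Because a k=4 and a k=5 labeling differ by at least one split or merge, some of the gap between 0.52 and 1.0 comes from the different k and some from genuine disagreement. Comparing runs at the same k across seeds isolates the second part.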

Decision Prompt

Write a six-sentence defense that answers:

  1. What is the right procedure before picking k=4 or k=5 (not the pick itself)?
  2. What ARI threshold across seeds would make you trust a given k?
  3. If the gap statistic is flat between k=3 and k=6, what do you recommend to the business team?
  4. How do you explain "the data does not strongly support any k" without losing the project?
  5. What would make you pick a k that is not the highest-silhouette k?
  6. Would you use hierarchical clustering or a different algorithm as a tiebreaker? Why or why not?

Strong Reasoning Looks Like

  • runs a full stability sweep: for each k ∈ {2..10}, run 20 seeds and compute silhouette, Davies-Bouldin, gap statistic, and pairwise ARI between seeds; shortlist the k values that clear the stability floor first, then compare silhouette among them
  • reports the ARI across seeds for the chosen k, not a single-seed silhouette
  • uses the gap statistic to rule out the "data has no real clusters" case — if the gap peaks at k=2 and plateaus, don't sell the business team a k=5 story
  • picks a k one step below the silhouette max if the ARI at the max is unstable (e.g., k=5 has higher silhouette but ARI 0.6; k=4 has silhouette 0.34 and ARI 0.84 — ship k=4)
  • considers a second algorithm (agglomerative, GMM) as a cross-check; if the two algorithms agree on k and the segmentation, the call is real
  • is willing to report "the data does not support a clean segmentation — use the top-level 2-way split for campaigns and drop below-threshold sub-segments" when the stability analysis fails
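
The sweep described in the list above can be sketched as below. It is a scaled-down version (4 values of k, 5 seeds, toy data) so it runs quickly; `stability_sweep` and the blob data are illustrative assumptions, and a real run would use k ∈ {2..10} and 20 seeds.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score


def stability_sweep(X, ks, seeds):
    """For each k, fit KMeans once per seed and summarize pairwise ARI and silhouette."""
    rows = []
    for k in ks:
        runs = [
            KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
            for s in seeds
        ]
        aris = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
        sils = [silhouette_score(X, labels) for labels in runs]
        rows.append({
            "k": k,
            "mean_ari": float(np.mean(aris)),
            "min_ari": float(np.min(aris)),
            "mean_silhouette": float(np.mean(sils)),
        })
    return rows


# Toy data with three clear groups (assumption; replace with the scaled feature matrix).
X, _ = make_blobs(
    n_samples=300, centers=[(-6, -6), (0, 6), (6, -6)], cluster_std=0.8, random_state=0
)
results = stability_sweep(X, ks=range(2, 6), seeds=range(5))
for row in results:
    print(row)
```

On 40,000 rows the full silhouette is quadratic in the number of points; `silhouette_score` accepts a `sample_size` argument that scores a random subsample instead.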

Common Wrong Moves

  • picking k by single-seed silhouette and treating the number as the ground truth
  • ignoring seed variance because "KMeans converges anyway" — it does, but to different local optima
  • running one seed and stopping because the elbow "looked clear"
  • picking the k with the best silhouette even when ARI across seeds is 0.55 — you are shipping a seed, not a segmentation
  • omitting the gap statistic and missing the case where no clustering is better than random
  • shipping the k=5 segmentation on Friday because "the deadline is Friday" without flagging the 0.52 ARI to the business team
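
One way to see "shipping a seed, not a segmentation" at the customer level is the consensus matrix from the diagnostics table. The sketch below is an assumption-laden toy: `consensus_matrix` is an illustrative helper, the data is synthetic, and KMeans is deliberately run with a single random initialization per seed so that seed variance is visible.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def consensus_matrix(label_runs):
    """Fraction of runs in which each pair of points shares a cluster."""
    runs = np.asarray(label_runs)                   # shape (n_runs, n_points)
    together = runs[:, :, None] == runs[:, None, :]  # shape (n_runs, n_points, n_points)
    return together.mean(axis=0)


# Three true groups, but we ask for k=4, so one group must be split arbitrarily.
X, _ = make_blobs(
    n_samples=200, centers=[(-6, -6), (0, 6), (6, -6)], cluster_std=0.8, random_state=0
)
label_runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=s).fit_predict(X)
    for s in range(10)
]
C = consensus_matrix(label_runs)

# Pairs that are neither reliably together nor reliably apart.
ambiguous = float(((C > 0.1) & (C < 0.9)).mean())
print(f"share of ambiguous pairs: {ambiguous:.3f}")
```

The matrix is n × n, so for the full 40,000 customers it would need to be computed on a subsample.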

Run The Clinic In Browser

Use the runner to sweep k, compute ARI across 20 seeds per k, plot the stability-vs-k curve, and surface the gap statistic.

Reference Reveal

Open only after you write the defense

The reference procedure is a **stability sweep**, not a single-seed pick:

1. for `k` in `{2..10}`: run `KMeans(n_init=10)` with 20 different `random_state` values and collect the cluster labels for each run
2. compute pairwise ARI across the 20 runs for each `k`; report mean and min
3. compute silhouette and gap statistic for each `k`
4. pick the highest `k` where **mean ARI ≥ 0.8 AND gap statistic > 0 AND silhouette ≥ 0.25**, checked in that priority order

On the artifact packet above, the expected findings:

| k | silhouette | mean ARI (20 seeds) | gap statistic | call |
| --- | --- | --- | --- | --- |
| 2 | 0.28 | **0.94** | 0.18 | stable, weak silhouette |
| 3 | 0.30 | 0.82 | 0.21 | stable, reasonable silhouette |
| 4 | **0.34** | 0.71 | 0.15 | borderline stability |
| 5 | 0.31 | 0.56 | 0.09 | **unstable; reject** |
| 6 | 0.27 | 0.48 | 0.04 | unstable |
| 7+ | < 0.27 | < 0.45 | ≤ 0 | noise |

The reference call is **k=3**. Why:

- k=3 is the highest k where mean ARI ≥ 0.8 (a stability floor) and the gap statistic is positive
- k=4 has the highest silhouette, but a mean ARI of 0.71 means the partition is far from identical across seeds (loosely, about 30% of the structure reshuffles); that is too fragile for a quarterly campaign build
- k=5 (your pick) and k=4 (teammate's pick) *both* fall below the stability floor; the business team should ship at k=3

What to tell the business team:

- "The data supports 3 segments robustly. 4 segments is the silhouette-best but reshuffles ~30% of the middle two segments across runs; building campaigns on that will feel inconsistent."
- "If 3 segments is too coarse, the right next move is more features or a longer time window, not more clusters."
- "Here are descriptive names for the 3 segments, with N and key features, and here's the top-level cross-tab of how they map to your current customer profiles."
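
The selection rule from the reference procedure can be checked against the expected-findings table with a few lines of plain Python. The `pick_k` helper is an illustrative name; the numbers are copied from the table, and the 7+ row is left out because it only gives bounds.

```python
# Expected findings for k = 2..6, copied from the table in the reference reveal.
findings = [
    {"k": 2, "silhouette": 0.28, "mean_ari": 0.94, "gap": 0.18},
    {"k": 3, "silhouette": 0.30, "mean_ari": 0.82, "gap": 0.21},
    {"k": 4, "silhouette": 0.34, "mean_ari": 0.71, "gap": 0.15},
    {"k": 5, "silhouette": 0.31, "mean_ari": 0.56, "gap": 0.09},
    {"k": 6, "silhouette": 0.27, "mean_ari": 0.48, "gap": 0.04},
]


def pick_k(rows, min_ari=0.8, min_gap=0.0, min_silhouette=0.25):
    """Return the highest k that clears all three thresholds, or None if none does."""
    passing = [
        row["k"]
        for row in rows
        if row["mean_ari"] >= min_ari
        and row["gap"] > min_gap
        and row["silhouette"] >= min_silhouette
    ]
    return max(passing) if passing else None


chosen = pick_k(findings)
print(chosen)  # 3

# With a stricter stability floor nothing passes, which maps to the "no clusters" report.
nothing = pick_k(findings, min_ari=0.95)
```

Both k=2 and k=3 clear the thresholds; the rule takes the larger one because it gives the business team more segments without giving up stability.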
Escalation if the stability sweep also fails at k=3:

- the data has no discrete clusters; it is a continuous space
- the right tool is not clustering but **scoring models**: predict a target (churn, LTV, engagement) and segment by predicted percentile
- tell the business team clustering is the wrong frame for their question

The practical lesson: **clustering is a hypothesis test, not a rendering**. A single-seed KMeans plot is a guess. A stability sweep is a decision. If the stability sweep fails, the honest answer is "no clusters", and that is usually a better story than a fragile one.
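
Segmenting by predicted percentile, as in the escalation path, might look like the sketch below. The scores here are random numbers standing in for the output of a fitted model (for example a churn probability); the model itself and the choice of five bands are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
predicted_score = rng.random(1000)  # stand-in for a model's predicted churn/LTV score

# Cut the scores into five equal-sized bands at the 20th, 40th, 60th, and 80th percentiles.
edges = np.quantile(predicted_score, [0.2, 0.4, 0.6, 0.8])
segment = np.digitize(predicted_score, edges)  # 0 = lowest fifth, 4 = highest fifth

counts = np.bincount(segment, minlength=5)
print(counts)
```

Unlike cluster labels, these bands are ordered and are fixed by the score alone, so rerunning the cut on the same scores gives the same segments.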

What To Do Next

  1. open Clustering and Low-Dimensional Views for the elbow/silhouette/gap workflow
  2. open Advanced Clustering and Dimensionality Reduction for hierarchical and density-based alternatives
  3. open Manifold Choice — the adjacent clinic on projection-method selection
  4. run the stability sweep on your own data; the ARI-across-seeds curve is the single most informative diagnostic for choosing k