Semi-Supervised Learning

A small pool of labeled examples plus a large pool of unlabeled examples is not "not enough labels" — it is a specific regime with its own named techniques. Use them instead of throwing the unlabeled data away.

What This Is

You have n_L labeled rows and n_U unlabeled rows, with n_U >> n_L. A supervised model trained on the labeled rows alone hits a ceiling fast. Semi-supervised methods treat the unlabeled pool as a second source of signal — either by pseudo-labeling it, by using it to shape the decision surface, or by pretraining a representation on it.

The three families you should know by name:

Self-training (pseudo-labeling). Train on labeled data. Predict on unlabeled. Keep predictions whose confidence clears a threshold. Add those pseudo-labeled rows to the training set. Retrain. Iterate until nothing new is added or quality on a held-out labeled set starts dropping.
Label propagation / label spreading. Build a similarity graph over all points (labeled + unlabeled). Let labels flow along the graph edges until they stabilize. The unlabeled points inherit labels from their neighborhood structure. Works well when the smoothness (cluster) assumption holds: points that are close in feature space, or connected through a dense region of the data manifold, share a label.
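Consistency regularization. Push the model to produce similar outputs on two perturbed versions of the same unlabeled input. Methods in this family include the Π-model, MixMatch, and FixMatch (the latter two also build pseudo-label-style targets from the model's own predictions). Often stronger than plain self-training when meaningful augmentations exist for the data, as they do for images.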
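To make the consistency idea concrete, here is a minimal sketch of a Π-model-style penalty: the squared difference between the class probabilities a model produces for two perturbed views of the same unlabeled batch. The function names and the random logits are illustrative assumptions, not part of any library; in a real setup this term would be added, with a weight, to the supervised loss on the labeled batch.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def consistency_loss(logits_a, logits_b):
    # Mean over rows of the squared distance between the two
    # predicted class distributions (hypothetical helper).
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    return float(np.mean(np.sum((p_a - p_b) ** 2, axis=1)))

# Stand-in for model outputs on two augmentations of 8 unlabeled rows.
rng = np.random.default_rng(0)
logits_view1 = rng.normal(size=(8, 3))
logits_view2 = logits_view1 + 0.1 * rng.normal(size=(8, 3))

print(consistency_loss(logits_view1, logits_view1))  # identical views
print(consistency_loss(logits_view1, logits_view2))  # perturbed views
```

The penalty needs no labels, which is why it can be computed on the whole unlabeled pool.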

A fourth move that is not strictly SSL but often beats it: train a self-supervised representation on all the unlabeled data, then fit a linear head on the small labeled set. See Self-Supervised and Representation Learning.

When You Use It

labeled data is expensive or slow to collect, unlabeled is cheap
the supervised model plateaus fast and you suspect it is label-limited, not capacity-limited
the unlabeled distribution matches the labeled one (same population, same feature definitions)

Do Not Use It When

labels are cheap — just collect more
the labeled and unlabeled pools come from different distributions (pseudo-labels will drift)
your supervised model is already near its Bayes-error floor — SSL cannot invent signal that is not there
you cannot hold out a clean labeled set for validation — without it you cannot tell if pseudo-labels are helping or hurting

Pseudo-Labeling In Practice

The core loop:

split labeled data into train / val
train model on train
predict on unlabeled; keep rows where max(softmax) >= τ (often τ ≈ 0.9)
add kept rows with their predicted label to the training set
retrain from scratch (or warm-start)
evaluate on val; if val improved, iterate; if it dropped, stop and use the previous round's model
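The steps above can be sketched with scikit-learn as follows. This is an illustrative loop, not the lab script: the two-moons data, the logistic regression model, the pool sizes, and the 10-round cap are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy data: 60 labeled rows, the rest treated as unlabeled (assumption).
X_all, y_all = make_moons(n_samples=1200, noise=0.1, random_state=0)
X_lab, X_unl, y_lab, _ = train_test_split(
    X_all, y_all, train_size=60, stratify=y_all, random_state=0)
# Step 1: split the labeled data into train / val.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_lab, y_lab, test_size=0.5, stratify=y_lab, random_state=0)

tau = 0.9
# Step 2: train on the labeled train split.
model = LogisticRegression().fit(X_tr, y_tr)
best_model, best_val = model, model.score(X_val, y_val)
history = [best_val]

for round_idx in range(10):
    if len(X_unl) == 0:
        break
    # Step 3: keep unlabeled rows whose top class probability clears tau.
    proba = model.predict_proba(X_unl)
    keep = proba.max(axis=1) >= tau
    if not keep.any():
        break  # nothing new to add
    # Step 4: add the kept rows with their predicted labels.
    pseudo = model.classes_[proba[keep].argmax(axis=1)]
    X_tr = np.vstack([X_tr, X_unl[keep]])
    y_tr = np.concatenate([y_tr, pseudo])
    X_unl = X_unl[~keep]
    # Step 5: retrain from scratch on the enlarged training set.
    model = LogisticRegression().fit(X_tr, y_tr)
    # Step 6: the val set stays purely human-labeled.
    val_acc = model.score(X_val, y_val)
    history.append(val_acc)
    if val_acc < best_val:
        break  # val dropped: keep the previous round's model
    best_model, best_val = model, val_acc

print("val accuracy per round:", history)
```

With a validation set this small, a single row changes the score noticeably, so treat one-round differences with caution.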

The threshold τ matters. Too low and you inject label noise. Too high and no new rows make it in. A common pattern is to start high and lower over rounds as the model gets more confident.
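One simple way to express the "start high, then lower" pattern is a linear schedule over rounds. The function name and the default start value, end value, and round count below are hypothetical, not recommendations.

```python
def tau_schedule(round_idx, tau_start=0.95, tau_end=0.80, n_rounds=10):
    # Linearly interpolate from tau_start (round 0) to tau_end
    # (round n_rounds - 1); later rounds stay at tau_end.
    frac = min(round_idx, n_rounds - 1) / max(n_rounds - 1, 1)
    return tau_start + frac * (tau_end - tau_start)

for r in (0, 3, 6, 9):
    print(r, round(tau_schedule(r), 3))
```

Whatever schedule is used, the held-out validation check still decides whether a round's additions are kept.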

Label Propagation In Practice

from sklearn.semi_supervised import LabelSpreading

clf = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
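# X: all rows, labeled and unlabeled, shape (n_samples, n_features)
# y: class id for labeled rows, -1 for unlabeled rows
clf.fit(X, y)
# Label assigned to every row, including the previously unlabeled ones.
# With alpha > 0 the clamping is soft, so a labeled row can end up with a different label.
y_pred = clf.transduction_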

The hard part is the graph. Euclidean distance on raw features often does not represent semantic similarity. Scaling, PCA, or a learned embedding usually helps.
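As a sketch of the scaling point, the example below puts a StandardScaler in front of LabelSpreading so the k-nearest-neighbor graph is built on standardized features. The two-moons data, the deliberately inflated second feature, and the choice of 5 labeled rows per class are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.semi_supervised import LabelSpreading

X, y_true = make_moons(n_samples=400, noise=0.08, random_state=0)
X = X * np.array([1.0, 50.0])  # mis-scaled second feature (assumption)

# Hide all but 5 labels per class; -1 marks an unlabeled row.
rng = np.random.default_rng(0)
y = np.full(len(X), -1)
labeled_idx = np.concatenate([
    rng.choice(np.flatnonzero(y_true == c), size=5, replace=False)
    for c in (0, 1)
])
y[labeled_idx] = y_true[labeled_idx]

# The scaler runs first, so neighbors are found in standardized space.
pipe = make_pipeline(
    StandardScaler(),
    LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2),
)
pipe.fit(X, y)

y_pred = pipe[-1].transduction_
unlabeled = y == -1
acc = (y_pred[unlabeled] == y_true[unlabeled]).mean()
print("accuracy on originally unlabeled rows:", acc)
```

Swapping the scaler for PCA or a learned embedding follows the same pattern: any transformer placed before the final step changes the space the graph is built in.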

What To Inspect

labeled val accuracy vs round — should climb, then plateau, then drop if you overrun
pseudo-label class distribution: if one class's share of the pseudo-labels is far above its share of the labeled set (for example, it receives 80% of pseudo-labels without being the majority class), drift is happening
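pseudo-label confidence distribution: a big peak right around the threshold means τ sits on a cliff, where a small change admits or excludes many rows; try nearby values of τ and compare the kept count and val accuracy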
agreement between rounds — if round N flips many labels that round N-1 assigned, the model is not stabilizing
held-out error on the original labeled test split — this is the only number that matters at the end
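Several of these checks can be computed directly from the predicted probabilities on the unlabeled pool. The helper below is a hypothetical sketch; it assumes the columns of proba are ordered by class index 0..K-1 and that prev_labels holds the previous round's predicted classes for the same rows.

```python
import numpy as np

def pseudo_label_report(proba, tau, prev_labels=None, n_bins=10):
    conf = proba.max(axis=1)        # top-class confidence per row
    labels = proba.argmax(axis=1)   # predicted class index per row
    keep = conf >= tau
    # Share of each class among the rows that would be pseudo-labeled.
    counts = np.bincount(labels[keep], minlength=proba.shape[1])
    class_share = counts / max(int(keep.sum()), 1)
    # Confidence histogram: look for a spike around tau.
    conf_hist, _ = np.histogram(conf, bins=n_bins, range=(0.0, 1.0))
    # Fraction of rows whose predicted class changed since last round.
    flip_rate = None
    if prev_labels is not None:
        flip_rate = float(np.mean(labels != prev_labels))
    return {"n_kept": int(keep.sum()), "class_share": class_share,
            "conf_hist": conf_hist, "flip_rate": flip_rate}

proba = np.array([[0.95, 0.05], [0.60, 0.40], [0.10, 0.90], [0.92, 0.08]])
report = pseudo_label_report(proba, tau=0.9, prev_labels=np.array([0, 1, 1, 0]))
print(report)
```

Logging this report every round, next to the validation accuracy, makes it easier to see which round a problem started in.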

Failure Patterns

confirmation bias. The model is confident about wrong predictions; those get pseudo-labeled in; the model gets more confident about the same wrong thing. The pool poisons itself. Symptom: val accuracy climbs for a round or two then drops.
class imbalance amplification. The majority class starts slightly over-represented in pseudo-labels; next round the minority classes shrink; eventually you have a majority-class-predictor that looks confident. Fix: per-class thresholds or balanced sampling.
distribution shift. Unlabeled comes from a slightly different source (time, geography, sensor). Pseudo-labels push the model toward the new distribution, which hurts on the original labeled test.
no validation signal. You ran 10 rounds and picked "the last one" without checking val. You have no evidence it is better than round 0.
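For the class imbalance amplification case, one possible form of balanced sampling is to admit the same number of pseudo-labeled rows per class each round, taking the most confident ones first. The function below is a hypothetical sketch; equal quotas assume the true classes are roughly balanced, and the quota would need to follow the known class prior otherwise.

```python
import numpy as np

def select_balanced(proba, tau):
    conf = proba.max(axis=1)
    labels = proba.argmax(axis=1)  # assumes columns are class 0..K-1
    # Candidate rows per class: predicted as that class and above tau.
    candidates = [np.flatnonzero((labels == c) & (conf >= tau))
                  for c in range(proba.shape[1])]
    # Every class gets the same quota, set by the scarcest class.
    quota = min(len(idx) for idx in candidates)
    chosen = [idx[np.argsort(-conf[idx])[:quota]] for idx in candidates]
    return np.sort(np.concatenate(chosen)), labels

proba = np.array([[0.99, 0.01], [0.95, 0.05], [0.91, 0.09],
                  [0.08, 0.92], [0.55, 0.45]])
rows, labels = select_balanced(proba, tau=0.9)
print(rows, labels[rows])
```

If any class has no candidate above the threshold, the quota is zero and nothing is added that round, which is a conservative default.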

Quick Checks

before pseudo-labeling: is there a clean labeled val set that will NOT be contaminated?
is the unlabeled pool from the same distribution as the labeled pool?
is the initial supervised model well-calibrated enough for confidence thresholds to be meaningful?
does each round of pseudo-labeling use a fresh model (trained from scratch) or warm-start? (either is fine, but know which)
is τ fixed or annealed across rounds?
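The calibration question can be checked numerically on the labeled val set before any pseudo-labeling. A common summary is expected calibration error (ECE): bin predictions by confidence and compare each bin's average confidence with its accuracy. The implementation below is a minimal sketch with an assumed equal-width binning; y_true is assumed to hold class indices matching the columns of proba.

```python
import numpy as np

def expected_calibration_error(proba, y_true, n_bins=10):
    conf = proba.max(axis=1)
    correct = (proba.argmax(axis=1) == y_true).astype(float)
    # Equal-width confidence bins over [0, 1].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            # Gap between accuracy and confidence, weighted by bin size.
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return float(ece)

# Overconfident toy case: 90% confidence, but only half the rows correct.
proba = np.array([[0.9, 0.1]] * 4)
y_true = np.array([0, 0, 1, 1])
print(expected_calibration_error(proba, y_true))
```

A large gap in the high-confidence bins is the relevant warning sign here, because those are exactly the rows a threshold near 0.9 would admit.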

Practice

Run academy/labs/semi-supervised-learning/src/ssl_workflow.py:

it generates a synthetic 2D two-moons dataset with only 20 labeled points and 2000 unlabeled
it compares four strategies: labeled-only logistic regression, self-training with τ=0.9, self-training with per-class balanced thresholds, and label spreading
it reports accuracy on a held-out clean labeled set and the round-by-round val trajectory

After running it, you should be able to explain when pseudo-labeling helps, when it hurts, and how confirmation bias looks in the round-by-round curve.

Longer Connection

Semi-supervised learning is one answer to "labels are expensive." The others:

active learning — ask a human to label the points the model is most uncertain about, not random ones
weak supervision — write heuristic labeling functions over the unlabeled pool and denoise them
self-supervised pretraining — train a representation on the unlabeled pool with no labels, then linear-probe

These are not mutually exclusive. A production setup often stacks self-supervised pretraining + pseudo-labeling + active-learning acquisition. The exam setting typically gives you one labeled pool and one unlabeled pool and asks you to do the best you can — in that case, start with pseudo-labeling on a well-calibrated model and a real held-out val set.