Dimensionality Reduction¶
What This Is¶
Dimensionality reduction projects high-dimensional data into fewer dimensions while preserving as much useful structure as possible. The goal is to simplify the data for visualization, faster modeling, or noise reduction — not to create a black box.
The decision this page forces is narrow:
- which reduction method matches what kind of structure you want to keep, and which downstream step is going to consume the output
PCA for "how much variance is on a few axes"; kernel PCA for "the manifold is curved"; random projections for "I just need cheap compression"; t-SNE / UMAP for "I want a picture."
When You Use It¶
- visualizing high-dimensional clusters or class separations
- removing noisy or redundant features before modeling
- compressing features for faster training
- denoising (PCA reconstruction discards low-variance directions)
- debugging a model by inspecting what the data "looks like" in 2D
- checking whether classes are separable before building a complex model
Do Not Use It When¶
- the downstream model (e.g., tree-based) does not benefit from orthogonal axes
- the reduced dimensionality makes the task harder — compression that discards signal is not denoising, it is destruction
- you need interpretable features — a principal component is a linear blend, not a named concept
- the method is non-deterministic (t-SNE, UMAP) and you want reproducible training features
The Method Menu¶
| Method | Preserves | Linear? | Invertible? | Best for |
|---|---|---|---|---|
| PCA | global variance | yes | yes (approximate) | linear compression, denoising, preprocessing |
| Kernel PCA | variance in a nonlinear feature space | no (kernel trick) | partially | curved manifolds, nonlinear denoising |
| Random projections | pairwise distances (Johnson–Lindenstrauss) | yes | no | cheap compression on very wide data |
| t-SNE | local neighborhood | no | no | 2D visualization of clusters |
| UMAP | local + some global | no | no | 2D/3D viz, faster than t-SNE |
| LLE / Isomap | local manifold geometry | no | no | classical manifold learning demos |
| Autoencoder | reconstruction under a learned code | no | yes (approximate, via the decoder) | learned, nonlinear, task-shaped |
PCA — Start Here¶
PCA finds the orthogonal directions of maximum variance and projects the data onto them. It is linear, fast, approximately invertible (exactly so only when every component is kept), and the first thing to try.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total: {pca.explained_variance_ratio_.sum():.3f}")
Choosing n_components — explained variance vs. reconstruction¶
Two honest ways to pick the component count:
# (1) keep a variance fraction
pca = PCA(n_components=0.95).fit(X_scaled)
print(f"Components kept: {pca.n_components_}")
# (2) keep components until reconstruction error stops dropping
from sklearn.metrics import mean_squared_error
import numpy as np
errs = []
for k in [1, 2, 4, 8, 16, 32, 64]:  # each k must be <= min(n_samples, n_features)
    p = PCA(n_components=k).fit(X_scaled)
    # project down to k components, then map back to the original space
    X_rec = p.inverse_transform(p.transform(X_scaled))
    errs.append(mean_squared_error(X_scaled, X_rec))
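Explained variance is the faster pick, but it is blind to whether the variance it counts is signal. On the rows PCA was fitted to, reconstruction error adds nothing new: the MSE is the unexplained variance in different units, so the two curves mirror each other. It becomes informative on held-out rows, where a gap above the training curve shows how much of the captured variance is specific to the training sample. If the two disagree, inspect the task; for prediction, the validation score of the downstream model is the final arbiter.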
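The relationship between the two criteria can be checked numerically. The following is a minimal sketch, assuming scikit-learn's bundled digits dataset as a stand-in for your data and an arbitrary 70/30 split; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the scaler on the training rows only, then apply it to both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Mean per-feature variance of the scaled training data. It sits slightly
# below 1 whenever some features are constant and stay at zero after scaling.
total_var = X_train_s.var(axis=0).mean()

results = []
for k in [1, 2, 4, 8, 16, 32]:
    pca = PCA(n_components=k, svd_solver="full").fit(X_train_s)
    train_mse = mean_squared_error(X_train_s, pca.inverse_transform(pca.transform(X_train_s)))
    test_mse = mean_squared_error(X_test_s, pca.inverse_transform(pca.transform(X_test_s)))
    # Unexplained variance fraction, converted to MSE units.
    unexplained = (1 - pca.explained_variance_ratio_.sum()) * total_var
    results.append((k, train_mse, unexplained, test_mse))
    print(f"k={k:2d}  train MSE={train_mse:.4f}  unexplained={unexplained:.4f}  test MSE={test_mse:.4f}")
```

The train MSE and the unexplained-variance column should agree to floating-point precision, while the held-out column is the one that carries independent information.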
PCA failure modes¶
- the structure is nonlinear (spirals, nested clusters) — PCA's axes miss the manifold
- variance is dominated by noise on a few features — PCA happily projects onto noise
- features live on very different scales: without StandardScaler, PCA mostly recovers whichever features have the largest units
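The scale failure is easy to reproduce. This is a small sketch on synthetic data (the three features and the factor of 1000 are made up for illustration): two correlated signal features on a unit scale, plus one independent noise feature in much larger units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Two strongly correlated "signal" features on a unit scale...
a = rng.normal(size=n)
b = 0.9 * a + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
# ...and one independent noise feature measured in much larger units.
noise = 1000.0 * rng.normal(size=n)
X = np.column_stack([a, b, noise])

# First principal component with and without standardization.
pc1_raw = PCA(n_components=1).fit(X).components_[0]
pc1_scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw:   ", np.round(pc1_raw, 3))
print("PC1 loadings, scaled:", np.round(pc1_scaled, 3))
```

On the raw data the first component should load almost entirely on the noise feature; after scaling it should load on the two correlated features instead.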
Kernel PCA — When The Manifold Is Curved¶
Kernel PCA applies PCA in a nonlinear feature space via the kernel trick. It captures curved structure that standard PCA misses.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X_scaled)
Use it when:
- the classes are not linearly separable in the original space
- you want to preprocess for a linear classifier on a nonlinear problem
- you need a reduction with a well-defined out-of-sample transform (unlike t-SNE / UMAP)
Caveats:
- gamma is the sensitive knob: too small collapses everything, too large produces noise
- fitting scales as O(N²) in memory for the kernel matrix; unsuitable for very large datasets
- inverse transform is approximate and only available when requested at fit time
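To see the "linear classifier on a nonlinear problem" use case concretely, here is a hedged sketch on concentric circles. The dataset parameters and gamma=2.0 are hand-picked assumptions for this toy problem, not recommended defaults, and `linear_accuracy` is a hypothetical helper defined only for this example.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

def linear_accuracy(reducer):
    """Fit the reducer on train, then score a linear classifier on test."""
    Z_train = reducer.fit_transform(X_train)
    Z_test = reducer.transform(X_test)  # out-of-sample transform on unseen rows
    clf = LogisticRegression().fit(Z_train, y_train)
    return clf.score(Z_test, y_test)

acc_pca = linear_accuracy(PCA(n_components=2))
# gamma=2.0 is tuned by hand for this toy data; treat it as an assumption.
acc_kpca = linear_accuracy(KernelPCA(n_components=2, kernel="rbf", gamma=2.0))

print(f"linear classifier after PCA:        {acc_pca:.2f}")
print(f"linear classifier after kernel PCA: {acc_kpca:.2f}")
```

Linear PCA only rotates the two input axes, so the circles stay inseparable by a line; the RBF kernel PCA projection should pull the inner and outer rings apart along its leading component. In practice gamma would be chosen by cross-validating the downstream score.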
Random Projections — Cheap Compression At Scale¶
The Johnson–Lindenstrauss lemma says you can project N points from any dimension down to O(log N / ε²) dimensions with a random linear map and, with high probability, preserve every pairwise distance to within a factor of 1 ± ε. No training, no SVD.
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim
d = johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.1)
rp = GaussianRandomProjection(n_components=d, random_state=0)
X_rp = rp.fit_transform(X_wide)
When random projections win:
- the feature count is in the tens of thousands (TF-IDF, hashed features, bag-of-words)
- PCA's SVD is too slow or memory-intensive
- any distance-preserving embedding is enough for the downstream task (kNN, linear classifier with StandardScaler)
When they lose: small feature counts and tasks that need variance-structured axes. PCA's first two components can be interpretable; a random-projection axis is random.
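The distance-preservation claim can be checked empirically rather than taken on faith. The sketch below assumes synthetic Gaussian data as a stand-in for wide features and a loose eps=0.3 to keep the target dimension small; it measures how far squared pairwise distances move under the projection.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
n_samples, n_features, eps = 200, 5000, 0.3
X_wide = rng.normal(size=(n_samples, n_features))  # synthetic stand-in for wide data

# Target dimension suggested by the Johnson-Lindenstrauss bound.
d = johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=eps)
X_rp = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X_wide)

# Compare squared pairwise distances before and after (upper triangle only).
iu = np.triu_indices(n_samples, k=1)
d2_before = pairwise_distances(X_wide)[iu] ** 2
d2_after = pairwise_distances(X_rp)[iu] ** 2
ratio = d2_after / d2_before

frac_within = np.mean(np.abs(ratio - 1) <= eps)
print(f"target dim: {d}")
print(f"squared-distance ratio range: [{ratio.min():.3f}, {ratio.max():.3f}]")
print(f"pairs within 1 +/- eps: {frac_within:.3%}")
```

Note that the bound depends on the number of samples and eps, not on the original feature count, which is why the method only pays off when the input is much wider than the suggested dimension.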
t-SNE — For Visualization Only¶
t-SNE maps points so that nearby neighbors in high-dimensional space stay close in 2D. It is nonlinear and non-invertible.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X_scaled)
perplexity is the effective neighbor count — ~5 to ~50 typical. Low values emphasize tight local structure, high values broader patterns.
Trust and distrust carefully:
- cluster membership — usually meaningful if clearly separated
- distances between clusters — not meaningful
- cluster sizes and shapes — depend on density, not real geometry
- axes — no fixed interpretation
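One way to act on the list above is to compare runs with different seeds: coordinates should differ, cluster membership should not. This is a minimal sketch on synthetic, well-separated blobs (an easy case chosen on purpose); using KMeans plus the adjusted Rand index as the membership check is an assumption of this example, not part of t-SNE itself.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

X, y = make_blobs(n_samples=150, n_features=10, centers=3, random_state=0)

embeddings, scores = [], []
for seed in (0, 1):
    # init="random" makes the seed dependence visible; a PCA initialization
    # (the default in recent scikit-learn releases) hides much of it.
    emb = TSNE(n_components=2, perplexity=30, init="random", random_state=seed).fit_transform(X)
    # Recover groups from the 2D picture and compare them to the true labels.
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
    embeddings.append(emb)
    scores.append(adjusted_rand_score(y, labels))

print("ARI vs. true labels per seed:", np.round(scores, 3))
print("coordinates identical across seeds:", np.allclose(embeddings[0], embeddings[1]))
```

If membership agreement drops sharply between seeds on your own data, the clusters in the plot are probably not a property of the data.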
UMAP — Faster Alternative¶
UMAP is similar in spirit to t-SNE but faster, handles out-of-sample transform, and usually preserves more global structure.
import umap
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X_scaled)
- n_neighbors: local vs. global balance; higher preserves more global structure
- min_dist: cluster tightness; lower packs clusters tighter
UMAP can be used as a preprocessor for classifiers in a way t-SNE cannot, because it has a sensible transform() for new data — but the reduction is still stochastic per fit, so treat it as one seed in an inspection, not a stable feature.
The Reduction Ladder¶
- Start with PCA to check the variance structure and get a fast linear baseline.
- Try kernel PCA or an autoencoder if the manifold is curved and you want to stay quantitative.
- Reach for random projections only when the feature count is huge and you need speed.
- Reach for t-SNE or UMAP only for visualization — not as preprocessing for classifiers (except carefully with UMAP).
- Always scale first with StandardScaler before any distance-based reduction.
What To Inspect¶
- the explained-variance curve — does it bend sharply at a small k?
- the reconstruction error curve on held-out data — should match the train curve closely
- per-class density in the 2D view: heavy overlap at every scale is a warning sign, although classes that overlap in a 2D projection can still be separable in the full feature space
- stability across seeds — stochastic methods (t-SNE, UMAP, autoencoders) should produce the same story across seeds, even if exact coordinates differ
- the downstream model's score with and without the reduction — the only final judge of whether the reduction kept the signal
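The last check in the list can be sketched as a pair of pipelines scored by cross-validation. This assumes the bundled digits dataset and logistic regression as placeholder data and model; substitute your own.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scaler and PCA live inside the pipeline so each CV fold refits them
# on its own training split; nothing leaks from the validation rows.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
reduced = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),
    LogisticRegression(max_iter=2000),
)

score_full = cross_val_score(baseline, X, y, cv=5).mean()
score_pca = cross_val_score(reduced, X, y, cv=5).mean()
print(f"full features: {score_full:.3f}   PCA(0.95): {score_pca:.3f}")
```

Keeping the reduction inside the pipeline is the design choice that matters here: fitting PCA on all rows before splitting would let validation data shape the components.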
Failure Pattern¶
The canonical failure is using t-SNE output as features for a classifier. t-SNE is a visualization tool — the embedding changes with hyperparameters and seeds, and transform() for new data does not exist.
The second canonical failure is interpreting inter-cluster distances in a t-SNE plot as meaningful. They are not; t-SNE explicitly distorts them.
A third, more subtle failure is PCA on unscaled data. A single feature with a huge scale dominates the first component. Always StandardScaler before PCA unless the scales are already comparable.
Common Mistakes¶
- forgetting to scale before PCA
- using too many PCA components and keeping all the noise
- treating t-SNE clusters as ground truth for labeling
- using t-SNE on very large datasets without downsampling first
- comparing two t-SNE plots with different perplexity values as if they show the same thing
- deploying UMAP features into a production model without locking seed and version
- using random projections on small or well-curated feature sets (PCA beats them there)
Decision: Which Reduction¶
| Goal | Pick | Why |
|---|---|---|
| compress for a downstream classifier | PCA | linear, invertible, stable |
| the manifold is clearly curved | kernel PCA or an autoencoder | nonlinear structure preserved |
| thousands of sparse features, need speed | random projections | cheap, distance-preserving |
| visualize local cluster structure | t-SNE | strongest for tight local groups |
| visualize global + local, or want .transform() | UMAP | faster, slightly more global |
| denoise images / signals with reconstruction | PCA or an autoencoder | reconstruction error is the objective |
Practice¶
- Apply PCA to a dataset and plot the explained-variance curve. How many components capture 90%? How many capture 99%?
- Run PCA with n_components=0.95 and compare the downstream model's CV score to using the full feature set. Did the reduction help, hurt, or tie?
- Build a non-linearly separable toy dataset (concentric circles). Run PCA and kernel PCA with an RBF kernel; plot both. Explain the difference.
- Generate a 50k-row × 20k-feature sparse matrix. Time PCA vs. GaussianRandomProjection to 100 dimensions. Report both times and a downstream kNN score.
- Visualize the same data with t-SNE at perplexity 5, 30, and 100. Name what each emphasizes and what it distorts.
- Compare PCA-2D, UMAP-2D, and t-SNE-2D on the same dataset. Which reveals the cleanest separation? Which gives the most reliable downstream features?
- Explain in three sentences why t-SNE embeddings are a poor choice for model inputs.
Runnable Example¶
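A minimal end-to-end sketch, assuming scikit-learn's bundled digits dataset as a stand-in for your own data: scale, read the explained-variance curve, then project to 2D for a quick look at class structure. The 90% and 99% thresholds are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features
X_scaled = StandardScaler().fit_transform(X)

# Fit all components once and read the cumulative explained-variance curve.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
k90 = int(np.searchsorted(cumulative, 0.90) + 1)  # smallest k reaching 90%
k99 = int(np.searchsorted(cumulative, 0.99) + 1)  # smallest k reaching 99%
print(f"components for 90% variance: {k90}, for 99%: {k99}")

# 2D projection for a quick look at class structure.
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
print(f"2D view keeps {pca_2d.explained_variance_ratio_.sum():.1%} of the variance")

# Per-class centroids in the 2D view: a text substitute for a scatter plot.
for digit in range(10):
    cx, cy = X_2d[y == digit].mean(axis=0)
    print(f"digit {digit}: centroid = ({cx:+.2f}, {cy:+.2f})")
```

Plotting X_2d colored by y gives the usual picture; the centroids are printed here only so the sketch runs without a plotting library.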
Longer Connection¶
Dimensionality reduction sits next to:
- Clustering and Low-Dimensional Views — the clustering step that usually follows a reduction
- Advanced Clustering and Dimensionality Reduction — the deeper cut into HDBSCAN, UMAP, and evaluation discipline
- Feature Selection — the alternative way to reduce dimensionality by dropping features rather than transforming them
- Autoencoders and VAEs — the learned-reduction cousin; autoencoders are nonlinear PCA in the modern setup
- Self-Supervised and Representation Learning — when the "reduction" you want is a learned embedding, not a statistical projection
The reduction is not the point. The next step after the reduction — the classifier, the cluster, the visualization, the downstream decision — is the point. If the reduction does not improve that step, do not ship it.