Dimensionality Reduction

What This Is

Dimensionality reduction projects high-dimensional data into fewer dimensions while preserving as much useful structure as possible. The goal is to simplify the data for visualization, faster modeling, or noise reduction — not to create a black box.

The decision this page forces is narrow:

which reduction method matches what kind of structure you want to keep, and which downstream step is going to consume the output

PCA for "how much variance is on a few axes"; kernel PCA for "the manifold is curved"; random projections for "I just need cheap compression"; t-SNE / UMAP for "I want a picture."

When You Use It

visualizing high-dimensional clusters or class separations
removing noisy or redundant features before modeling
compressing features for faster training
denoising (PCA reconstruction discards low-variance directions)
debugging a model by inspecting what the data "looks like" in 2D
checking whether classes are separable before building a complex model

Do Not Use It When

the downstream model (e.g., tree-based) does not benefit from orthogonal axes
the reduced dimensionality makes the task harder — compression that discards signal is not denoising, it is destruction
you need interpretable features — a principal component is a linear blend, not a named concept
the method is non-deterministic (t-SNE, UMAP) and you want reproducible training features

Method	Preserves	Linear?	Invertible?	Best for
PCA	global variance	yes	yes (approximate)	linear compression, denoising, preprocessing
Kernel PCA	variance in a nonlinear feature space	no (kernel trick)	partially	curved manifolds, nonlinear denoising
Random projections	pairwise distances (Johnson–Lindenstrauss)	yes	no	cheap compression on very wide data
t-SNE	local neighborhood	no	no	2D visualization of clusters
UMAP	local + some global	no	no	2D/3D viz, faster than t-SNE
LLE / Isomap	local manifold geometry	no	no	classical manifold learning demos
Autoencoder	reconstruction under a learned code	no	approximately (via the decoder)	learned, nonlinear, task-shaped

PCA: Start Here

PCA finds the orthogonal directions of maximum variance and projects the data onto them. It is linear, fast, and approximately invertible (exactly so only when every component is kept), which makes it the first method to try.

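from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is assumed to be a numeric feature matrix of shape (n_samples, n_features)
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)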
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total:              {pca.explained_variance_ratio_.sum():.3f}")

Choosing n_components: explained variance vs. reconstruction

Two honest ways to pick the component count:

# (1) keep a variance fraction
pca = PCA(n_components=0.95).fit(X_scaled)
print(f"Components kept: {pca.n_components_}")

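# (2) keep components until held-out reconstruction error stops dropping
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_fit, X_val = train_test_split(X_scaled, test_size=0.25, random_state=0)
max_k = min(X_fit.shape)  # PCA cannot return more components than this
errs = {}
for k in [1, 2, 4, 8, 16, 32, 64]:
    if k > max_k:
        break
    p = PCA(n_components=k).fit(X_fit)
    X_rec = p.inverse_transform(p.transform(X_val))
    errs[k] = mean_squared_error(X_val, X_rec)

Explained variance is the faster pick, but it only measures how much variance is kept, not whether that variance is signal. On the rows PCA was fit on, reconstruction error carries the same information in different units (it is the discarded variance), so measure it on held-out rows: that shows whether the components generalize. Neither criterion knows about the target. If the two disagree, or if low-variance directions might carry the signal, inspect the task; for prediction, the validation score of the downstream model is the final arbiter.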
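A quick numerical check of how the two criteria relate on the training rows. This is a sketch on synthetic data (the array X_demo and the choice k = 3 are illustrative assumptions): the reconstruction MSE on the rows PCA was fit on equals the discarded variance, rescaled from a summed ddof=1 variance to a per-entry mean.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for X_scaled: 200 samples, 10 correlated features (assumption).
X_demo = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

n, d = X_demo.shape
total_var = X_demo.var(axis=0, ddof=1).sum()

k = 3
pca_k = PCA(n_components=k).fit(X_demo)
X_rec = pca_k.inverse_transform(pca_k.transform(X_demo))

train_mse = np.mean((X_demo - X_rec) ** 2)
discarded_var = total_var - pca_k.explained_variance_.sum()

# (n - 1) / (n * d) converts a summed, ddof=1 variance into a per-entry mean.
print(train_mse, discarded_var * (n - 1) / (n * d))
```

The two printed numbers should agree up to floating-point error, which is why a reconstruction curve only adds information when it is computed on rows the PCA has not seen.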

PCA failure modes

the structure is nonlinear (spirals, nested clusters) — PCA's axes miss the manifold
variance is dominated by noise on a few features — PCA happily projects onto noise
features live on very different scales, so without StandardScaler the largest-scale features dominate the components regardless of how informative they are
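The scale failure is easy to reproduce. The sketch below uses made-up data (two correlated unit-scale features plus one pure-noise feature in much larger units, all assumptions for illustration) and compares the first component's loadings with and without standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# Two correlated, informative features on a unit scale...
signal = rng.normal(size=n)
f1 = signal + 0.1 * rng.normal(size=n)
f2 = signal + 0.1 * rng.normal(size=n)
# ...plus one pure-noise feature recorded in much larger units.
f3 = 1000.0 * rng.normal(size=n)
X_raw = np.column_stack([f1, f2, f3])

pc1_raw = PCA(n_components=1).fit(X_raw).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X_raw)).components_[0]

print("PC1 loadings, raw:   ", np.round(np.abs(pc1_raw), 3))
print("PC1 loadings, scaled:", np.round(np.abs(pc1_std), 3))
```

On the raw matrix the first component points almost entirely along the noisy large-unit feature; after standardization it loads on the two correlated features instead.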

Kernel PCA: When The Manifold Is Curved

Kernel PCA applies PCA in a nonlinear feature space via the kernel trick. It captures curved structure that standard PCA misses.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X_scaled)

Use it when:

the classes are not linearly separable in the original space
you want to preprocess for a linear classifier on a nonlinear problem
you need a reduction with a well-defined out-of-sample transform (unlike t-SNE / UMAP)

Caveats:

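gamma is the sensitive knob: too small and the RBF kernel is nearly flat, so the result is close to linear PCA; too large and each point resembles only itself, so the components fit noise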
fitting scales as O(N²) in memory for the kernel matrix; unsuitable for very large datasets
inverse transform is approximate and only available when requested at fit time
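A small sketch of the "linear classifier on a nonlinear problem" use case, assuming scikit-learn's make_circles toy data. The value gamma=2 is hand-picked for this dataset and is not a general default; on real data it needs tuning.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy data: a small ring inside a larger ring; not linearly separable in 2D.
X_circ, y_circ = make_circles(n_samples=1000, factor=0.3, noise=0.05, random_state=0)

# gamma=2 is a hand-picked value for this toy data (assumption).
linear = make_pipeline(PCA(n_components=2), LogisticRegression())
kernel = make_pipeline(
    KernelPCA(n_components=2, kernel="rbf", gamma=2), LogisticRegression()
)

# Both reducers are refit inside each CV fold, so no information leaks across folds.
acc_pca = cross_val_score(linear, X_circ, y_circ, cv=5).mean()
acc_kpca = cross_val_score(kernel, X_circ, y_circ, cv=5).mean()
print(f"linear PCA + logistic regression: {acc_pca:.2f}")
print(f"kernel PCA + logistic regression: {acc_kpca:.2f}")
```

Linear PCA only rotates the rings, so the linear classifier stays near chance; the RBF kernel PCA coordinates should pull the inner ring away from the outer one so the same classifier can separate them.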

Random Projections: Cheap Compression At Scale

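The Johnson–Lindenstrauss lemma says that N points in any number of dimensions can be mapped by a random linear map into O(log N / ε²) dimensions while, with high probability, keeping every pairwise distance within a factor of 1 ± ε. The target dimension depends on N and ε, not on the original feature count. No training, no SVD.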

from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

d = johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.1)
rp = GaussianRandomProjection(n_components=d, random_state=0)
X_rp = rp.fit_transform(X_wide)

When random projections win:

the feature count is in the tens of thousands (TF-IDF, hashed features, bag-of-words)
PCA's SVD is too slow or memory-intensive
any distance-preserving embedding is enough for the downstream task (kNN, linear classifier with StandardScaler)

When they lose: small feature counts and tasks that need variance-structured axes. PCA's first two components can be interpretable; a random-projection axis is random.
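To make the distance-preservation claim concrete, the sketch below first asks scikit-learn for the Johnson–Lindenstrauss dimension at two settings, then projects a small synthetic matrix (50 Gaussian points in 5,000 dimensions, an assumption for illustration) and measures how much each pairwise squared distance changed.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import (
    GaussianRandomProjection,
    johnson_lindenstrauss_min_dim,
)

# Dimension the lemma asks for with 10,000 points at 10% distortion.
d_10k = int(johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1))
print("JL dimension for 10k points, eps=0.1:", d_10k)

# Small synthetic check (assumption: 50 Gaussian points in 5,000 dimensions).
rng = np.random.default_rng(0)
X_wide_demo = rng.normal(size=(50, 5000))
eps = 0.3
d = int(johnson_lindenstrauss_min_dim(n_samples=X_wide_demo.shape[0], eps=eps))
X_small = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X_wide_demo)

# Ratio of projected to original squared distance for every distinct pair.
iu = np.triu_indices(X_wide_demo.shape[0], k=1)
ratio = (pairwise_distances(X_small)[iu] / pairwise_distances(X_wide_demo)[iu]) ** 2
print(f"d = {d}, squared-distance ratio range: {ratio.min():.3f} to {ratio.max():.3f}")
```

Note how large the recommended dimension is for a tight eps: the bound is conservative, and it only pays off when the original feature count is far larger still. For sparse inputs such as TF-IDF, scikit-learn's SparseRandomProjection is the variant designed to keep the projection cheap.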

t-SNE: For Visualization Only

t-SNE maps points so that nearby neighbors in high-dimensional space stay close in 2D. It is nonlinear and non-invertible.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X_scaled)

perplexity is the effective neighbor count — ~5 to ~50 typical. Low values emphasize tight local structure, high values broader patterns.

Trust and distrust carefully:

cluster membership — usually meaningful if clearly separated
distances between clusters — not meaningful
cluster sizes and shapes — depend on density, not real geometry
axes — no fixed interpretation
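One way to put a number on the part of a t-SNE plot that can be trusted is scikit-learn's trustworthiness score, which checks whether neighbors in the 2D embedding were also neighbors in the original space. The sketch below assumes a 500-image slice of the bundled digits data as a stand-in for X_scaled and repeats the fit with two seeds.

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Assumption: a 500-image slice of the scikit-learn digits data stands in for X_scaled.
X_digits, _ = load_digits(return_X_y=True)
X_s = StandardScaler().fit_transform(X_digits[:500])

scores = {}
for seed in (0, 1):
    emb = TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X_s)
    # Score in [0, 1]: are the 2D neighbors also neighbors in the original space?
    scores[seed] = trustworthiness(X_s, emb, n_neighbors=10)

print(scores)
```

The score speaks only to local neighborhoods. It says nothing about distances between clusters or about cluster sizes, which remain untrustworthy however high the score is.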

UMAP: Faster Alternative

UMAP is similar in spirit to t-SNE but faster, handles out-of-sample transform, and usually preserves more global structure.

import umap  # third-party package, installed as umap-learn
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X_scaled)

n_neighbors — local vs global balance; higher preserves more global structure
min_dist — cluster tightness; lower packs clusters tighter

UMAP can be used as a preprocessor for classifiers in a way t-SNE cannot, because it has a sensible transform() for new data. The embedding is still stochastic per fit, so pin the seed and library version, and check that the downstream score holds across several seeds before treating the coordinates as stable features.

The Reduction Ladder

Start with PCA to check the variance structure and get a fast linear baseline.
Try kernel PCA or an autoencoder if the manifold is curved and you want to stay quantitative.
Reach for random projections only when the feature count is huge and you need speed.
Reach for t-SNE or UMAP only for visualization — not as preprocessing for classifiers (except carefully with UMAP).
Always scale first with StandardScaler before any distance-based reduction.

What To Inspect

the explained-variance curve — does it bend sharply at a small k?
the reconstruction error curve on held-out data — should match the train curve closely
per-class density in the 2D view: clean separation suggests a simple classifier will work, but overlap in 2D does not prove the classes overlap in the full feature space
stability across seeds — stochastic methods (t-SNE, UMAP, autoencoders) should produce the same story across seeds, even if exact coordinates differ
the downstream model's score with and without the reduction — the only final judge of whether the reduction kept the signal
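The train versus held-out reconstruction check from the list above can be sketched as follows. The digits dataset, the 70/30 split, and the helper recon_mse are illustrative assumptions, not part of any library API.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_digits, _ = load_digits(return_X_y=True)
X_tr, X_te = train_test_split(X_digits, test_size=0.3, random_state=0)

scaler = StandardScaler().fit(X_tr)  # fit preprocessing on the training split only
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)


def recon_mse(pca, X):
    """Mean squared error after projecting X onto the components and back."""
    return float(np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2))


curve = {}
for k in (2, 4, 8, 16, 32):
    pca = PCA(n_components=k).fit(X_tr_s)
    curve[k] = (recon_mse(pca, X_tr_s), recon_mse(pca, X_te_s))

for k, (tr, te) in curve.items():
    print(f"k={k:>2}  train MSE={tr:.3f}  held-out MSE={te:.3f}")
```

A held-out curve that tracks the training curve suggests the components describe the data distribution rather than one sample of it; a widening gap at larger k is a hint that the extra components are fitting noise.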

Failure Pattern

The canonical failure is using t-SNE output as features for a classifier. t-SNE is a visualization tool: the embedding changes with hyperparameters and seeds, and scikit-learn's TSNE has no transform() for new data.

The second canonical failure is interpreting inter-cluster distances in a t-SNE plot as meaningful. They are not: the objective rewards keeping neighbors together and does little to preserve the distances between well-separated groups.

A third, more subtle failure is PCA on unscaled data. A single feature with a huge scale dominates the first component. Always apply StandardScaler before PCA unless the features already share comparable scales.

Common Mistakes

forgetting to scale before PCA
using too many PCA components and keeping all the noise
treating t-SNE clusters as ground truth for labeling
using t-SNE on very large datasets without downsampling first
comparing two t-SNE plots with different perplexity values as if they show the same thing
deploying UMAP features into a production model without locking seed and version
using random projections on small or well-curated feature sets (PCA beats them there)

Decision: Which Reduction

Goal	Pick	Why
compress for a downstream classifier	PCA	linear, invertible, stable
the manifold is clearly curved	kernel PCA or an autoencoder	nonlinear structure preserved
thousands of sparse features, need speed	random projections	cheap, distance-preserving
visualize local cluster structure	t-SNE	strongest for tight local groups
visualize global + local, or want `.transform()`	UMAP	faster, slightly more global
denoise images / signals with reconstruction	PCA or an autoencoder	reconstruction error is the objective

Practice

Apply PCA to a dataset and plot the explained-variance curve. How many components capture 90%? How many capture 99%?
Run PCA with n_components=0.95 and compare the downstream model's CV score to using the full feature set. Did the reduction help, hurt, or tie?
Build a non-linearly separable toy dataset (concentric circles). Run PCA and kernel PCA with an RBF kernel; plot both. Explain the difference.
Generate a 50k-row × 20k-feature sparse matrix. Time PCA vs. GaussianRandomProjection to 100 dimensions. Report both times and a downstream kNN score.
Visualize the same data with t-SNE at perplexity 5, 30, and 100. Name what each emphasizes and what it distorts.
Compare PCA-2D, UMAP-2D, and t-SNE-2D on the same dataset. Which reveals the cleanest separation? Which gives the most reliable downstream features?
Explain in three sentences why t-SNE embeddings are a poor choice for model inputs.

Runnable Example
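A minimal end-to-end sketch, assuming scikit-learn's bundled digits dataset and a logistic-regression downstream model (both are illustrative choices, not prescribed by this page). It compares the cross-validated score of the same classifier with and without a 95%-variance PCA step, with scaling and PCA refit inside every fold.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)  # 1797 samples, 64 pixel features

baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
reduced = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=2000)
)

score_full = cross_val_score(baseline, X, y, cv=5).mean()
score_pca = cross_val_score(reduced, X, y, cv=5).mean()

# Refit once on all data only to report how many components 95% variance needs.
n_kept = reduced.fit(X, y).named_steps["pca"].n_components_

print(f"all {X.shape[1]} features: CV accuracy {score_full:.3f}")
print(f"PCA, {n_kept} components: CV accuracy {score_pca:.3f}")
```

Whether the reduced pipeline helps, hurts, or ties is the question that matters; the component count and the explained-variance figure are only inputs to it.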

Longer Connection

Dimensionality reduction sits next to:

Clustering and Low-Dimensional Views — the clustering step that usually follows a reduction
Advanced Clustering and Dimensionality Reduction — the deeper cut into HDBSCAN, UMAP, and evaluation discipline
Feature Selection — the alternative way to reduce dimensionality by dropping features rather than transforming them
Autoencoders and VAEs — the learned-reduction cousin; autoencoders are nonlinear PCA in the modern setup
Self-Supervised and Representation Learning — when the "reduction" you want is a learned embedding, not a statistical projection

The reduction is not the point. The next step after the reduction — the classifier, the cluster, the visualization, the downstream decision — is the point. If the reduction does not improve that step, do not ship it.