Dimensionality Reduction

What This Is

Dimensionality reduction projects high-dimensional data into fewer dimensions while preserving as much useful structure as possible. The goal is to simplify the data for visualization, faster modeling, or noise reduction — not to create a black box.

The decision this page forces is narrow:

  • which reduction method matches the structure you want to keep and the downstream step that will consume the output

As a first cut: PCA for "how much variance is on a few axes"; kernel PCA for "the manifold is curved"; random projections for "I just need cheap compression"; t-SNE / UMAP for "I want a picture."

When You Use It

  • visualizing high-dimensional clusters or class separations
  • removing noisy or redundant features before modeling
  • compressing features for faster training
  • denoising (PCA reconstruction discards low-variance directions)
  • debugging a model by inspecting what the data "looks like" in 2D
  • checking whether classes are separable before building a complex model

Do Not Use It When

  • the downstream model (e.g., tree-based) does not benefit from orthogonal axes
  • the reduced dimensionality makes the task harder — compression that discards signal is not denoising, it is destruction
  • you need interpretable features — a principal component is a linear blend, not a named concept
  • the method is non-deterministic (t-SNE, UMAP) and you want reproducible training features

The Method Menu

Method | Preserves | Linear? | Invertible? | Best for
PCA | global variance | yes | yes (approximate) | linear compression, denoising, preprocessing
Kernel PCA | variance in a nonlinear feature space | no (kernel trick) | partially | curved manifolds, nonlinear denoising
Random projections | pairwise distances (Johnson–Lindenstrauss) | yes | no | cheap compression on very wide data
t-SNE | local neighborhood | no | no | 2D visualization of clusters
UMAP | local + some global | no | no | 2D/3D viz, faster than t-SNE
LLE / Isomap | local manifold geometry | no | no | classical manifold learning demos
Autoencoder | reconstruction under a learned code | no | yes | learned, nonlinear, task-shaped

PCA — Start Here

PCA finds the orthogonal directions of maximum variance and projects the data onto them. It is linear, fast, approximately invertible (exactly only when every component is kept), and the first thing to try.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(f"Explained variance: {pca.explained_variance_ratio_}")
print(f"Total:              {pca.explained_variance_ratio_.sum():.3f}")

Choosing n_components — explained variance vs. reconstruction

Two honest ways to pick the component count:

from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error

# (1) keep a variance fraction
pca = PCA(n_components=0.95).fit(X_scaled)
print(f"Components kept: {pca.n_components_}")

# (2) keep components until reconstruction error stops dropping
# each k must be <= min(n_samples, n_features)
errs = []
for k in [1, 2, 4, 8, 16, 32, 64]:
    p = PCA(n_components=k).fit(X_scaled)
    X_rec = p.inverse_transform(p.transform(X_scaled))  # project down, then back up
    errs.append(mean_squared_error(X_scaled, X_rec))

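Explained variance is the faster pick, but it only measures how much variance is kept, not whether that variance is signal. On the training data, reconstruction error is the same information seen from the other side: the mean squared error equals the discarded variance up to a constant factor. It becomes an independent check when measured on held-out data, where a curve that stops dropping (or pulls away from the train curve) marks components that fit noise. If the two picks disagree, inspect the task; for prediction, the validation score of the downstream model is the final arbiter.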
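
The link between the two criteria can be checked numerically. The sketch below is illustrative only: it assumes the scikit-learn wine dataset as a stand-in for X_scaled and compares the train reconstruction MSE against the variance of the dropped components, rescaled to the same denominator.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Assumption: the wine dataset stands in for the page's X_scaled.
X_scaled = StandardScaler().fit_transform(load_wine().data)
n_samples, n_features = X_scaled.shape

full = PCA().fit(X_scaled)  # all components, used as the variance reference

rows = []
for k in [1, 2, 4, 8]:
    p = PCA(n_components=k).fit(X_scaled)
    X_rec = p.inverse_transform(p.transform(X_scaled))
    mse = mean_squared_error(X_scaled, X_rec)
    # explained_variance_ divides by (n_samples - 1); the MSE divides by
    # n_samples * n_features, so rescale before comparing.
    discarded = full.explained_variance_[k:].sum() * (n_samples - 1) / (n_samples * n_features)
    rows.append((k, mse, discarded))
    print(f"k={k}: train MSE={mse:.4f}  discarded variance={discarded:.4f}")
```

Because the two columns agree on training data, picking k by train reconstruction error adds nothing over the explained-variance curve; the held-out version is the one that carries new information.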

PCA failure modes

  • the structure is nonlinear (spirals, nested clusters) — PCA's axes miss the manifold
  • variance is dominated by noise on a few features — PCA happily projects onto noise
  • features live on very different scales — PCA without StandardScaler is meaningless
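
The scale and noise failure modes can be reproduced on synthetic data. This is a minimal sketch under made-up assumptions: two correlated features carry a shared signal, and a third feature is pure noise with a scale of about 1000.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
X = np.column_stack([
    signal + 0.1 * rng.normal(size=n),   # feature 0: carries the signal
    signal + 0.1 * rng.normal(size=n),   # feature 1: carries the signal
    1000.0 * rng.normal(size=n),         # feature 2: pure noise, huge scale
])

# First principal axis with and without standardization.
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
scaled_pc1 = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw:   ", np.round(np.abs(raw_pc1), 3))
print("PC1 loadings, scaled:", np.round(np.abs(scaled_pc1), 3))
```

On the raw data the first component should point almost entirely along the noisy large-scale feature; after scaling it should load on the two signal features instead.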

Kernel PCA — When The Manifold Is Curved

Kernel PCA applies PCA in a nonlinear feature space via the kernel trick. It captures curved structure that standard PCA misses.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1, fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X_scaled)

Use it when:

  • the classes are not linearly separable in the original space
  • you want to preprocess for a linear classifier on a nonlinear problem
  • you need a reduction with a well-defined out-of-sample transform (unlike t-SNE / UMAP)

Caveats:

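  • gamma is the sensitive knob: too small and the RBF kernel flattens out, so the result behaves like linear PCA; too large and each point is similar only to itself, so the components fit noise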
  • fitting scales as O(N²) in memory for the kernel matrix; unsuitable for very large datasets
  • inverse transform is approximate and only available when requested at fit time

Random Projections — Cheap Compression At Scale

The Johnson–Lindenstrauss lemma says: you can project N points from any dimension down to O(log N / ε²) dimensions and approximately preserve pairwise distances, using a random linear map. No training, no SVD.

from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

d = johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.1)
rp = GaussianRandomProjection(n_components=d, random_state=0)
X_rp = rp.fit_transform(X_wide)

When random projections win:

  • the feature count is in the tens of thousands (TF-IDF, hashed features, bag-of-words)
  • PCA's SVD is too slow or memory-intensive
  • any distance-preserving embedding is enough for the downstream task (kNN, linear classifier with StandardScaler)

When they lose: small feature counts and tasks that need variance-structured axes. PCA's first two components can be interpretable; a random-projection axis is random.
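
The distance-preservation claim can be checked directly. A minimal sketch, assuming synthetic Gaussian data in place of X_wide and a loose eps of 0.25 so the target dimension stays small:

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

rng = np.random.default_rng(0)
n_samples, n_features, eps = 200, 5_000, 0.25
X_wide = rng.normal(size=(n_samples, n_features))  # synthetic stand-in data

# Target dimension depends on the sample count and eps, not on n_features.
d = johnson_lindenstrauss_min_dim(n_samples=n_samples, eps=eps)
X_rp = GaussianRandomProjection(n_components=d, random_state=0).fit_transform(X_wide)

# Ratio of projected to original distance for every pair of points.
iu = np.triu_indices(n_samples, k=1)
ratio = pairwise_distances(X_rp)[iu] / pairwise_distances(X_wide)[iu]

print(f"target dimension: {d} (from {n_features})")
print(f"distance ratio range: {ratio.min():.3f} .. {ratio.max():.3f}")
```

The lemma bounds squared distances within a factor of 1 ± eps with high probability, so the ratios should cluster near 1. Note that for small eps the required dimension grows quickly, which is why the method only pays off on very wide data.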

t-SNE — For Visualization Only

t-SNE maps points so that nearby neighbors in high-dimensional space stay close in 2D. It is nonlinear and non-invertible.

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X_scaled)

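perplexity is roughly the effective number of neighbors each point considers; typical values run from about 5 to 50, and it must be smaller than the number of samples. Low values emphasize tight local structure, high values broader patterns.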

Trust and distrust carefully:

  • cluster membership — usually meaningful if clearly separated
  • distances between clusters — not meaningful
  • cluster sizes and shapes — depend on density, not real geometry
  • axes — no fixed interpretation
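
One way to put a number on "nearby neighbors stay close" is scikit-learn's trustworthiness score, which measures how well local neighborhoods survive the embedding. The sketch below is an illustration under assumptions: a 500-sample subset of the digits dataset, n_neighbors=10, and two seeds.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X[:500])  # small subset keeps t-SNE quick

# Linear 2D baseline for comparison.
X_pca = PCA(n_components=2).fit_transform(X_scaled)
scores = {"PCA": trustworthiness(X_scaled, X_pca, n_neighbors=10)}

# Same data, two seeds: coordinates differ, local structure should not.
for seed in (0, 1):
    X_2d = TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X_scaled)
    scores[f"t-SNE seed={seed}"] = trustworthiness(X_scaled, X_2d, n_neighbors=10)

for name, value in scores.items():
    print(f"{name:<14} trustworthiness = {value:.3f}")
```

The score only speaks to local neighborhoods. It says nothing about distances between clusters or about cluster shapes, which stay untrustworthy however high the number is.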

UMAP — Faster Alternative

UMAP is similar in spirit to t-SNE but faster, supports an out-of-sample transform for new data, and usually preserves more global structure.

import umap
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=0)
X_2d = reducer.fit_transform(X_scaled)

  • n_neighbors — local vs global balance; higher preserves more global structure
  • min_dist — cluster tightness; lower packs clusters tighter

UMAP can be used as a preprocessor for classifiers in a way t-SNE cannot, because it has a sensible transform() for new data — but the reduction is still stochastic per fit, so treat it as one seed in an inspection, not a stable feature.

The Reduction Ladder

  1. Start with PCA to check the variance structure and get a fast linear baseline.
  2. Try kernel PCA or an autoencoder if the manifold is curved and you want to stay quantitative.
  3. Reach for random projections only when the feature count is huge and you need speed.
  4. Reach for t-SNE or UMAP only for visualization — not as preprocessing for classifiers (except carefully with UMAP).
  5. Always scale first with StandardScaler before any distance-based reduction.

What To Inspect

  • the explained-variance curve — does it bend sharply at a small k?
  • the reconstruction error curve on held-out data — should match the train curve closely
  • per-class density in the 2D view: heavy overlap in every projection you try is a warning that a downstream classifier may struggle, although classes that overlap in 2D can still be separable in the full space
  • stability across seeds — stochastic methods (t-SNE, UMAP, autoencoders) should produce the same story across seeds, even if exact coordinates differ
  • the downstream model's score with and without the reduction — the only final judge of whether the reduction kept the signal
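
The train versus held-out reconstruction check from this list can be sketched as follows. The digits dataset, the 70/30 split, and the k grid are assumptions made for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

# Fit the scaler on train only so the held-out curve stays honest.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

curve = []
for k in [2, 8, 16, 32]:
    p = PCA(n_components=k).fit(X_train_s)
    err_train = mean_squared_error(X_train_s, p.inverse_transform(p.transform(X_train_s)))
    err_test = mean_squared_error(X_test_s, p.inverse_transform(p.transform(X_test_s)))
    curve.append((k, err_train, err_test))
    print(f"k={k:>2}  train MSE={err_train:.3f}  held-out MSE={err_test:.3f}")
```

A held-out curve that tracks the train curve suggests the components generalize; a widening gap at larger k suggests the extra components are fitting noise in the training split.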

Failure Pattern

The canonical failure is using t-SNE output as features for a classifier. t-SNE is a visualization tool — the embedding changes with hyperparameters and seeds, and transform() for new data does not exist.
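
The missing transform() is easy to confirm in scikit-learn itself. A small check, comparing the TSNE estimator against PCA:

```python
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# PCA learns a mapping it can reapply; scikit-learn's TSNE only embeds
# the data it was fitted on.
has_transform = {}
for est in (PCA(n_components=2), TSNE(n_components=2)):
    has_transform[type(est).__name__] = hasattr(est, "transform")
    print(f"{type(est).__name__:<5} has transform(): {has_transform[type(est).__name__]}")
```

Without transform(), new rows at prediction time cannot be mapped into the same embedding, so a classifier trained on t-SNE coordinates has nothing valid to consume.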

The second canonical failure is interpreting inter-cluster distances in a t-SNE plot as meaningful. They are not; t-SNE explicitly distorts them.

A third, more subtle failure is PCA on unscaled data. A single feature with a huge scale dominates the first component. Always StandardScaler before PCA unless the scales are already comparable.

Common Mistakes

  • forgetting to scale before PCA
  • using too many PCA components and keeping all the noise
  • treating t-SNE clusters as ground truth for labeling
  • using t-SNE on very large datasets without downsampling first
  • comparing two t-SNE plots with different perplexity values as if they show the same thing
  • deploying UMAP features into a production model without locking seed and version
  • using random projections on small or well-curated feature sets (PCA beats them there)

Decision: Which Reduction

Goal | Pick | Why
compress for a downstream classifier | PCA | linear, invertible, stable
the manifold is clearly curved | kernel PCA or an autoencoder | nonlinear structure preserved
thousands of sparse features, need speed | random projections | cheap, distance-preserving
visualize local cluster structure | t-SNE | strongest for tight local groups
visualize global + local, or want .transform() | UMAP | faster, slightly more global
denoise images / signals with reconstruction | PCA or an autoencoder | reconstruction error is the objective

Practice

  1. Apply PCA to a dataset and plot the explained-variance curve. How many components capture 90%? How many capture 99%?
  2. Run PCA with n_components=0.95 and compare the downstream model's CV score to using the full feature set. Did the reduction help, hurt, or tie?
  3. Build a non-linearly separable toy dataset (concentric circles). Run PCA and kernel PCA with an RBF kernel; plot both. Explain the difference.
  4. Generate a 50k-row × 20k-feature sparse matrix. Time PCA vs. GaussianRandomProjection to 100 dimensions. Report both times and a downstream kNN score.
  5. Visualize the same data with t-SNE at perplexity 5, 30, and 100. Name what each emphasizes and what it distorts.
  6. Compare PCA-2D, UMAP-2D, and t-SNE-2D on the same dataset. Which reveals the cleanest separation? Which gives the most reliable downstream features?
  7. Explain in three sentences why t-SNE embeddings are a poor choice for model inputs.

Runnable Example
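
A minimal end-to-end sketch, not a reference implementation: it assumes the scikit-learn digits dataset and a logistic-regression classifier, and compares the cross-validated score with and without a PCA step that keeps 95% of the variance.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)

# Scaling and PCA live inside the pipeline so they are refit per CV fold.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
reduced = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=2000))

score_full = cross_val_score(baseline, X, y, cv=5).mean()
score_pca = cross_val_score(reduced, X, y, cv=5).mean()

# Refit once on all data just to report how many components 95% needs.
reduced.fit(X, y)
k = reduced.named_steps["pca"].n_components_

print(f"full features ({X.shape[1]}): CV accuracy {score_full:.3f}")
print(f"PCA 95% ({k} components): CV accuracy {score_pca:.3f}")
```

If the reduced pipeline ties or beats the baseline, the compression is free; if it loses clearly, the discarded components carried signal for this task.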

Longer Connection

The reduction is not the point. The next step after the reduction — the classifier, the cluster, the visualization, the downstream decision — is the point. If the reduction does not improve that step, do not ship it.