
# Dimensionality Reduction

Techniques (PCA, t-SNE, UMAP) that compress high-dimensional vectors into 2D or 3D for visualization, or shrink embeddings for storage and speed.

Dimensionality reduction transforms high-dimensional data into a lower-dimensional representation while preserving as much relevant structure as possible. The motivation is practical: high-dimensional data is expensive to store, slow to compute over, and prone to the **curse of dimensionality** — in high dimensions, distances become uniformly large and data becomes sparse, undermining the distance-based algorithms that power much of ML.
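The distance-concentration effect can be seen directly in a small numeric sketch (sample sizes and dimensions here are arbitrary illustration choices): as the dimension grows, the gap between the nearest and farthest pair of random points shrinks relative to the distances themselves.

```python
import numpy as np

rng = np.random.default_rng(0)

def distance_contrast(dim, n_points=200):
    """Ratio of (max - min) pairwise distance to the min pairwise distance."""
    points = rng.random((n_points, dim))
    # Pairwise Euclidean distances via broadcasting
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    dists = dists[np.triu_indices(n_points, k=1)]  # unique pairs only
    return (dists.max() - dists.min()) / dists.min()

for dim in (2, 10, 100, 1000):
    # Contrast falls steadily as dimension rises: distances become
    # uniformly large, which is what breaks nearest-neighbour methods.
    print(dim, round(float(distance_contrast(dim)), 3))
```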

## Why dimensionality reduction matters

A raw image might have 224×224×3 = 150,528 pixel values. A tokenised document might have tens of thousands of features. A financial time series monitoring 1,000 instruments over years accumulates millions of measurements. Operating on raw features is computationally wasteful and statistically noisy — most dimensions carry redundant or irrelevant information.

Dimensionality reduction serves two related goals: **compression** (fewer numbers to store and compute) and **learning** (eliminating noise so models can find genuine signal).

## Principal Component Analysis (PCA)

PCA is the canonical linear dimensionality reduction method. It finds the orthogonal directions of maximum variance in the data (the principal components) and projects the data onto the top-k components. The result is a low-dimensional representation that captures as much variance as possible.

PCA is:
- Computationally efficient (via SVD)
- Deterministic and interpretable
- Constrained to linear transformations — it cannot capture curved manifolds

Practical uses: pre-processing before clustering, visualisation (projecting to 2D/3D), noise reduction in sensor data, and reducing embedding dimensions for storage efficiency.
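The SVD route mentioned above can be sketched in a few lines of NumPy — a minimal illustration only, since a production pipeline would normally reach for `sklearn.decomposition.PCA`; the shapes and `k` are example values:

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    X_centered = X - X.mean(axis=0)  # PCA requires centred data
    # Rows of Vt are the orthogonal directions of maximum variance,
    # ordered by the singular values in S
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]               # (k, n_features)
    return X_centered @ components.T  # (n_samples, k)

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 50))
Z = pca(X, k=2)
print(Z.shape)  # (500, 2)
```

Because the components are sorted by singular value, the first projected coordinate always carries at least as much variance as the second.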

## Non-linear methods

- **t-SNE** (t-Distributed Stochastic Neighbour Embedding): preserves local structure — points that are close in high-dimensional space remain close in the 2D embedding. Excellent for visualising cluster structure in embeddings. Not suitable for general compression (non-parametric, does not generalise to new points).
- **UMAP** (Uniform Manifold Approximation and Projection): faster than t-SNE, preserves more global structure, and can be used as a general-purpose transform applied to new data. The current standard for embedding visualisation.
- **Autoencoders**: neural networks trained to reconstruct their input through a lower-dimensional bottleneck layer. The bottleneck activations are the learned low-dimensional representation. Variational autoencoders (VAEs) learn a structured probabilistic latent space.
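A typical visualisation workflow with t-SNE looks like the sketch below (scikit-learn is an assumed dependency; the embedding matrix is synthetic, and the perplexity value is an illustrative choice — it must stay below the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE  # assumes scikit-learn is installed

# Synthetic stand-in for real embeddings: 100 vectors of 64 dimensions
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 64))

# Project to 2D for plotting; fixing random_state makes the run repeatable
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (100, 2)
```

The resulting `coords` array is only useful for plotting the points you fitted on — as noted above, t-SNE has no `transform` for new data, which is the practical reason UMAP is preferred as a reusable transform.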

## Dimensionality reduction in the LLM era

With large language models producing embeddings of 768–4096 dimensions, dimensionality reduction is operationally important:

- **Storage**: a million 1536-dimensional float32 embeddings occupy roughly 6GB (1,000,000 × 1536 × 4 bytes). Reducing to 256 dimensions cuts this to about 1GB.
- **ANN index speed**: HNSW index construction time scales with dimensionality. Reducing dimension via PCA before indexing is a common optimisation.
- **Matryoshka embeddings**: recent embedding models (e.g., OpenAI's text-embedding-3) support native dimensionality reduction — the model is trained so that the first k dimensions of a 1536-dim embedding are already a high-quality k-dim embedding. No post-hoc PCA needed.

For enterprise RAG systems handling large corpora, the practical decision is: use a model that natively supports dimension truncation at the quality level you need, rather than adding a separate PCA step in your pipeline.
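Matryoshka-style truncation itself is trivial to apply — keep the first k dimensions and re-normalise so cosine similarity still behaves. The sketch below shows the mechanics on a random vector; in practice the quality claim only holds for embeddings from a model trained for truncation (e.g., text-embedding-3):

```python
import numpy as np

def truncate_embedding(v, k):
    """Keep the first k dimensions and re-normalise to unit length."""
    v = np.asarray(v, dtype=np.float32)[:k]
    return v / np.linalg.norm(v)

# Random stand-in for a 1536-dim model embedding (illustration only)
full = np.random.default_rng(1).normal(size=1536).astype(np.float32)
short = truncate_embedding(full, 256)
print(short.shape, float(np.linalg.norm(short)))  # (256,) and norm ~1.0
```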
