
Distance Metric

The function used to measure how far apart two vectors are — choice of metric must match how the embedding model was trained.

A distance metric is a function that quantifies how far apart two objects are in some measurable space. Formally, a distance function d(x, y) must satisfy four conditions: non-negativity (d(x,y) ≥ 0), identity of indiscernibles (d(x,y) = 0 if and only if x = y), symmetry (d(x,y) = d(y,x)), and the triangle inequality (d(x,z) ≤ d(x,y) + d(y,z)). Dropping the triangle inequality yields a **semimetric**; functions that violate several axioms, such as cosine similarity, are better described as **similarity measures** than as proper metrics.
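The four axioms can be checked numerically for Euclidean distance. A minimal sketch using NumPy; the vector dimension and random seed are arbitrary choices:

```python
import numpy as np

def euclidean(x, y):
    """Straight-line (L2) distance between two vectors."""
    return float(np.linalg.norm(x - y))

rng = np.random.default_rng(0)
x, y, z = rng.normal(size=(3, 8))

assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # identity of indiscernibles
assert np.isclose(euclidean(x, y), euclidean(y, x))           # symmetry
assert euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)   # triangle inequality
```

The same check run on cosine distance would fail the triangle-inequality assertion for some triples, which is exactly why it is not a proper metric.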

## Common distance metrics

**Euclidean distance**: the straight-line distance in n-dimensional space. The default for continuous, normalised data. Sensitive to scale — features with large ranges dominate. Always normalise before using Euclidean distance.

**Manhattan (L1) distance**: sum of absolute differences across dimensions. Less sensitive to outliers than Euclidean. Used in L1 regularisation, comparisons over sparse high-dimensional features, and taxicab geometry.

**Cosine similarity / cosine distance**: cosine similarity measures the angle between two vectors, ignoring magnitude; cosine distance is 1 minus cosine similarity and, because it violates the triangle inequality, is not a true metric. The standard choice for text embeddings, document similarity, and recommendation systems. When you care about direction rather than scale, cosine similarity is the right choice.

**Dot product**: closely related to cosine similarity but not a proper metric (does not satisfy non-negativity or triangle inequality). Many neural retrieval systems optimise dot product directly for efficiency.

**Hamming distance**: counts positions where two binary strings differ. Used in error-correcting codes, DNA comparison, and hashing-based approximate nearest-neighbour search.

**Jaccard similarity**: size of the intersection divided by size of the union of two sets. Used for set-based similarity — comparing product categories, tag sets, or document term bags.
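The six measures above can each be written in a line or two. A sketch with NumPy; the sample vectors are chosen so that the contrast between cosine and Euclidean is visible:

```python
import numpy as np

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

def manhattan(x, y):
    return float(np.abs(x - y).sum())

def cosine_similarity(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def cosine_distance(x, y):
    return 1.0 - cosine_similarity(x, y)

def dot_product(x, y):
    return float(x @ y)

def hamming(a, b):
    # Count of positions where two equal-length sequences differ.
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def jaccard(s, t):
    # Intersection over union of two sets.
    return len(s & t) / len(s | t)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # same direction, twice the magnitude
cosine_similarity(x, y)          # → 1.0: parallel vectors, angle ignored
manhattan(x, y)                  # → 6.0: magnitude difference shows up
hamming("1011", "1001")          # → 1
jaccard({1, 2, 3}, {2, 3, 4})    # → 0.5
```

Note how the parallel vectors x and y are "identical" under cosine similarity yet clearly separated under Euclidean and Manhattan distance; that gap is the whole reason metric choice matters.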

## Distance metrics in machine learning

Distance metrics underpin several ML algorithms:

- **k-nearest neighbours (k-NN)**: classifies a point by the majority class of its k closest neighbours. Metric choice determines what "close" means — cosine for text, Euclidean for tabular data with normalised features.
- **k-means clustering**: assigns each point to the cluster with the nearest centroid, measured by Euclidean distance by default. Switching to cosine distance yields spherical k-means, better for text.
- **Vector databases and semantic search**: similarity search over embeddings uses cosine or dot product distance. The choice of distance function must match how the embedding model was trained.
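A k-NN classifier with a pluggable metric makes the first bullet concrete. This is a sketch, not a library API; the function name `knn_predict` and the toy data are illustrative:

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3, metric="euclidean"):
    """Classify `query` by majority vote among its k closest training points."""
    if metric == "euclidean":
        dists = np.linalg.norm(X_train - query, axis=1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity.
        sims = (X_train @ query) / (
            np.linalg.norm(X_train, axis=1) * np.linalg.norm(query))
        dists = 1.0 - sims
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Two toy clusters: class 0 along the y-axis, class 1 along the x-axis.
X = np.array([[0.0, 1.0], [0.0, 2.0], [5.0, 0.0], [6.0, 0.0]])
y = np.array([0, 0, 1, 1])
knn_predict(X, y, np.array([0.0, 3.0]), k=3)                    # → 0
knn_predict(X, y, np.array([0.0, 3.0]), k=3, metric="cosine")   # → 0
```

Swapping the `metric` argument is all it takes to change what "close" means, which is why the same classifier can behave very differently on text embeddings versus tabular features.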

## Approximate nearest-neighbour search

Exact nearest-neighbour search scales as O(n·d) per query — prohibitive for billion-vector indices. Approximate nearest-neighbour (ANN) algorithms like **HNSW**, **IVF**, **Annoy**, and **ScaNN** trade a small accuracy loss for orders-of-magnitude speed gains. The distance metric must be supported by the chosen ANN algorithm — cosine and Euclidean are universally supported; custom metrics require custom implementations.
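The IVF idea can be sketched in a few lines: partition the vectors into inverted lists by nearest centroid, then probe only the closest lists at query time. A toy illustration with NumPy, not a production index; the function names and the one-shot centroid assignment are simplifications:

```python
import numpy as np

def build_ivf(vectors, n_lists, seed=0):
    """Toy IVF index: pick n_lists centroids, assign each vector to its nearest."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_lists, replace=False)]
    assign = np.argmin(
        np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_lists)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, n_probe=1):
    """Scan only the n_probe inverted lists whose centroids are closest to the query."""
    order = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([lists[c] for c in order])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return int(candidates[np.argmin(dists)])
```

With `n_probe` equal to `n_lists` the search degenerates to exact brute force; shrinking `n_probe` is the speed-for-recall dial that real IVF indices expose.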

## Practical guidance for enterprise teams

- For embedding-based search (RAG, semantic search, recommendation): use cosine similarity. Normalise embeddings to unit length so that Euclidean distance ranks neighbours identically to cosine similarity, enabling more efficient index structures.
- For tabular data: normalise features, then use Euclidean. Consider domain-specific metrics (Mahalanobis distance accounts for feature correlations).
- Monitor metric choice as part of model evaluation — a wrong metric can make an accurate embedding model look poor in retrieval benchmarks.
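The unit-norm trick in the first bullet rests on the identity that, for unit vectors, squared Euclidean distance equals 2·(1 − cosine similarity). A quick numerical check (dimension and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.normal(size=(2, 16))

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Normalise to unit length; squared Euclidean distance is then 2 * (1 - cos),
# a monotone function of cosine similarity, so a Euclidean index returns the
# same neighbour ranking as a cosine index.
xn, yn = x / np.linalg.norm(x), y / np.linalg.norm(y)
assert np.isclose(np.linalg.norm(xn - yn) ** 2, 2 * (1 - cos))
```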
