advanced · Deep Learning

Multi-Head Attention

Parallel attention computations that let a model attend to different relationship types simultaneously — the workhorse layer of every Transformer.

Multi-head attention is the mechanism that runs several self-attention operations in parallel on different learned projections of the input, then concatenates their outputs. Where single-head self-attention computes one weighted combination of values, multi-head attention computes `h` combinations (typically h=8, 12, 16, or 32, depending on the model), each with its own query/key/value projection matrices. The intuition is that different heads can specialise in different relationship types: one in syntactic structure, one in coreference, one in positional relationships, one in topical relevance. In practice the learned specialisations are messier than that neat story, but the diversity of representations that multiple heads produce is empirically what makes Transformers work as well as they do.
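The mechanism can be sketched in a few lines of NumPy. This is an illustrative toy, not any particular library's implementation: random weights, no causal mask, no batching, and the function and variable names are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h):
    """x: (seq, d_model); each W: (d_model, d_model); h: number of heads."""
    seq, d_model = x.shape
    d_head = d_model // h

    # Project once at full width, then split the model dimension across h heads.
    def split(t):  # (seq, d_model) -> (h, seq, d_head)
        return t.reshape(seq, h, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, seq, seq)
    out = softmax(scores) @ v                            # (h, seq, d_head)

    # Concatenate heads back to d_model and apply the output projection.
    concat = out.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq, d_model, h = 5, 16, 4
x = rng.normal(size=(seq, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
y = multi_head_attention(x, W_q, W_k, W_v, W_o, h=h)
print(y.shape)  # -> (5, 16)
```

Note that each head works at dimension `d_model / h`, so the h attention computations together touch the same number of parameters as one full-width head would.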

The mathematical structure matters for efficiency. In the standard formulation, the full-dimensional hidden state `d_model` is split across `h` heads of dimension `d_model / h` each, so multi-head attention costs roughly the same FLOPs as a single attention head at full dimension — you get the diversity bonus essentially for free. Modern large-model variants tweak this — **grouped-query attention** (GQA) shares keys and values across groups of heads, **multi-query attention** (MQA) shares them across all heads — trading some representational capacity for meaningful inference-memory savings. Llama, Mistral, and most 2024+ open-weight models use GQA.
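The KV-sharing idea behind GQA and MQA is mechanically simple: keys and values are computed for fewer heads, then each KV head is reused by a group of query heads. A hedged NumPy sketch (our own function names, no mask or batching, illustrative only):

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """q: (h, seq, d_head); k, v: (n_kv_heads, seq, d_head).
    Each KV head serves h // n_kv_heads query heads.
    n_kv_heads == h recovers standard multi-head attention;
    n_kv_heads == 1 is multi-query attention (MQA)."""
    h, seq, d_head = q.shape
    group = h // n_kv_heads
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, group, axis=0)  # (h, seq, d_head)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v  # (h, seq, d_head)

rng = np.random.default_rng(1)
h, n_kv, seq, d_head = 8, 2, 4, 16
q = rng.normal(size=(h, seq, d_head))
k = rng.normal(size=(n_kv, seq, d_head))
v = rng.normal(size=(n_kv, seq, d_head))
out = grouped_query_attention(q, k, v, n_kv_heads=n_kv)
print(out.shape)  # -> (8, 4, 16)
```

The payoff is at inference time: the KV cache only needs to store `n_kv_heads` heads' worth of keys and values per layer instead of `h`.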

For APAC mid-market practitioners, multi-head attention is a detail you inherit rather than design. It matters when choosing hardware (attention's memory pattern dominates GPU memory usage during long-context inference), when choosing quantisation schemes (per-head vs per-layer quantisation has different quality/cost tradeoffs), and when reading model architecture specs (head count and head dimension hint at inference cost characteristics).
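Why head count and head dimension hint at inference cost: KV-cache size scales linearly with both. A back-of-envelope estimator (the model shape below is illustrative, not a spec for any specific model):

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, one entry per token per KV head.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Illustrative 7B-class shape: 32 layers, head dim 128, fp16 cache, 8k context.
full_mha = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=1)  # 32 KV heads (MHA)
gqa      = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=1)   # 8 KV heads (GQA)
print(full_mha / 2**30, gqa / 2**30)  # -> 4.0 1.0 (GiB)
```

At long contexts this cache, not the weights, is often what forces the jump to a bigger GPU, which is why GQA's 4x reduction in the example above matters operationally.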

The non-obvious operational note: **head pruning is a cheap inference-cost lever**. Research consistently finds that not every head contributes equally — often 30-50% of heads can be pruned with minimal quality loss on specific downstream tasks. This is rarely used for general-purpose deployment because the right heads to prune vary by task, but for single-task specialised deployments (classification, summarisation of a known document type) it can cut inference cost noticeably without touching the underlying model more invasively.
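Before weights are physically removed, head pruning is usually prototyped by masking: zeroing a head's output ahead of the concatenation and output projection removes its contribution exactly. A minimal sketch with made-up shapes and mask:

```python
import numpy as np

def mask_heads(head_outputs, keep):
    """head_outputs: (h, seq, d_head); keep: boolean mask of shape (h,).
    Zeroed heads contribute nothing to the concat + output projection,
    so this simulates pruning without changing any weights."""
    return head_outputs * np.asarray(keep)[:, None, None]

rng = np.random.default_rng(0)
head_outputs = rng.normal(size=(8, 5, 16))       # 8 heads of attention output
keep = np.array([True] * 6 + [False] * 2)        # candidate: prune the last 2 heads
pruned = mask_heads(head_outputs, keep)
```

In practice you would evaluate task quality with each candidate mask and keep the cheapest mask that stays within your quality budget; once a mask is fixed, the corresponding projection rows can be deleted for a real speed and memory win.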

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
