The Transformer, introduced by Vaswani et al. in 2017's "Attention Is All You Need", is the neural-network architecture that replaced recurrent networks for sequence modelling and became the backbone of essentially every modern AI system — GPT, Claude, Gemini, BERT, T5, Llama, Stable Diffusion, CLIP, Whisper, most modern TTS, most modern video models. The core innovation was **self-attention**: a mechanism that lets every position in a sequence directly attend to every other position in one operation, without the step-by-step recurrence that made RNNs slow to train and hard to scale.
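The scaled dot-product self-attention described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration; the function and weight names are ours, not from the paper:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project inputs to queries, keys, values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # every position scores every other: (n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability for softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ V                               # each output is a weighted sum of values

rng = np.random.default_rng(0)
n, d = 5, 8                                          # toy sizes: 5 tokens, model width 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)                                     # (5, 8): one output vector per position
```

Note the `(n, n)` score matrix: every token attends to every other in a single matrix multiply, which is exactly what removes the step-by-step recurrence of RNNs.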
The architecture consists of stacked blocks, each containing a multi-head self-attention layer, a position-wise feed-forward network, residual connections, and layer normalisation. The original 2017 paper described an encoder-decoder for machine translation; the major 2018+ descendants specialised this — **encoder-only** (BERT, RoBERTa) for classification and embedding, **decoder-only** (GPT family, Llama, Claude) for autoregressive generation, **encoder-decoder** (T5, BART) for seq-to-seq tasks. Modern LLMs are overwhelmingly decoder-only. Vision transformers (ViT), multimodal models (CLIP, Flamingo), and diffusion backbones (DiT) all take the core block and re-apply it to non-text inputs.
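The block structure listed above (self-attention, position-wise feed-forward, residual connections, layer normalisation) composes like this. A minimal sketch in NumPy using the pre-norm arrangement common in modern decoder-only models; sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, d_ff = 4, 8, 32                          # toy sizes: 4 tokens, width 8, FFN width 32

def layer_norm(x, eps=1e-5):                   # normalise each position's feature vector
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def attention(x):                              # stand-in single-head self-attention
    scores = x @ x.T / np.sqrt(d)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    return (w / w.sum(-1, keepdims=True)) @ x

W1 = rng.standard_normal((d, d_ff))
W2 = rng.standard_normal((d_ff, d))

def ffn(x):                                    # position-wise feed-forward network
    return np.maximum(x @ W1, 0) @ W2          # linear -> ReLU -> linear, same at every position

def transformer_block(x):                      # pre-norm residual block
    x = x + attention(layer_norm(x))           # sub-layer 1: self-attention + residual
    return x + ffn(layer_norm(x))              # sub-layer 2: feed-forward + residual

out = transformer_block(rng.standard_normal((n, d)))
print(out.shape)                               # (4, 8): shape is preserved, so blocks stack
```

Because each block maps `(n, d)` to `(n, d)`, dozens of identical blocks can be stacked, which is what "encoder-only", "decoder-only", and "encoder-decoder" variants all share.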
For APAC mid-market teams, the architecture itself is rarely a decision point — you inherit it from whichever foundation model you build on. What matters is knowing enough to make the decisions that sit on top: context length (which is a Transformer property, constrained by attention's quadratic memory cost), inference cost (dominated by attention during decoding), fine-tuning strategies (which layers to adapt), and quantisation behaviour (attention layers tolerate quantisation differently from feed-forward layers).
The non-obvious operational note: **attention is the bottleneck for long-context workloads**. Naive attention is O(n²) in sequence length, which is why million-token context windows required architectural and systems innovations (FlashAttention, which avoids materialising the full n² score matrix; sliding-window attention; state-space hybrids). Mixture-of-experts variants, by contrast, cut feed-forward compute per token rather than attention cost. If your workload depends on long context, benchmark real latency on your actual context distribution — headline context limits rarely match in-the-wild performance at full length.
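The quadratic cost is easy to make concrete with back-of-envelope arithmetic. A sketch of the naive attention score matrix size per layer, assuming fp16 (2 bytes per element) and an illustrative 32-head model; real deployments vary and fused kernels such as FlashAttention never materialise this matrix:

```python
def attn_matrix_gib(n_tokens, n_heads=32, bytes_per_elem=2):
    """Naive (n, n) attention score matrix size per layer, in GiB.

    Head count and fp16 width are illustrative assumptions, not fixed properties.
    """
    return n_tokens ** 2 * n_heads * bytes_per_elem / 2 ** 30

for n in (4_096, 32_768, 131_072):
    print(f"{n:>7} tokens -> {attn_matrix_gib(n):,.1f} GiB per layer")
# Memory grows 64x when the context grows 8x: that is the O(n^2) wall.
```

At 4,096 tokens the matrix is about 1 GiB per layer under these assumptions; at 131,072 tokens it is roughly a thousand times larger, which is why naive attention cannot simply be scaled to long context.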