AIMenta
Acronym foundational · Hardware & Infrastructure

GPU

Graphics Processing Unit — massively parallel hardware that powers virtually all modern AI training and most inference workloads.

A GPU (Graphics Processing Unit) is a massively parallel processor originally designed for graphics rendering that, over roughly 2009–2012, became the default compute substrate for deep learning. The architecture is SIMT (Single Instruction, Multiple Threads): thousands of simple cores executing the same operation across different data, organised into warps, blocks, and grids, with a memory hierarchy spanning registers, shared memory, L1/L2 cache, and HBM/VRAM. Modern AI GPUs add **tensor cores** — specialised matrix-multiply-and-accumulate units that deliver order-of-magnitude speedups over general SIMT execution for the matmuls at the heart of neural networks. NVLink and NVSwitch interconnects stitch GPUs into multi-GPU nodes with terabytes-per-second bandwidth, and InfiniBand or RoCE scales further to cluster-scale training.
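The SIMT model above can be made concrete with the standard CUDA index arithmetic. The sketch below emulates a one-dimensional kernel launch in plain Python — the names mirror CUDA's built-ins (`blockIdx`, `blockDim`, `threadIdx`), and the SAXPY kernel is just a stand-in for any elementwise operation:

```python
# Minimal SIMT sketch: a "kernel" runs once per thread, and the
# grid/block layout maps each thread to one element of the data.

def launch(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D CUDA launch: grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global thread id, exactly as a CUDA kernel computes it:
            # blockIdx.x * blockDim.x + threadIdx.x
            kernel(block_idx * block_dim + thread_idx, *args)

def saxpy_kernel(i, a, x, y, out):
    """One thread computes one element of a*x + y (bounds-checked,
    since the grid may overshoot the data length)."""
    if i < len(out):
        out[i] = a * x[i] + y[i]

n = 10
x = list(range(n))
y = [1.0] * n
out = [0.0] * n
block = 4
grid = (n + block - 1) // block  # ceil-division: enough blocks to cover n
launch(saxpy_kernel, grid, block, 2.0, x, y, out)
# out[i] == 2*i + 1 for each i
```

On real hardware the inner loop is what runs in parallel across thousands of cores; the bounds check and the ceil-division grid sizing are the same idioms a production CUDA kernel uses.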

The 2026 GPU landscape is tiered. **NVIDIA H100 / H200** are the mainstream training workhorses, widely available in cloud and on-prem, with FP8 tensor-core support. **NVIDIA B100 / B200 / GB200 (Blackwell)** are the frontier generation, deployed in 2024-25, with FP8/FP4 and faster NVLink. **NVIDIA A100** remains in use for workloads that don't need Hopper features. **AMD MI300X** is the credible alternative with strong HBM capacity and memory bandwidth but a smaller software ecosystem (ROCm). **Consumer GPUs (RTX 4090, 5090)** remain viable for small-model development but hit hard VRAM limits. **Cloud availability** varies: AWS (H100/H200/B-series on p5/p6 instances), Azure (ND-series), GCP (A3/A4), and CoreWeave / Lambda / Nebius as specialist GPU clouds.
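The tiers above can be reduced to a simple VRAM lookup. The sketch below uses only the capacities cited in this entry (not exhaustive SKU data), and `smallest_card_that_fits` is a hypothetical helper that treats smaller VRAM as a proxy for lower cost:

```python
# Per-card VRAM in GB, taken from the tiers discussed above.
VRAM_GB = {
    "RTX 4090": 24,
    "A100": 80,      # 80 GB variant
    "H100": 80,
    "H200": 141,
    "B200": 192,
    "MI300X": 192,
}

def smallest_card_that_fits(required_gb):
    """Return the card with the least VRAM that still fits the workload,
    or None if nothing on the list is large enough."""
    fits = [(gb, name) for name, gb in VRAM_GB.items() if gb >= required_gb]
    return min(fits)[1] if fits else None

smallest_card_that_fits(100)  # -> "H200"
```

A workload needing more than 192 GB on a single card returns `None` — the signal to shard across GPUs rather than shop for a bigger one.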

For APAC mid-market teams, the pragmatic rule is **rent, don't buy, until utilisation justifies ownership**. An H100 purchased for a two-year project that only uses it 30% of the time is substantially more expensive than the equivalent rental. Buy or long-lease only when sustained utilisation exceeds ~80% over an 18-month horizon, or when data-residency requirements rule out managed cloud. Within the rental tier: AWS / Azure / GCP for integrated cloud workflow, CoreWeave / Lambda / Nebius for pure GPU price-performance, and reserved capacity for cost optimisation once demand stabilises.
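The rent-vs-buy rule of thumb is just a break-even comparison. A minimal sketch, with all prices as illustrative placeholders rather than quoted market rates:

```python
def buy_beats_rent(purchase_cost, hosting_per_hour, rent_per_hour,
                   utilisation, horizon_hours):
    """Compare the total cost of owning vs renting over a horizon.

    Owning pays hosting/power for every hour the card exists;
    renting pays only for the hours actually utilised.
    All prices are placeholder assumptions, not market quotes.
    """
    own = purchase_cost + hosting_per_hour * horizon_hours
    rent = rent_per_hour * utilisation * horizon_hours
    return own < rent

# 18-month horizon, matching the rule of thumb above.
hours = 18 * 30 * 24  # ~12,960 hours

# Placeholder inputs: $25k card, $0.30/h hosting, $3/h on-demand rental.
buy_beats_rent(25_000, 0.30, 3.0, 0.30, hours)  # 30% utilisation -> False
buy_beats_rent(25_000, 0.30, 3.0, 0.85, hours)  # 85% utilisation -> True
```

The crossover sits where utilisation covers the fixed purchase cost — with these placeholder numbers, roughly 75% sustained utilisation, which is why the ~80% threshold in the text is a sensible margin.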

The non-obvious failure mode is **assuming consumer GPUs scale**. A team prototypes on a 24 GB RTX 4090, everything works, then it tries to run the same model at production scale and hits a VRAM wall — the H100 has 80 GB, the H200 has 141 GB, the B200 has 192 GB, and production-size models simply don't fit on consumer cards. Plan hardware headroom for 2-3× the prototype size, especially for LLM workloads where KV-cache memory during inference often dominates the model weights. VRAM is the bottleneck in 2026, not compute.
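The weights-plus-KV-cache arithmetic behind that wall fits in a few lines. A rough sketch, using the standard KV-cache formula (2 tensors × layers × heads × head dim × context × batch) for a hypothetical 70B-parameter configuration — it deliberately ignores activations, fragmentation, and runtime overhead, so real requirements are higher:

```python
def vram_estimate_gb(params_b, kv_layers, kv_heads, head_dim,
                     context_len, batch, bytes_per=2):
    """Rough inference VRAM: weights + KV cache, both at 16-bit precision.

    KV cache = 2 (K and V) * layers * heads * head_dim * context * batch.
    Illustrative only: omits activations, fragmentation, and overhead.
    """
    weights = params_b * 1e9 * bytes_per
    kv = 2 * kv_layers * kv_heads * head_dim * context_len * batch * bytes_per
    return (weights + kv) / 1e9  # decimal GB

# Hypothetical config: 70B params, 80 layers, 8 KV heads (grouped-query
# attention), head_dim 128, 32k context, batch size 8.
need = vram_estimate_gb(70, 80, 8, 128, 32_768, 8)
# ~226 GB: 140 GB of weights plus ~86 GB of KV cache -- more than a
# single B200, and nearly 10x what the 24 GB prototyping card offers.
```

Note that the KV cache scales linearly with context length and batch size while the weights stay fixed, which is exactly why inference memory, not the checkpoint size, sets the hardware floor.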

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
