Category · 9 terms
Hardware & Infrastructure, defined clearly.
GPUs, TPUs, accelerators, inference engines, and the silicon under it all.
CUDA
NVIDIA's parallel computing platform and programming model that exposes GPU compute to general-purpose code — the de facto standard for GPU programming in AI.
Distributed Training
Splitting model training across multiple GPUs or nodes — required for any model too large to fit on a single accelerator, or any training run too long to finish on one.
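The most common scheme is data parallelism: each worker computes gradients on its own shard of the batch, then gradients are averaged across workers (an "all-reduce") so every replica applies the same update. A minimal pure-Python sketch, with made-up worker gradients standing in for real backprop output:

```python
# Each inner list is one worker's gradient vector for the same parameters.
# (Illustrative values; real frameworks use NCCL/MPI collectives instead.)
per_worker_grads = [
    [0.5, -0.25, 0.75],   # worker 0's gradients from its data shard
    [0.25, 0.25, 0.25],   # worker 1's gradients from its data shard
]

def all_reduce_mean(grads):
    # Average element-wise across workers — the effect of an all-reduce.
    n = len(grads)
    return [sum(col) / n for col in zip(*grads)]

avg = all_reduce_mean(per_worker_grads)  # [0.375, 0.0, 0.5]
```

After the averaging step, every worker holds identical gradients, so the replicas stay in sync without ever exchanging raw training data.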
FP8
A pair of 8-bit floating-point formats (E4M3, E5M2) that enable faster training and inference with minimal accuracy loss on modern AI accelerators.
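The two formats trade mantissa bits for exponent bits: E4M3 offers more precision, E5M2 more dynamic range. A sketch computing each format's largest finite value from its bit layout, assuming the common OCP FP8 conventions (E4M3 reserves only the all-ones pattern for NaN and has no infinities; E5M2 reserves its top exponent IEEE-style); the function name is illustrative:

```python
def fp8_max(exp_bits, man_bits, ieee_like):
    # Largest finite value representable by a sign/exponent/mantissa format.
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # E5M2 style: the top exponent field encodes inf/NaN, so the
        # largest usable exponent is one below it; mantissa can be all ones.
        e_max = (2 ** exp_bits - 2) - bias
        frac = 2 - 2 ** -man_bits
    else:
        # E4M3 style: the top exponent is usable, but the all-ones
        # mantissa at that exponent is NaN, so stop one mantissa step short.
        e_max = (2 ** exp_bits - 1) - bias
        frac = 2 - 2 ** -(man_bits - 1)
    return frac * 2 ** e_max

print(fp8_max(4, 3, ieee_like=False))  # E4M3 → 448.0
print(fp8_max(5, 2, ieee_like=True))   # E5M2 → 57344.0
```

The roughly 100x larger range of E5M2 is why it is often used for gradients, while E4M3's extra mantissa bit suits weights and activations.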
GPU
Graphics Processing Unit — massively parallel hardware that powers virtually all modern AI training and most inference workloads.
Inference (Serving)
Running a trained model to produce predictions on new data — the production workload that dominates AI cost and latency after training completes.
Mixed-Precision Training
Training a neural network using a combination of higher precision (FP32 master weights) and lower precision (FP16/BF16 compute) to gain speed without sacrificing convergence.
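The FP32 master copy matters because small gradient updates can be smaller than FP16's rounding step and vanish entirely. A pure-Python sketch that emulates FP16 storage via the standard library's half-precision `struct` format `'e'` (the helper name is illustrative):

```python
import struct

def to_fp16(x):
    # Round a Python float to the nearest representable FP16 value
    # by packing and unpacking it as a half-precision float.
    return struct.unpack('e', struct.pack('e', x))[0]

update = 1e-4  # a small gradient step, below FP16's spacing near 1.0

# If the weight itself lives in FP16, the update rounds away...
w_fp16 = to_fp16(to_fp16(1.0) + update)   # back to exactly 1.0

# ...but an FP32 master weight accumulates it.
w_fp32 = 1.0 + update                      # 1.0001, update retained
```

This is why mixed-precision training keeps FP32 master weights (and often a loss scale) even though the forward and backward passes run in FP16/BF16.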
Quantization
Compressing a neural network by representing weights and activations in lower-precision integer formats (INT8, INT4) — typically applied at inference time to reduce memory and latency.
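In the simplest symmetric per-tensor scheme, a single scale maps real values onto the signed INT8 range. A minimal sketch with illustrative helper names (production toolkits add calibration, per-channel scales, and zero-points):

```python
def quantize_int8(values):
    # One scale for the whole tensor: map the largest magnitude to 127.
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    # Recover approximate real values from the stored integers.
    return [x * scale for x in q]

weights = [0.02, -1.27, 0.5, 1.0]
q, scale = quantize_int8(weights)   # integers fit in one byte each
approx = dequantize(q, scale)       # close to the original weights
```

Each weight now needs 1 byte instead of 4, at the cost of a bounded rounding error of at most half a quantization step.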
Tensor Core
Specialized matrix-multiplication units inside NVIDIA GPUs (Volta and later) that deliver order-of-magnitude speedups for AI workloads in lower precisions.
TPU
Tensor Processing Unit — Google's custom AI accelerator chip, used in Google Cloud and to train Google's own models including Gemini.