
Mixed-Precision Training

Training a neural network using a combination of higher precision (FP32 master weights) and lower precision (FP16/BF16 compute) to gain speed without sacrificing convergence.

Mixed-precision training computes forward passes and gradients in lower-precision formats (FP16 or BF16) while maintaining a master copy of the weights in FP32 for accumulation and updates. The speedup comes from modern AI accelerators that dedicate specialised hardware (tensor cores on NVIDIA GPUs, matrix units on TPUs) to low-precision matrix multiplies: FP16 and BF16 run 2-4× faster than FP32 on the same silicon, and FP8 on Hopper/Blackwell adds roughly another 2×. Memory footprint also roughly halves, which lets teams fit larger batch sizes or larger models on the same GPU. The trade-off is numerical: FP16's narrow dynamic range causes gradient underflow on small values without careful handling.
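The core recipe above — low-precision compute, FP32 master weights — can be sketched in a few lines. This is a toy illustration using numpy (standing in for framework tensors); the model, values, and learning rate are hypothetical, and real frameworks handle the casting automatically.

```python
import numpy as np

# One mixed-precision SGD step on a toy linear model.
rng = np.random.default_rng(0)
master_w = rng.standard_normal(4).astype(np.float32)  # FP32 master weights
x = rng.standard_normal(4).astype(np.float32)
lr = np.float32(0.1)

# Forward pass and gradient in low precision (FP16 here, standing in
# for the tensor-core compute path).
w16, x16 = master_w.astype(np.float16), x.astype(np.float16)
y = np.dot(w16, x16)                     # low-precision matmul
grad16 = np.float16(2.0) * y * x16       # gradient of y**2 w.r.t. w, in FP16

# The update accumulates in FP32 against the master copy, so small
# increments are not lost to FP16 rounding.
master_w -= lr * grad16.astype(np.float32)
```

The key design point is that the weights the optimizer sees never leave FP32; only the throwaway compute copies are cast down.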

The 2026 landscape has converged on a few patterns. **BF16** (bfloat16, 8-bit exponent, 7-bit mantissa) has effectively replaced FP16 for most training because its FP32-identical exponent range removes the need for loss scaling. **FP16** persists for inference and for hardware where BF16 is unavailable. **FP8** (E4M3 forward, E5M2 backward) is the emerging frontier, supported natively on H100/H200/B100/B200 with NVIDIA's Transformer Engine managing precision transitions automatically. Framework support is seamless — `torch.amp`, `jax.lax.Precision`, `tf.keras.mixed_precision` all make enabling mixed precision a one-line change. The deprecated NVIDIA Apex library is no longer needed.
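In PyTorch, the "one-line change" is wrapping the forward pass in an autocast context. A minimal sketch (shown on CPU so it runs anywhere; on a GPU you would pass `device_type="cuda"`, and the model, data, and learning rate here are placeholders):

```python
import torch

# Tiny placeholder model and batch; parameters are created in FP32.
model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(8, 16), torch.randn(8, 1)

opt.zero_grad()
# BF16 autocast: eligible ops (e.g. matmuls) run in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()   # gradients flow back to the FP32 parameters
opt.step()        # optimizer updates the FP32 weights as usual
```

With BF16 no gradient scaler is needed; FP16 on CUDA would pair the autocast context with `torch.amp.GradScaler` to apply loss scaling.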

For APAC mid-market teams, the practical guidance is **enable mixed precision on every training run using A100, H100, H200, B100/B200, or TPU v4+**. There is no reason not to — the speedup is substantial (typically 2-3×), the memory saving is material, and the quality impact with BF16 is indistinguishable from FP32 for the vast majority of workloads. The one-line enablement means there is no engineering cost. For inference, mixed precision or outright quantization to FP16/INT8 is now the default.

The non-obvious failure mode is **overflow from unscaled gradients in FP16**. When FP16 values exceed ~65,504, they overflow to infinity, corrupting the gradient; loss scaling (multiply the loss by a scalar before the backward pass, unscale the gradients after) was the traditional fix. BF16 sidesteps this because its exponent range matches FP32. Teams still using FP16 on older hardware need to configure loss scaling (dynamic or static) carefully; BF16 teams can skip the problem entirely. If you see NaN losses partway through training, the first debugging step is to check the precision configuration.
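Both FP16 failure modes — underflow of small gradients and overflow past ~65,504 — and the loss-scaling fix are easy to demonstrate with numpy (standing in for framework tensors; the scale factor 1024 is an arbitrary power of two, as used in static loss scaling):

```python
import numpy as np

tiny_grad = 1e-8                        # a gradient FP16 cannot represent

naive = np.float16(tiny_grad)           # underflows: rounds to 0.0

scale = 1024.0
scaled = np.float16(tiny_grad * scale)  # survives in FP16's subnormal range
restored = np.float32(scaled) / scale   # unscale in FP32 before the update
# restored is approximately 1e-8 again

# The symmetric failure: values above ~65,504 overflow to infinity,
# which is what poisons the gradients and produces NaN losses.
overflowed = np.float16(70000.0)        # -> inf
```

Dynamic loss scaling automates the choice of `scale`: grow it while gradients stay finite, halve it and skip the step whenever an inf/NaN appears.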

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
