AIMenta
Acronym advanced · Hardware & Infrastructure

FP8

8-bit floating-point number formats (E4M3, E5M2) that enable faster training and inference at minimal accuracy loss on modern AI accelerators.

FP8 is an 8-bit floating-point family with two formats, proposed in the 2022 NVIDIA/Arm/Intel FP8 paper and later standardized in the OCP OFP8 specification: **E4M3** (4-bit exponent, 3-bit mantissa), which trades dynamic range for precision and is used for forward activations and weights; and **E5M2** (5-bit exponent, 2-bit mantissa), which trades precision for range and is used for backward gradients, where very small values must remain representable. The point of FP8 is to halve memory traffic and roughly double throughput relative to FP16/BF16 on hardware that supports it natively. Modern frontier training at 2026 scale relies on FP8 for a roughly 2× speedup over BF16 — which, at fixed wall-clock time, translates into needing about half as many GPUs for a run.
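The two formats can be decoded from raw bit patterns in a few lines of Python. This is a sketch assuming the OCP OFP8 conventions — E4M3 uses bias 7 and a maximum finite value of 448; E5M2 uses bias 15 and a maximum finite value of 57344 — and it ignores the special encodings (NaN/Inf):

```python
def decode_fp8(byte, exp_bits, mant_bits, bias):
    """Decode one FP8 byte into a float (special values NaN/Inf not handled)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> mant_bits) & ((1 << exp_bits) - 1)
    mant = byte & ((1 << mant_bits) - 1)
    if exp == 0:  # subnormal: no implicit leading 1, fixed exponent of 1 - bias
        return sign * (mant / (1 << mant_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mant / (1 << mant_bits)) * 2.0 ** (exp - bias)

# E4M3 (bias 7): largest finite value is 0 1111 110 -> 1.75 * 2^8 = 448
e4m3_max = decode_fp8(0b0_1111_110, exp_bits=4, mant_bits=3, bias=7)
# E5M2 (bias 15): largest finite value is 0 11110 11 -> 1.75 * 2^15 = 57344
e5m2_max = decode_fp8(0b0_11110_11, exp_bits=5, mant_bits=2, bias=15)
```

The ~128× gap between the two maxima is the range/precision trade the paragraph above describes: E5M2 spends its extra exponent bit on range for gradients, E4M3 spends it on mantissa resolution for activations and weights.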

The 2026 landscape is defined by hardware and tooling. **NVIDIA Hopper (H100, H200)** was the first mainstream GPU generation with native FP8 tensor cores. **NVIDIA Blackwell (B100, B200, GB200)** refined FP8 performance and added FP4 for inference. **NVIDIA Transformer Engine**, an open-source library for PyTorch and JAX, handles automatic FP8 casting, scaling-factor management, and fallback to higher precision for layers that are unstable in FP8. Competing accelerators (**AMD MI300X**, **Intel Gaudi 3**) have added FP8 support with varying maturity. Google's TPU v5+ emphasizes BF16 and INT8 over FP8; its stack reaches similar throughput through different casting choices.
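Transformer Engine's scaling-factor management follows a delayed-scaling pattern: the scale applied at each step comes from a rolling history of previously observed absolute maxima, not from the current tensor. A minimal pure-Python sketch of the idea (class name, history length, and list-based tensors are illustrative, not TE's actual API):

```python
from collections import deque

E4M3_MAX = 448.0  # largest finite E4M3 value

class DelayedScaling:
    """Sketch: the scale used now was computed from earlier steps' amaxes."""
    def __init__(self, history_len=16):
        self.amax_history = deque(maxlen=history_len)
        self.scale = 1.0

    def step(self, tensor):
        # 1. cast using the scale derived from *previous* steps, clamping
        #    anything the stale scale pushes past the FP8 range
        scaled = [max(-E4M3_MAX, min(E4M3_MAX, x * self.scale)) for x in tensor]
        # 2. record this step's absolute maximum for future scale updates
        self.amax_history.append(max(abs(x) for x in tensor))
        # 3. map the historical max onto the top of the FP8 range
        amax = max(max(self.amax_history), 1e-12)  # guard against all-zero tensors
        self.scale = E4M3_MAX / amax
        return scaled, self.scale
```

The lag is the point: computing an exact amax every step would serialize the cast behind a full reduction, while delayed scaling lets the cast run immediately at the cost of occasional clamping when magnitudes jump between steps.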

For APAC mid-market teams, FP8 is the **correct choice for training frontier models on H100/H200/B-series GPUs**, and increasingly for inference on the same hardware. The speedup accumulates over long training runs — a 2× throughput advantage on a month-long run is roughly fifteen days of GPU time and real money. For inference-only workloads, the easier-to-deploy alternative is INT8 post-training quantization (via GPTQ, AWQ, or bitsandbytes), which often matches FP8 inference quality with simpler tooling. Choose FP8 training when you have H100+ hardware and Transformer Engine integrated; choose INT8 inference when you want broader hardware compatibility.
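The INT8 alternative is mechanically simple: symmetric absmax quantization maps a tensor's largest magnitude onto ±127. A minimal sketch of the round trip — illustrative only; libraries like GPTQ and AWQ add calibration data and per-group scales on top of this:

```python
def int8_quantize(values):
    """Symmetric absmax INT8 quantization of one tensor (per-tensor scale)."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def int8_dequantize(q, scale):
    """Recover approximate floats; error per element is at most scale / 2."""
    return [x * scale for x in q]

w = [0.02, -0.5, 1.27, -1.0]
q, s = int8_quantize(w)          # q = [2, -50, 127, -100]
w_hat = int8_dequantize(q, s)    # close to w, within half a quantization step
```

Unlike FP8, the INT8 grid is uniform, so small weights near zero lose relatively more precision — one reason calibration-based methods exist, and one reason FP8's non-uniform spacing can behave better on outlier-heavy activations.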

The non-obvious failure mode is **selective instability in specific layers**. FP8 does not work cleanly for every operation — attention softmax, layer-norm statistics, and certain loss computations can overflow or lose numerical fidelity badly. The production pattern is keeping these operations in BF16 or FP32 while casting the bulk of matmuls to FP8. Transformer Engine handles most of this automatically but not always perfectly; quality regressions on specific workloads usually trace to a layer that should have been kept higher-precision. When FP8 training diverges, check which layers are being cast before assuming a data issue.
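The mixed-precision pattern above can be expressed as a per-operation policy table. This is a hypothetical sketch of the selection logic, not Transformer Engine's actual mechanism — the operation names are illustrative:

```python
# Ops where FP8 matmul throughput is worth the precision loss (illustrative names)
FP8_SAFE_OPS = {"qkv_proj", "attn_out_proj", "mlp_up", "mlp_down"}
# Numerically sensitive ops kept in higher precision, per the failure mode above
KEEP_HIGH_PRECISION = {"softmax", "layernorm", "loss"}

def precision_for(op_name):
    """Pick a compute dtype for an op; default to the training dtype (BF16)."""
    if op_name in KEEP_HIGH_PRECISION:
        return "fp32"
    if op_name in FP8_SAFE_OPS:
        return "e4m3"
    return "bf16"
```

When an FP8 run diverges, the debugging move the paragraph above suggests amounts to auditing this table: dump the effective dtype per layer and look for a sensitive op that ended up on the FP8 path.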
