
Tensor Core

Specialized matrix-multiplication units inside NVIDIA GPUs (Volta and later) that deliver order-of-magnitude speedups for AI workloads in lower precisions.

A tensor core is a specialised hardware unit inside modern NVIDIA GPUs that executes a fused matrix-multiply-and-accumulate operation in a single instruction — typically computing D = A × B + C where A and B are small tile matrices and C/D are accumulators. By fusing what would otherwise be many scalar multiply-accumulate operations and dedicating silicon to this specific pattern, tensor cores achieve throughput 4-16× higher than general CUDA cores for the same workload, at the cost of only supporting a fixed set of dtype pairs. Since the operations that dominate deep-learning compute — attention, feed-forward layers, convolutions — decompose into exactly these matrix multiplies, tensor cores have become the primary performance engine of AI GPUs.
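The fused operation can be sketched in plain Python on a small tile. This is only an illustration of the D = A × B + C pattern the paragraph describes — real tensor cores execute it on dedicated silicon in one instruction, typically with mixed precision (e.g. FP16 inputs, FP32 accumulation):

```python
def tile_mma(A, B, C):
    """Fused multiply-accumulate over one square tile: D = A @ B + C.
    A tensor core performs this whole tile operation in a single
    instruction; here it is unrolled into scalar MACs for clarity."""
    n = len(A)
    return [
        [C[i][j] + sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]  # accumulator tile
D = tile_mma(A, B, C)
print(D)  # [[20, 22], [43, 51]]
```

The throughput advantage comes precisely from not doing what this sketch does: instead of issuing n³ scalar multiply-accumulates, the hardware consumes the whole tile at once.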

The generational progression tells the story. **Volta (V100, 2017)** introduced first-generation tensor cores with FP16 inputs and FP32 accumulation. **Turing (T4, RTX 20xx)** added INT8 and INT4 modes. **Ampere (A100, 2020)** added native TF32 and BF16 support, plus structured-sparsity acceleration. **Hopper (H100, H200, 2022-24)** added FP8 tensor cores and DPX instructions for dynamic-programming workloads. **Blackwell (B100, B200, GB200, 2024-25)** added FP4 for inference and a second-generation Transformer Engine. The programming interfaces matter too: WMMA (Warp Matrix Multiply Accumulate) instructions exposed through CUDA C++, and the higher-level Transformer Engine library, which manages FP8 scaling automatically.
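The reduced-precision formats in this progression are essentially FP32 with fewer mantissa bits: TF32 keeps the FP32 exponent range but only 10 mantissa bits, and BF16 keeps the exponent range with 7. A rough sketch of that truncation in pure Python (using round-toward-zero for simplicity — the hardware's actual rounding behaviour differs):

```python
import struct

def truncate_mantissa(x, keep_bits):
    """Truncate an FP32 value to `keep_bits` mantissa bits (FP32 has 23)
    by zeroing the low-order mantissa bits (round-toward-zero)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    mask = 0xFFFFFFFF << (23 - keep_bits)
    return struct.unpack('<f', struct.pack('<I', bits & mask & 0xFFFFFFFF))[0]

def to_tf32(x):
    # TF32: FP32 exponent range, 10 mantissa bits
    return truncate_mantissa(x, 10)

def to_bf16(x):
    # BF16: FP32 exponent range, 7 mantissa bits
    return truncate_mantissa(x, 7)

x = 1.2345678
print(x, to_tf32(x), to_bf16(x))  # progressively coarser approximations
```

The trade is explicit: the formats give up precision, not range, which is why matmul-heavy training tolerates them well while the FP32 accumulator absorbs the rounding error.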

For APAC mid-market teams, the practical rule is: **if your workload isn't using tensor cores, your GPU is roughly 10× underutilised**. PyTorch, TensorFlow, and JAX engage tensor cores automatically on their default paths when the model runs in FP16, BF16, or FP8 — but only if the tensor shapes align with tensor-core tile dimensions (typically multiples of 8 or 16). Verify that training actually uses tensor cores by profiling (NVIDIA Nsight, PyTorch Profiler) and checking that the 'SM utilisation' and 'Tensor Core utilisation' metrics are both high; high SM usage with low tensor-core usage is a sign the framework is falling back to CUDA-core paths.
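The shape-alignment rule can be captured in a pair of hypothetical helpers (the names and the granularity of 8 are illustrative — the exact tile requirement varies by dtype and architecture):

```python
def aligned(dim, granularity=8):
    """True if a matmul dimension hits the tensor-core tile granularity
    (commonly multiples of 8 for FP16/BF16, 16 for INT8)."""
    return dim % granularity == 0

def pad_to_multiple(dim, granularity=8):
    """Smallest multiple of `granularity` >= dim — the usual fix is to
    pad the dimension up, run the aligned GEMM, then slice the result."""
    return -(-dim // granularity) * granularity

# A hidden dimension of 1234 misses the fast path; padding to 1240
# (or a rounder 1280) restores alignment.
print(aligned(1234))          # False
print(pad_to_multiple(1234))  # 1240
```

This is why model hidden sizes, vocabulary sizes, and sequence lengths in production configs are almost always multiples of 8, 64, or 128.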

The non-obvious failure mode is **dtype mismatch bypassing tensor cores silently**. A model declared in FP32 (or a single layer inadvertently left in FP32) runs on CUDA cores rather than tensor cores, and the user sees slow throughput without any error. This is especially common in mixed-precision settings, where one FP32 straggler layer becomes the bottleneck. Profile tensor-core utilisation, cast deliberately, and prefer frameworks (PyTorch's torch.compile, JAX's JIT) that surface dtype-related performance issues explicitly instead of degrading silently.
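A schematic of the deliberate-casting audit the paragraph recommends: given a mapping of layer name to parameter dtype (the kind of view you get by iterating a model's parameters), flag layers left in full precision. The function name and dtype strings here are illustrative, not a real framework API:

```python
def find_fp32_stragglers(layer_dtypes,
                         allowed=frozenset({'float16', 'bfloat16', 'float8'})):
    """Return layer names whose dtype will silently fall back to
    CUDA-core paths instead of tensor cores."""
    return sorted(name for name, dt in layer_dtypes.items() if dt not in allowed)

# Hypothetical mixed-precision model with one layer accidentally in FP32
model = {
    'embed':    'bfloat16',
    'attn.qkv': 'bfloat16',
    'mlp.up':   'float32',   # the silent bottleneck
    'lm_head':  'bfloat16',
}
print(find_fp32_stragglers(model))  # ['mlp.up']
```

Running a check like this after model construction (and after any checkpoint load, which can reintroduce FP32 tensors) turns a silent 10× slowdown into an explicit failure.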
