Acronym · Intermediate · Hardware & Infrastructure

CUDA

NVIDIA's parallel computing platform and programming model that exposes GPU compute to general-purpose code — the de facto language of AI research.

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel-computing platform and programming model that exposes GPU compute to general-purpose code. The programming abstraction maps onto GPU hardware cleanly: a kernel is a function that runs in parallel across many threads, threads are grouped into blocks (sharing fast on-chip memory), blocks are grouped into grids, and warps of 32 threads execute in lockstep on SIMT hardware. The memory hierarchy — thread-local registers, block-shared memory, L1/L2 cache, global HBM — determines performance; good CUDA code is memory-aware before it is compute-aware. CUDA has effectively become the de facto language of AI research because the overwhelming majority of deep-learning infrastructure (PyTorch, TensorFlow, JAX, cuDNN, cuBLAS, FlashAttention) runs on top of it.
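The thread/block/grid hierarchy above can be sketched as a minimal vector-add kernel (an illustrative example, not from any particular codebase; error checking omitted for brevity):

```cuda
// Each thread handles exactly one element of the output.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard the last partial block
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);   // unified memory: visible to host and device
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;                         // threads per block = 8 warps of 32
    int blocks  = (n + threads - 1) / threads; // grid sized to cover all n elements
    vec_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();                   // wait for the kernel to finish

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Each block here is 256 threads (eight 32-thread warps executing in lockstep), and the grid is sized so every element gets one thread, with the `i < n` guard covering the final partial block.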

The 2026 CUDA ecosystem spans several layers. The **CUDA 12.x runtime and toolkit** are the baseline. **cuDNN** (deep neural network primitives), **cuBLAS** (dense linear algebra), **cuFFT** (fast Fourier transforms), and **CUTLASS** (templated matmul library) cover the standard primitives. **Triton** (OpenAI, increasingly adopted) is a Python-like DSL that compiles to efficient GPU code — the path most new kernel work now takes rather than raw CUDA C++. **FlashAttention** kernels (Tri Dao) are the canonical example of hand-optimised CUDA, outperforming framework-default attention by 2-5×. Competing ecosystems — **AMD ROCm** for MI300X, **Intel oneAPI** for Gaudi, **Apple Metal** — exist but remain substantially behind CUDA in maturity, tooling, and library coverage.
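Leaning on the standard primitives rather than hand-writing kernels usually reduces to a single library call. A hedged sketch of a cuBLAS matrix multiply (`gemm` is an illustrative wrapper name; it assumes `d_A`, `d_B`, `d_C` are already allocated on the device, and note cuBLAS uses column-major layout):

```cuda
#include <cublas_v2.h>

// Sketch: C = alpha * A * B + beta * C on the GPU via cuBLAS.
// A is m x k, B is k x n, C is m x n, all column-major device pointers.
void gemm(cublasHandle_t handle, const float* d_A, const float* d_B,
          float* d_C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,  // no transpose on A or B
                m, n, k,
                &alpha,
                d_A, m,     // lda = m (leading dimension, column-major)
                d_B, k,     // ldb = k
                &beta,
                d_C, m);    // ldc = m
}
```

This is the design point of the primitive layer: the heavily tuned matmul lives inside cuBLAS, and application code never touches thread indices or shared memory.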

For APAC mid-market teams, the practical advice is **do not write raw CUDA unless you are writing custom kernels**. PyTorch with torch.compile, cuDNN, cuBLAS, and FlashAttention cover roughly 95% of workloads at performance close to what hand-tuning produces. Triton is the right tool when you genuinely need custom kernels — for fused operations, novel attention variants, or quantisation-aware primitives — and its learning curve is far gentler than raw CUDA C++. Hiring CUDA experts is expensive and justified only when your bottleneck is genuinely custom-kernel-shaped.
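What a custom "fused operation" buys — one pass over global memory instead of several — can be sketched in a few lines, whether the kernel is ultimately written in Triton or raw CUDA (a hypothetical bias-add + ReLU kernel; launch and allocation code omitted):

```cuda
// Fused bias-add + ReLU: one read and one write per element.
// Unfused, this would be two separate kernels and two full
// round trips through global HBM for the intermediate tensor.
__global__ void bias_relu(float* x, const float* bias, int rows, int cols) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rows * cols) {
        float v = x[i] + bias[i % cols];  // broadcast bias across rows
        x[i] = v > 0.0f ? v : 0.0f;       // ReLU applied in the same pass
    }
}
```

Because kernels like this are memory-bound, halving the global-memory traffic roughly halves the runtime — which is why fusion is the most common reason to drop below the framework layer at all.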

The non-obvious failure mode is **CUDA lock-in**. Code written against cuDNN-specific call signatures, CUDA runtime features, or NVIDIA-specific hardware idioms doesn't port to ROCm or other ecosystems without a material rewrite. For APAC teams concerned about vendor concentration (especially given export controls and GPU supply volatility), writing in higher-level frameworks (PyTorch, JAX) that abstract CUDA-specific details preserves optionality. Teams that embed raw CUDA deeply into their stack have less flexibility to move to AMD or custom silicon when economics or policy shift.
