AIMenta

Distributed Training

Splitting model training across multiple GPUs or nodes, required whenever the model is too large, or the training run too slow, for a single accelerator.

Distributed training splits a neural-network training run across multiple GPUs or nodes, either because the model is too large to fit on one accelerator, because training on one would take prohibitively long, or both. Four orthogonal parallelism strategies are combined in practice. **Data parallelism** replicates the model on every device and splits each batch across them: the simplest approach, viable as long as the model and optimiser state fit on a single device. **Tensor parallelism** splits individual matrix operations across devices: effective within a node over NVLink, far less so across nodes. **Pipeline parallelism** splits layers across devices: it reduces per-device memory but introduces pipeline bubbles (idle time while stages wait on each other). **Sequence parallelism** splits activations along the sequence dimension, which is increasingly important for long-context LLM training. Frontier-scale training uses all four simultaneously (3D/4D parallelism), orchestrated carefully so that communication does not dominate compute.
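The synchronisation step at the heart of data parallelism can be sketched without any framework: each rank computes gradients on its own slice of the batch, then all ranks average them so every replica applies the identical update. A toy illustration in plain Python (real systems do this with NCCL collectives over the GPU interconnect):

```python
# Toy illustration of the data-parallel synchronisation step, not a real
# NCCL all-reduce: each rank computes gradients on its own slice of the
# global batch, then all ranks average them element-wise.

def all_reduce_mean(per_rank_grads):
    """Element-wise mean across ranks; what NCCL all-reduce(SUM) followed
    by division by world_size achieves on real hardware."""
    world_size = len(per_rank_grads)
    num_params = len(per_rank_grads[0])
    return [sum(g[i] for g in per_rank_grads) / world_size
            for i in range(num_params)]

# Two "GPUs", each holding the gradient of the same 3-parameter model
# computed on its own half of the global batch:
grads_rank0 = [0.25, -0.5, 1.0]
grads_rank1 = [0.75, -0.5, 0.0]
synced = all_reduce_mean([grads_rank0, grads_rank1])
print(synced)  # [0.5, -0.5, 0.5] -- every rank now applies the same update
```

For a mean-reduced loss over equal-sized slices, this averaged gradient equals the gradient of the full batch, which is why the replicas stay in lockstep.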

The 2026 tooling landscape has stabilised. **PyTorch DDP** is the baseline for data parallelism. **PyTorch FSDP** (Fully Sharded Data Parallel) shards optimiser state, gradients, and weights — the right default for models that exceed single-GPU memory up to roughly 30B parameters. **DeepSpeed ZeRO** (stages 1/2/3) covers similar ground with different ergonomics and is strong for training on mixed hardware. **Megatron-LM** and **NeMo** (NVIDIA) handle tensor and pipeline parallelism for frontier-scale training. **JAX pmap / pjit / shard_map** are the corresponding primitives in the JAX ecosystem, preferred on TPU. High-level wrappers — **PyTorch Lightning**, **Hugging Face Accelerate**, **Composer** — abstract the common cases.
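The memory win behind FSDP and ZeRO stage 3 is easy to sketch without a framework: rather than every rank holding full copies of the parameters, gradients, and optimiser state, each rank persistently owns a 1/world_size shard, and full parameters exist only transiently around each layer's compute. A toy illustration (real implementations shard flat tensors and overlap the all-gather with compute):

```python
# Toy sketch of the ZeRO-3 / FSDP idea, framework-free: each rank keeps a
# persistent 1/world_size shard of the flat parameter list and materialises
# the full parameters only transiently, via an all-gather, around each
# layer's forward and backward pass.

def shard(flat_params, world_size):
    """Split flat parameters into world_size contiguous shards."""
    per = -(-len(flat_params) // world_size)  # ceiling division
    return [flat_params[i * per:(i + 1) * per] for i in range(world_size)]

def all_gather(shards):
    """Reassemble the full parameter list from every rank's shard."""
    return [p for s in shards for p in s]

params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
shards = shard(params, world_size=3)   # each rank persistently stores 2 of 6
assert all_gather(shards) == params    # the gather restores the full weights
print(len(shards[0]))                  # 2, i.e. 1/3 of the model per rank,
                                       # plus only that shard's gradient and
                                       # optimiser state
```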

For APAC mid-market teams, the practical progression is **single-GPU → DDP → FSDP → 3D parallelism**. Most mid-market workloads never need to go past DDP or FSDP: the models behind typical business use cases (fine-tuning 7B–70B LLMs, training domain-specific embedders, tuning CV models) fit on a single H100/H200 or on a small multi-GPU node under FSDP. 3D parallelism is for teams training frontier-scale models from scratch, a small fraction of mid-market AI work. Most teams benefit more from better data and evaluation than from chasing larger training runs.
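The first step of that progression, single-GPU → DDP, is largely a launcher change once the training script calls `init_process_group` and wraps its model in `DDP`. A sketch using `torchrun`, assuming a script named `train.py` (the script name and rendezvous endpoint are placeholders):

```shell
# Single node, one process per local GPU (plain DDP):
torchrun --standalone --nproc-per-node=8 train.py

# Two nodes: run the same command on each node, pointing both
# at a shared rendezvous endpoint.
torchrun --nnodes=2 --nproc-per-node=8 \
    --rdzv-backend=c10d --rdzv-endpoint=node0.example.com:29400 \
    train.py
```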

The non-obvious failure mode: **debugging distributed training is roughly 10× harder than debugging single-GPU training**. A bug that manifests only under FSDP with a specific shard configuration, only on the third node, only after hour 12 of a 48-hour run, can cost a team days or weeks. Budget time accordingly: start every distributed project with aggressive single-node testing, and scale up only once the training loop is proven. Standard PyTorch distributed debugging tools (NCCL logs, torchrun verbose output, Nsight Compute) are necessary but not always sufficient; defensive logging and checkpoint-resume infrastructure save expensive re-runs.
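A minimal defensive baseline, assuming a torchrun-launched PyTorch job (the script name is a placeholder): turn the diagnostic switches on before the long run starts, not after the first hang.

```shell
# Surface NCCL initialisation and collective traffic in the logs.
export NCCL_DEBUG=INFO
# Make torch.distributed validate and log collective mismatches across ranks.
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# Turn desynchronised collectives into prompt errors instead of silent hangs.
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1

# Keep a copy of every rank's output for post-mortems.
torchrun --standalone --nproc-per-node=8 train.py 2>&1 | tee run.log
```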
