AIMenta
intermediate · Deep Learning

Backpropagation

The algorithm that computes gradients through a neural network by the chain rule — the engine that makes deep learning possible.

Backpropagation is the algorithm that computes the gradients of a loss function with respect to every parameter in a neural network by applying the chain rule of calculus layer by layer, working backward from the output. Its 1986 popularisation by Rumelhart, Hinton, and Williams made training multi-layer networks practical for the first time and effectively created the modern neural-network field. Every deep-learning optimiser — SGD, Adam, AdamW, Lion — is a variant of gradient descent that consumes gradients computed via backpropagation. Without it, nothing at scale in contemporary AI would exist.
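
The chain-rule mechanics can be worked by hand on a toy model. A minimal sketch, assuming a two-layer linear "network" on scalars with a squared-error loss (all values here are made up for illustration):

```python
# Backprop by hand on scalars: forward pass, then chain rule backward.
x, target = 2.0, 1.0      # hypothetical input and label
w1, w2 = 0.5, -1.5        # hypothetical weights

# Forward pass: compute prediction and loss.
h = w1 * x                # hidden activation (stored for the backward pass)
y = w2 * h                # prediction
loss = (y - target) ** 2

# Backward pass: apply the chain rule, output to input.
dL_dy = 2 * (y - target)  # gradient of loss w.r.t. prediction
dL_dw2 = dL_dy * h        # gradient for the output weight
dL_dh = dL_dy * w2        # gradient signal propagated to the hidden layer
dL_dw1 = dL_dh * x        # gradient for the input weight
```

Note the pattern: each weight's gradient reuses the gradient signal flowing back from the layer above, multiplied by the locally stored activation — which is exactly why intermediate activations must be kept around.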

Mechanically, backpropagation requires a **forward pass** (compute predictions and loss), a **backward pass** (propagate gradient signal from the loss back through every operation), and careful **bookkeeping** of intermediate activations (needed to compute gradients). Modern autograd frameworks — PyTorch, JAX, TensorFlow — automate the bookkeeping via computational-graph tracking. You write the forward pass as ordinary tensor operations; the framework records the graph and produces the backward pass for free. This automation is one of the most consequential software-engineering advances in the history of ML — it collapsed what was specialist work into a library call.
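
What "the framework records the graph" means can be sketched in a few dozen lines. This is a deliberately simplified toy — scalars only, two operations — and not how PyTorch or JAX are actually implemented, but it shows the core idea: each operation records its inputs and local derivatives, and `backward` replays the graph in reverse:

```python
# Toy reverse-mode autodiff: each op records (parent, local_gradient)
# pairs; backward() propagates the chain rule through the recorded graph.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents  # pairs of (parent Var, local derivative)

    def __mul__(self, other):
        # d(a*b)/da = b,  d(a*b)/db = a
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def __sub__(self, other):
        # d(a-b)/da = 1,  d(a-b)/db = -1
        return Var(self.value - other.value,
                   parents=((self, 1.0), (other, -1.0)))

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then pass it to each parent
        # scaled by the recorded local derivative (the chain rule).
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

# Forward pass is ordinary arithmetic; the graph is recorded on the fly.
w, x, t = Var(0.5), Var(2.0), Var(3.0)
err = w * x - t
loss = err * err       # squared error
loss.backward()        # backward pass "for free"
```

Real frameworks do the same thing with a topological sort instead of naive recursion (shared subgraphs would otherwise be revisited), tensors instead of scalars, and hundreds of operations instead of two.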

For APAC mid-market teams, backpropagation is a detail handled by whichever framework you use. Where it becomes operationally relevant is in **memory cost** — the activations stored for the backward pass dominate GPU memory during training, often exceeding the model parameters themselves. **Gradient checkpointing** (recompute some activations during backward instead of storing them) trades compute for memory and is the standard lever for fitting larger models. **Mixed-precision training** halves memory at little quality cost. **Gradient accumulation** simulates larger batch sizes by accumulating gradients across micro-batches before an optimiser step. Understanding these three levers is often the difference between a training run that fits on your hardware and one that does not.
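
Of the three levers, gradient accumulation is the easiest to get subtly wrong. A minimal pure-Python sketch, assuming a scalar linear model with mean-squared-error loss (the function names and data are hypothetical): the key is scaling each micro-batch gradient by `1 / accum_steps` so the accumulated gradient matches what one large batch would produce.

```python
# Gradient accumulation: several small backward passes, one optimiser step.
def grad_mse(w, batch):
    # d/dw of mean((w*x - y)^2) over the batch
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]  # toy (x, y) pairs
accum_steps = 2
micro_batches = [data[:2], data[2:]]  # two micro-batches of size 2

w, lr = 0.0, 0.1
accumulated = 0.0
for micro in micro_batches:
    # Scale each micro-batch gradient by 1/accum_steps so the sum equals
    # the mean gradient over the full batch.
    accumulated += grad_mse(w, micro) / accum_steps
w -= lr * accumulated  # one optimiser step per accum_steps micro-batches
```

In a real framework the scaling is usually done by dividing the loss by `accum_steps` before calling backward, with the optimiser step and gradient reset deferred until the last micro-batch.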

The non-obvious historical note: **backpropagation was invented multiple times before 1986**. Werbos described it in his 1974 thesis; Linnainmaa's 1970 reverse-mode automatic differentiation is the same algorithm in disguise; several other researchers arrived independently. The 1986 paper's contribution was as much presentation and context as invention. The pattern — important algorithms are often independently discovered and only later consolidated — repeats across AI history, and is worth remembering when assessing "novel" technique claims.

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
