AIMenta
intermediate · Machine Learning

Gradient Descent

The iterative optimization algorithm that trains nearly every modern ML model — adjust parameters in the direction that most reduces the loss.

Gradient descent is the iterative optimisation algorithm that trains nearly every modern machine-learning model. Given a loss function measuring how wrong the model is on the training data, gradient descent computes the partial derivative of the loss with respect to each parameter and nudges every parameter along the negative gradient (the direction of steepest loss reduction), scaled by a **learning rate**. Iterate until the loss stops improving, or stop early once validation loss starts to rise.
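The update rule above can be sketched in a few lines. This is a minimal illustration on least-squares linear regression; the synthetic data, learning rate, and step count are arbitrary choices for the example, not recommendations.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    n, d = X.shape
    w = np.zeros(d)                      # parameters start at zero
    for _ in range(steps):
        residual = X @ w - y             # model error on the training data
        grad = (2.0 / n) * X.T @ residual  # partial derivative of MSE w.r.t. each weight
        w -= lr * grad                   # step against the gradient, scaled by the learning rate
    return w

# Recover the true weights [2, -3] from noiseless synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])
w = gradient_descent(X, y)
```

With a convex quadratic loss like this, the loop converges to the closed-form solution; deep networks get the same update rule without the convexity guarantee.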

The algorithm's dominance is not because it is optimal — it is because it scales. Closed-form solutions and second-order methods (Newton's method, L-BFGS) converge in fewer steps but require operations that become intractable when parameter counts reach millions or billions. Gradient descent's update is O(parameters) per step, embarrassingly parallelisable across GPUs, and requires only first-order derivatives that modern autograd frameworks compute essentially for free.

For deep learning, vanilla gradient descent is rarely used as-is. Practitioners use **stochastic gradient descent** (mini-batch sampling), **momentum** (exponential moving average of past gradients), and **adaptive learning rates** (Adam, AdaGrad, RMSprop). Learning-rate schedules — warmup, cosine decay, step decay, OneCycle — are standard. The combined recipe (typically AdamW with a warmup + cosine schedule) has been remarkably robust across a wide range of architectures and tasks.
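A warmup + cosine schedule of the kind mentioned above is simple to write down. The sketch below is one common formulation; `peak_lr`, `warmup_steps`, and `total_steps` are illustrative values, and real training frameworks ship their own scheduler implementations.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        # linear warmup from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr down to ~0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The warmup phase keeps early updates small while optimiser statistics (e.g. Adam's moment estimates) are still noisy; the cosine tail anneals the step size so the model settles into a minimum.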

For APAC mid-market teams, the relevant mental model is: **the choice of optimiser matters less than getting the learning rate right for it**. A well-tuned SGD beats a poorly tuned Adam; a well-tuned Adam beats a well-tuned SGD on most transformer fine-tuning. Spend your tuning budget on a learning-rate search (log-scale grid of 1e-3, 3e-4, 1e-4, 3e-5, 1e-5) and a learning-rate schedule before you deliberate over optimiser choice: a learning rate that is an order of magnitude off can stall or destabilise training, whereas the wrong optimiser rarely costs more than roughly 2× slower convergence.
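That log-scale search is mechanical enough to sketch. The function below is a hypothetical stand-in: `train_and_eval` represents your own short-budget training run returning validation loss, and the toy closure at the end only exists to make the example runnable.

```python
import math

def pick_learning_rate(train_and_eval, grid=(1e-3, 3e-4, 1e-4, 3e-5, 1e-5)):
    # run a short training budget at each candidate and keep the lowest validation loss
    results = {lr: train_and_eval(lr) for lr in grid}
    return min(results, key=results.get)

# Toy stand-in for a training run: pretend loss is minimised near lr = 3e-4
toy = lambda lr: abs(math.log10(lr) - math.log10(3e-4))
best = pick_learning_rate(toy)
```

In practice each candidate gets only a fraction of the full training budget; the ordering of candidates after a short run is usually a reliable proxy for the full run.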

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
