AIMenta
intermediate · Machine Learning

Gradient Descent

The iterative optimization algorithm that trains nearly every modern ML model — adjust parameters in the direction that most reduces the loss.

Gradient descent is the iterative optimisation algorithm that trains nearly every modern machine-learning model. Given a loss function measuring how wrong the model is on the training data, gradient descent computes the partial derivative of the loss with respect to each parameter and nudges every parameter along the negative gradient (the direction of steepest loss reduction), scaled by a **learning rate**. Iterate until the loss stops improving, or stop early once validation loss starts to rise.
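The update rule above can be sketched in a few lines. This is a minimal illustration on least-squares linear regression; the synthetic data, learning rate, and step count are arbitrary choices for the example, not recommendations.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, steps=500):
    n, d = X.shape
    w = np.zeros(d)                      # parameters start at zero
    for _ in range(steps):
        residual = X @ w - y             # model error on the training data
        grad = (2.0 / n) * X.T @ residual  # partial derivative of MSE w.r.t. each weight
        w -= lr * grad                   # step against the gradient, scaled by the learning rate
    return w

# Recover the true weights [2, -3] from noiseless synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -3.0])
w = gradient_descent(X, y)
```

With a convex quadratic loss like this, the loop converges to the closed-form solution; deep networks get the same update rule without the convexity guarantee.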

The algorithm's dominance is not because it is optimal — it is because it scales. Closed-form solutions and second-order methods (Newton's method, L-BFGS) converge in fewer steps but require operations that become intractable when parameter counts reach millions or billions. Gradient descent's update is O(parameters) per step, embarrassingly parallelisable across GPUs, and requires only first-order derivatives that modern autograd frameworks compute essentially for free.

For deep learning, vanilla gradient descent is rarely used as-is. Practitioners use **stochastic gradient descent** (mini-batch sampling), **momentum** (exponential moving average of past gradients), and **adaptive learning rates** (Adam, AdaGrad, RMSprop). Learning-rate schedules — warmup, cosine decay, step decay, OneCycle — are standard. The combined recipe (typically AdamW with a warmup + cosine schedule) has been remarkably robust across a wide range of architectures and tasks.
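A warmup + cosine schedule of the kind mentioned above is simple to write down. The sketch below is one common formulation; `peak_lr`, `warmup_steps`, and `total_steps` are illustrative values, and real training frameworks ship their own scheduler implementations.

```python
import math

def lr_at(step, peak_lr=3e-4, warmup_steps=100, total_steps=1000):
    if step < warmup_steps:
        # linear warmup from ~0 up to peak_lr
        return peak_lr * (step + 1) / warmup_steps
    # cosine decay from peak_lr down to ~0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The warmup phase keeps early updates small while optimiser statistics (e.g. Adam's moment estimates) are still noisy; the cosine tail anneals the step size so the model settles into a minimum.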

For APAC mid-market teams, the relevant mental model is: **the choice of optimiser matters less than getting the learning rate right for it**. A well-tuned SGD beats a poorly tuned Adam; a well-tuned Adam beats a well-tuned SGD on most transformer fine-tuning. Spend your tuning budget on a learning-rate search (log-scale grid of 1e-3, 3e-4, 1e-4, 3e-5, 1e-5) and a learning-rate schedule before you deliberate over optimiser choice: a learning rate that is an order of magnitude off can stall or destabilise training, whereas the wrong optimiser rarely costs more than roughly 2× slower convergence.
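That log-scale search is mechanical enough to sketch. The function below is a hypothetical stand-in: `train_and_eval` represents your own short-budget training run returning validation loss, and the toy closure at the end only exists to make the example runnable.

```python
import math

def pick_learning_rate(train_and_eval, grid=(1e-3, 3e-4, 1e-4, 3e-5, 1e-5)):
    # run a short training budget at each candidate and keep the lowest validation loss
    results = {lr: train_and_eval(lr) for lr in grid}
    return min(results, key=results.get)

# Toy stand-in for a training run: pretend loss is minimised near lr = 3e-4
toy = lambda lr: abs(math.log10(lr) - math.log10(3e-4))
best = pick_learning_rate(toy)
```

In practice each candidate gets only a fraction of the full training budget; the ordering of candidates after a short run is usually a reliable proxy for the full run.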

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
