AIMenta
Acronym intermediate · Machine Learning

Stochastic Gradient Descent (SGD)

A gradient descent variant that updates parameters using a randomly-sampled subset of data per step, trading exact gradients for speed.

Stochastic Gradient Descent (SGD) computes the parameter update using a randomly-sampled mini-batch of training examples rather than the full dataset. The gradient is noisier than a full-batch gradient, but each update is orders of magnitude faster, and the noise itself acts as a regulariser that helps the optimiser escape bad local minima. The combination — fast updates plus a free regularisation effect — is why SGD and its descendants dominate deep-learning training.
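The per-step update can be sketched in a few lines of plain Python. This is a minimal illustration on a toy linear model (the function names and data are hypothetical, not from any library):

```python
import random

def sgd_step(w, b, batch, lr):
    """One SGD update for y ≈ w*x + b under mean squared error, on a mini-batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y            # prediction error on this example
        gw += 2 * err * x / len(batch)   # d(loss)/dw, averaged over the batch
        gb += 2 * err / len(batch)       # d(loss)/db
    return w - lr * gw, b - lr * gb      # move against the (noisy) gradient

# recover y = 3x + 1 from random mini-batches of 8 examples
data = [(x, 3 * x + 1) for x in [i / 10 for i in range(-20, 21)]]
w, b = 0.0, 0.0
rng = random.Random(0)
for _ in range(2000):
    w, b = sgd_step(w, b, rng.sample(data, 8), lr=0.05)
```

Each step sees only 8 of the 41 examples, so individual gradients are noisy, yet the parameters still converge to the true values.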

The modern optimiser landscape is mostly SGD variants with adaptive per-parameter learning rates. **SGD with momentum** accumulates a running average of past gradients, accelerating motion along consistent directions and damping oscillation — still the standard optimiser for vision CNNs. **Adam** and **AdamW** (with decoupled weight decay) combine momentum with per-parameter learning-rate adaptation — the default for Transformers and most NLP. **Adafactor** and **Lion** are memory-efficient variants used at the largest pretraining scales. **Shampoo** and second-order methods compete at the very largest scale, where the extra compute per step pays off in total wall-clock time.
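The momentum variant differs from plain SGD by one extra state variable per parameter. A minimal sketch, assuming the common heavy-ball formulation (names are illustrative):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """SGD with momentum: v accumulates an exponentially-decayed sum of gradients."""
    v = beta * v + grad   # consistent gradient directions reinforce; oscillations cancel
    return w - lr * v, v  # step along the accumulated direction

# on a 1-D quadratic loss 0.5 * w**2 (so grad = w), momentum drives w toward 0
w, v = 10.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=w)
```

Adam adds a second accumulator (a running average of squared gradients) and divides the step by its square root, which is what gives each parameter its own effective learning rate.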

For APAC mid-market teams, the practical advice is to take the optimiser your chosen architecture or training recipe specifies and tune the learning rate carefully. AdamW with a learning rate in the 5e-5 to 1e-4 range is a strong default for fine-tuning pretrained transformers. SGD with momentum 0.9 and a learning rate of 0.01–0.1 is a strong default for vision CNNs trained from scratch. Deviating from these defaults rarely helps before you have exhausted more impactful levers.

The non-obvious performance trap is **batch size and learning rate coupling**. Scaling the batch size up without scaling the learning rate up to match (or vice versa) changes training dynamics in subtle ways — the gradient noise that acts as effective regularisation decreases, and the model can generalise worse despite lower training loss. The usual heuristics: linear scaling for mid-sized batches, square-root scaling for large batches, and warmup schedules to survive the first few steps at a high rate.
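The heuristics above can be written as tiny schedule helpers (the base values here are illustrative, not tuned recommendations):

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Rescale the learning rate when the batch size changes from its baseline."""
    ratio = batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

def warmup_lr(step, target_lr, warmup_steps=500):
    """Ramp the rate linearly from ~0 so early noisy updates don't diverge."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

lr = scaled_lr(0.1, base_batch=256, batch=1024)   # linear rule: 4x batch → 4x rate
lrs = [warmup_lr(s, lr) for s in range(500)]      # ramps up to lr over 500 steps
```

In practice the warmup target is itself the scaled rate, as above: first decide the rate implied by your batch size, then ramp toward it.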

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
