AIMenta
Acronym intermediate · Machine Learning

Stochastic Gradient Descent (SGD)

A gradient descent variant that updates parameters using a randomly-sampled subset of data per step, trading exact gradients for speed.

Stochastic Gradient Descent (SGD) computes the parameter update using a randomly-sampled mini-batch of training examples rather than the full dataset. The gradient is noisier than a full-batch gradient, but each update is orders of magnitude faster, and the noise itself acts as a regulariser that helps the optimiser escape bad local minima. The combination — fast updates plus a free regularisation effect — is why SGD and its descendants dominate deep-learning training.
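The per-step update can be sketched in a few lines of plain Python. This is a minimal illustration on a toy linear model (the function names and data are hypothetical, not from any library):

```python
import random

def sgd_step(w, b, batch, lr):
    """One SGD update for y ≈ w*x + b under mean squared error, on a mini-batch."""
    gw = gb = 0.0
    for x, y in batch:
        err = (w * x + b) - y            # prediction error on this example
        gw += 2 * err * x / len(batch)   # d(loss)/dw, averaged over the batch
        gb += 2 * err / len(batch)       # d(loss)/db
    return w - lr * gw, b - lr * gb      # move against the (noisy) gradient

# recover y = 3x + 1 from random mini-batches of 8 examples
data = [(x, 3 * x + 1) for x in [i / 10 for i in range(-20, 21)]]
w, b = 0.0, 0.0
rng = random.Random(0)
for _ in range(2000):
    w, b = sgd_step(w, b, rng.sample(data, 8), lr=0.05)
```

Each step sees only 8 of the 41 examples, so individual gradients are noisy, yet the parameters still converge to the true values.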

The modern optimiser landscape is mostly SGD variants with adaptive per-parameter learning rates. **SGD with momentum** accumulates a running average of past gradients, accelerating motion along consistent directions and damping oscillation — still the standard optimiser for vision CNNs. **Adam** and **AdamW** (with decoupled weight decay) combine momentum with per-parameter learning-rate adaptation — the default for Transformers and most NLP. **Adafactor** and **Lion** are memory-efficient variants used at the largest pretraining scales. **Shampoo** and second-order methods compete at the very largest scale, where the extra compute per step pays off in total wall-clock time.
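The momentum variant differs from plain SGD by one extra state variable per parameter. A minimal sketch, assuming the common heavy-ball formulation (names are illustrative):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """SGD with momentum: v accumulates an exponentially-decayed sum of gradients."""
    v = beta * v + grad   # consistent gradient directions reinforce; oscillations cancel
    return w - lr * v, v  # step along the accumulated direction

# on a 1-D quadratic loss 0.5 * w**2 (so grad = w), momentum drives w toward 0
w, v = 10.0, 0.0
for _ in range(300):
    w, v = momentum_step(w, v, grad=w)
```

Adam adds a second accumulator (a running average of squared gradients) and divides the step by its square root, which is what gives each parameter its own effective learning rate.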

For APAC mid-market teams, the practical advice is to take the optimiser your chosen architecture or training recipe specifies and tune the learning rate carefully. AdamW with a learning rate in the 5e-5 to 1e-4 range is a strong default for fine-tuning pretrained transformers. SGD with momentum 0.9 and a learning rate of 0.01–0.1 is a strong default for vision CNNs trained from scratch. Deviating from these defaults rarely helps before you have exhausted more impactful levers.

The non-obvious performance trap is **batch size and learning rate coupling**. Scaling the batch size up without scaling the learning rate up to match (or vice versa) changes training dynamics in subtle ways — the gradient noise that acts as effective regularisation decreases, and the model can generalise worse despite lower training loss. The usual heuristics: linear scaling for mid-sized batches, square-root scaling for large batches, and warmup schedules to survive the first few steps at a high rate.
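The heuristics above can be written as tiny schedule helpers (the base values here are illustrative, not tuned recommendations):

```python
def scaled_lr(base_lr, base_batch, batch, rule="linear"):
    """Rescale the learning rate when the batch size changes from its baseline."""
    ratio = batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

def warmup_lr(step, target_lr, warmup_steps=500):
    """Ramp the rate linearly from ~0 so early noisy updates don't diverge."""
    return target_lr * min(1.0, (step + 1) / warmup_steps)

lr = scaled_lr(0.1, base_batch=256, batch=1024)   # linear rule: 4x batch → 4x rate
lrs = [warmup_lr(s, lr) for s in range(500)]      # ramps up to lr over 500 steps
```

In practice the warmup target is itself the scaled rate, as above: first decide the rate implied by your batch size, then ramp toward it.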

Where AIMenta applies this

Service lines where this concept becomes a deliverable for clients.

Beyond this term

Where this concept ships in practice.

Encyclopedia entries name the moving parts. The links below show where AIMenta turns these concepts into engagements — across service pillars, industry verticals, and Asian markets.
