intermediate · Machine Learning

Adam Optimizer

An adaptive-learning-rate optimiser that combines momentum with per-parameter scaling — the default optimiser for training Transformers and most deep networks.

The Adam optimiser (Kingma & Ba, 2015) combines two ideas from optimiser research: **momentum** (an exponential moving average of past gradients that smooths the update direction and accelerates along consistent slopes) and **per-parameter adaptive learning rates** (dividing each parameter's update by the square root of a running average of its squared gradients, so parameters with noisy or large gradients take smaller steps than parameters with steady ones). The combination produced an optimiser that worked well out of the box on a wide range of deep-learning tasks with far less hyperparameter tuning than plain SGD with momentum required, and Adam rapidly became the default.
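The two ideas are visible in the update rule itself. A minimal NumPy sketch of a single Adam step (function names and the toy quadratic are illustrative, not library API):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, 2015).

    m is the momentum term (EMA of gradients); v is the EMA of squared
    gradients used for per-parameter scaling. The bias correction
    compensates for both moments being initialised at zero.
    """
    m = beta1 * m + (1 - beta1) * grad           # first moment: momentum
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: per-parameter scale
    m_hat = m / (1 - beta1 ** t)                 # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimise f(x) = x^2 starting from x = 3; the gradient is 2x.
x = np.array([3.0])
m = np.zeros_like(x)
v = np.zeros_like(x)
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

Note that the effective step size is roughly `lr` regardless of gradient magnitude, because `m_hat` and `sqrt(v_hat)` scale together; this is the "less tuning than SGD" property in action.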

**AdamW** (Loshchilov & Hutter, 2019) is the variant practitioners actually use today. The difference is subtle but material: classical Adam folds L2 regularisation into the gradient, where it gets rescaled by the adaptive learning rates along with everything else; AdamW decouples weight decay into a separate update step, restoring the regularisation behaviour Adam users intuitively expected. Nearly every modern Transformer is trained with AdamW. Lion (Chen et al., 2023), Shampoo (second-order), and Adafactor (memory-efficient) compete at the scale where Adam's memory footprint (two extra state tensors the size of the model) becomes prohibitive.
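The coupling problem is easiest to see side by side. A sketch of both update rules (hypothetical function names; NumPy, not a framework API):

```python
import numpy as np

def adam_l2_update(theta, grad, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """Classical Adam with L2 regularisation: the decay term enters the
    gradient, so it is rescaled by 1/sqrt(v_hat) like everything else."""
    g = grad + wd * theta                         # L2 folded into the gradient
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_update(theta, grad, m, v, t, lr, wd, b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: the moments see only the raw gradient; weight decay is
    applied afterwards as a separate, un-adapted shrinkage step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta - lr * wd * theta, m, v          # decoupled decay
```

With a zero loss gradient, AdamW shrinks a weight by the small factor `lr * wd` per step, while Adam-with-L2 normalises the decay term against itself and takes a nearly full `lr`-sized step, which is the pathological interaction the AdamW paper fixed.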

For APAC mid-market teams, the practical rule is straightforward: **use AdamW for anything involving Transformers or modern deep networks unless you have a specific reason not to**. Common learning-rate sweet spots are 5e-5 to 1e-4 for fine-tuning pretrained models and around 1e-3 for training smaller networks from scratch. Warmup schedules (linear warmup over the first 5-10% of steps, typically followed by cosine decay) are standard and worth including by default. For vision CNNs trained from scratch, SGD with momentum often still outperforms AdamW; for almost everything else, AdamW is the safe starting point.
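A linear-warmup-then-cosine-decay schedule of this shape fits in a few lines. A minimal sketch (the function name and the 5e-5 / 5% defaults are illustrative fine-tuning values, not a prescription):

```python
import math

def lr_at_step(step, total_steps, peak_lr=5e-5, warmup_frac=0.05):
    """Learning rate at a given step: linear warmup over the first
    warmup_frac of training, then cosine decay from peak_lr to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps            # linear ramp from 0
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))  # cosine decay
```

In practice you would feed this into your framework's scheduler hook rather than hand-rolling it, but the shape — ramp up, then anneal — is the same everywhere.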

The non-obvious operational note: **Adam keeps two extra state tensors, each the size of the model** (one for the momentum estimate, one for the variance estimate). For multi-billion-parameter models this is a real constraint; Adafactor, 8-bit Adam, and CPU-offloaded optimiser states all exist to compress it. If you are fine-tuning a 70B model on a single 80GB GPU, the optimiser state often determines whether it fits — and 8-bit AdamW or a low-memory alternative is the lever that usually makes the difference.
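The arithmetic is worth doing once. A back-of-envelope sketch (the function and its byte-size assumptions — bf16 weights, fp32 moments — are illustrative, not a full memory planner, and it ignores gradients, activations, and master weights):

```python
def optimizer_memory_gb(n_params, param_bytes=2, moment_bytes=4, n_moments=2):
    """Rough decimal-GB budget: weights at param_bytes each, plus
    n_moments optimiser state tensors at moment_bytes per parameter."""
    weights_gb = n_params * param_bytes / 1e9
    state_gb = n_params * moment_bytes * n_moments / 1e9
    return weights_gb, state_gb

w, s = optimizer_memory_gb(70e9)                       # 70B-parameter model
w8, s8 = optimizer_memory_gb(70e9, moment_bytes=1)     # 8-bit optimiser states
# w = 140 GB of bf16 weights; s = 560 GB of fp32 Adam state — far beyond
# a single 80 GB GPU. With 1-byte moments, s8 drops to 140 GB.
```

Under these assumptions the optimiser state alone is 4× the weight memory, which is why 8-bit states and Adafactor-style factorised moments are the first levers to pull.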
