Gradient descent is the engine behind nearly every machine learning model you'll encounter, from simple linear regression to deep neural networks with millions of parameters. Understanding how these algorithms navigate the loss landscape isn't just academic; it's the difference between a model that converges efficiently and one that stalls, oscillates, or settles into a poor local minimum. You're being tested on your ability to analyze convergence behavior, compare computational trade-offs, and select appropriate optimizers for different problem structures.
The algorithms in this guide demonstrate core numerical principles: iterative optimization, adaptive step sizes, momentum dynamics, and the trade-off between computational cost and convergence stability. Don't just memorize which algorithm uses which update rule; know why each modification improves performance and when each approach is most appropriate. FRQs often ask you to justify optimizer selection or analyze convergence properties, so focus on the underlying mechanisms.
First-Order Methods: Basic Gradient Updates
These foundational algorithms compute gradients directly from data and update parameters proportionally. The key trade-off is between gradient accuracy (using more data) and computational efficiency (using less).
Batch Gradient Descent
Computes gradients over the entire dataset: provides the true gradient direction at each iteration, ensuring deterministic updates
Stable convergence path with no noise in updates, making it ideal for convex optimization problems with smooth loss surfaces
Computational cost scales linearly with dataset size, making it impractical for large-scale problems where n exceeds memory limits
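To make the update rule concrete, here is a minimal batch gradient descent sketch for a least-squares objective. The synthetic data, learning rate, and iteration count are illustrative assumptions rather than values taken from this guide.

```python
import numpy as np

def batch_gradient_descent(X, y, lr=0.1, n_iters=500):
    """Minimize the mean squared error of a linear model using the full dataset each step."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        residual = X @ w - y          # predictions minus targets, shape (n,)
        grad = X.T @ residual / n     # exact gradient averaged over all n samples
        w -= lr * grad                # deterministic update along the true gradient
    return w

# Illustrative check: recover known weights from synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=200)
print(batch_gradient_descent(X, y))   # close to [2.0, -1.0, 0.5]
```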
Stochastic Gradient Descent (SGD)
Updates parameters using a single data point: each iteration costs O(1) in terms of data access, enabling rapid iteration
Inherent noise acts as regularization and can help escape shallow local minima in non-convex landscapes
High variance in updates causes the loss function to fluctuate, requiring careful learning rate scheduling for convergence
Mini-Batch Gradient Descent
Uses a subset of b samples per update: balances gradient accuracy against computational efficiency with typical batch sizes of 32–256
Enables vectorized computation on GPUs, achieving better hardware utilization than pure SGD while maintaining stochastic benefits
Standard choice in practice for deep learning, offering the best trade-off between convergence speed and stability
Compare: Batch vs. SGD. Both compute gradients from data, but Batch uses all n points (low variance, high cost) while SGD uses one (high variance, low cost). Mini-batch interpolates between them. If an FRQ asks about scalability vs. stability trade-offs, this comparison is your go-to example.
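All three variants fit into one training loop that differs only in how many samples feed each gradient estimate. A hedged sketch, reusing the illustrative least-squares setup from above: batch_size equal to the dataset size recovers batch GD, batch_size=1 recovers SGD, and anything in between is mini-batch GD.

```python
import numpy as np

def minibatch_gd(grad_fn, w0, X, y, lr=0.05, batch_size=32, epochs=20, seed=0):
    """Generic loop: batch_size=len(X) gives batch GD, batch_size=1 gives SGD."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)                       # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= lr * grad_fn(w, X[batch], y[batch])   # noisy estimate of the full gradient
    return w

# Illustrative least-squares gradient on a batch
grad_mse = lambda w, Xb, yb: Xb.T @ (Xb @ w - yb) / len(yb)

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=1000)
print(minibatch_gd(grad_mse, np.zeros(3), X, y, batch_size=32))  # close to [2.0, -1.0, 0.5]
```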
Momentum Methods: Accelerating Convergence
Momentum-based approaches accumulate velocity from past gradients to smooth updates and accelerate progress. The core insight is that gradient history provides information about the loss surface's curvature.
Momentum-based Gradient Descent
Accumulates a velocity term $v_t = \gamma v_{t-1} + \eta \nabla L(\theta)$, where $\gamma$ (typically 0.9) controls how much history to retain
Dampens oscillations in narrow valleys by averaging out perpendicular gradient components while reinforcing consistent directions
Accelerates convergence from $O(1/t)$ to $O(1/t^2)$ for convex functions under appropriate conditions
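A minimal sketch of the update above, run on an illustrative ill-conditioned quadratic (a narrow valley where plain gradient descent would zig-zag); the matrix and hyperparameters are assumptions chosen for demonstration.

```python
import numpy as np

def momentum_gd(grad_fn, w0, lr=0.04, gamma=0.9, n_iters=200):
    """Classical (heavy-ball) momentum: v_t = gamma*v_{t-1} + lr*grad(theta), then theta -= v_t."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(n_iters):
        v = gamma * v + lr * grad_fn(w)   # accumulate velocity from past gradients
        w = w - v                         # step along the smoothed direction
    return w

# Illustrative narrow valley: curvature 25x larger in one coordinate than the other
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w                    # gradient of the quadratic 0.5 * w^T A w
print(momentum_gd(grad, np.array([10.0, 1.0])))  # approaches [0, 0]
```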
Nesterov Accelerated Gradient (NAG)
Evaluates the gradient at the "lookahead" position $\theta - \gamma v_{t-1}$ rather than at the current parameters, a subtle but powerful correction
Provides anticipatory updates that reduce overshooting by incorporating where momentum is already taking you
Achieves the optimal convergence rate of $O(1/k^2)$ for smooth convex functions, provably faster than standard momentum
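The sketch below, on the same illustrative quadratic, shows that the only change from classical momentum is where the gradient is evaluated.

```python
import numpy as np

def nesterov_gd(grad_fn, w0, lr=0.04, gamma=0.9, n_iters=200):
    """Nesterov momentum: evaluate the gradient at the lookahead point theta - gamma*v."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(n_iters):
        lookahead = w - gamma * v                # where the accumulated velocity is already heading
        v = gamma * v + lr * grad_fn(lookahead)  # gradient at the projected position
        w = w - v
    return w

# Same illustrative quadratic as the momentum sketch
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w
print(nesterov_gd(grad, np.array([10.0, 1.0])))  # approaches [0, 0]
```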
Compare: Momentum vs. NAG. Both use velocity accumulation, but NAG computes gradients at the projected position rather than the current one. This "lookahead" reduces oscillations near minima. NAG is particularly effective when you need precise convergence in the final stages of optimization.
Adaptive Learning Rate Methods: Per-Parameter Step Sizes
These algorithms automatically adjust learning rates for each parameter based on gradient history. The key principle is that parameters with large historical gradients should take smaller steps, and vice versa.
Adagrad
Accumulates squared gradients in a diagonal matrix $G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau$, scaling updates by $1/\sqrt{G_t + \epsilon}$
Excels with sparse features by giving infrequent parameters larger effective learning ratesโcritical for NLP and recommendation systems
Learning rate monotonically decreases and can become vanishingly small, causing premature convergence before reaching the optimum
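A minimal Adagrad sketch using only the diagonal accumulator; the badly scaled quadratic below is an illustrative stand-in, not a sparse-feature problem from this guide.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, eps=1e-8, n_iters=500):
    """Adagrad: divide each parameter's step by the root of its accumulated squared gradients."""
    w = w0.copy()
    G = np.zeros_like(w)                  # running sum of squared gradients (diagonal of G_t)
    for _ in range(n_iters):
        g = grad_fn(w)
        G += g * g                        # the accumulation never decays
        w -= lr * g / (np.sqrt(G) + eps)  # per-parameter effective learning rate
    return w

# Illustrative quadratic with a 10^4 spread in curvature between coordinates
A = np.diag([100.0, 0.01])
grad = lambda w: A @ w
print(adagrad(grad, np.array([1.0, 1.0])))  # both coordinates shrink at the same rate
```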
RMSprop
Uses an exponential moving average of squared gradients: $E[g^2]_t = \rho E[g^2]_{t-1} + (1-\rho)\, g_t^2$, with typical $\rho = 0.9$
Prevents learning rate decay by "forgetting" old gradients, maintaining effective updates throughout training
Designed for non-stationary objectives where the loss surface changes during training; standard for recurrent neural networks
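The RMSprop sketch below differs from the Adagrad sketch in exactly one line: the running sum is replaced by an exponential moving average, so step sizes no longer shrink toward zero. Hyperparameters follow the typical values quoted above; the test problem is again illustrative.

```python
import numpy as np

def rmsprop(grad_fn, w0, lr=0.01, rho=0.9, eps=1e-8, n_iters=1000):
    """RMSprop: exponential moving average of squared gradients instead of a running sum."""
    w = w0.copy()
    Eg2 = np.zeros_like(w)
    for _ in range(n_iters):
        g = grad_fn(w)
        Eg2 = rho * Eg2 + (1 - rho) * g * g   # old gradients are gradually forgotten
        w -= lr * g / (np.sqrt(Eg2) + eps)    # effective step size does not decay to zero
    return w

# Same illustrative badly scaled quadratic as the Adagrad sketch
A = np.diag([100.0, 0.01])
grad = lambda w: A @ w
print(rmsprop(grad, np.array([1.0, 1.0])))    # both coordinates are driven close to zero
```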
Adam (Adaptive Moment Estimation)
Combines momentum and RMSprop by tracking both the first moment $m_t$ (mean) and the second moment $v_t$ (uncentered variance) of the gradients
Includes bias correction terms $\hat{m}_t = m_t / (1 - \beta_1^t)$ and $\hat{v}_t = v_t / (1 - \beta_2^t)$ to account for initializing both moments at zero
Default optimizer for deep learning due to robust performance across architectures with minimal hyperparameter tuning ($\beta_1 = 0.9$, $\beta_2 = 0.999$)
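A compact Adam sketch showing both moment estimates and the bias-correction terms; the learning rate and test problem are illustrative assumptions.

```python
import numpy as np

def adam(grad_fn, w0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, n_iters=2000):
    """Adam: momentum on the first moment, RMSprop-style scaling on the second, plus bias correction."""
    w = w0.copy()
    m = np.zeros_like(w)                 # first moment estimate (mean of gradients)
    v = np.zeros_like(w)                 # second moment estimate (uncentered variance)
    for t in range(1, n_iters + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)     # bias correction: both moments start at zero
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Same illustrative badly scaled quadratic as above
A = np.diag([100.0, 0.01])
grad = lambda w: A @ w
print(adam(grad, np.array([1.0, 1.0])))  # both coordinates are driven toward zero at similar rates
```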
Compare: Adagrad vs. RMSprop vs. Adam. All adapt learning rates per parameter, but Adagrad accumulates indefinitely (good for sparse, convex problems), RMSprop uses exponential decay (good for non-stationary objectives), and Adam adds momentum on top (best general-purpose choice). Know that Adam can sometimes generalize worse than SGD with momentum on certain problems.
Second-Order and Quasi-Newton Methods: Curvature Information
These methods incorporate or approximate second-derivative information to achieve faster convergence. The trade-off is between the superior convergence rate and the cost of computing or storing curvature information.
L-BFGS (Limited-memory BFGS)
Approximates the inverse Hessian from the m most recent pairs of parameter and gradient differences (typically m = 10), requiring $O(mn)$ storage instead of $O(n^2)$
Achieves superlinear convergence near the optimum by incorporating curvature information without explicit Hessian computation
Preferred for convex optimization in statistics and classical ML (logistic regression, CRFs) where exact convergence matters more than mini-batch compatibility
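In practice L-BFGS is rarely hand-rolled; SciPy's scipy.optimize.minimize exposes it as the "L-BFGS-B" method (the bound-constrained variant). The logistic-regression fit below is an illustrative example on synthetic data with labels in {-1, +1}; the data sizes and weights are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative synthetic binary classification data (labels in {-1, +1})
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = np.where(X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=500) > 0, 1.0, -1.0)

def nll(w):
    """Average negative log-likelihood of logistic regression."""
    z = y * (X @ w)
    return np.mean(np.logaddexp(0.0, -z))       # log(1 + exp(-z)), numerically stable

def grad(w):
    z = y * (X @ w)
    s = 1.0 / (1.0 + np.exp(z))                 # sigmoid(-z)
    return -(X.T @ (y * s)) / len(y)

res = minimize(nll, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(res.x, res.nfev)                          # precise solution in relatively few evaluations
```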
Conjugate Gradient
Generates search directions that are conjugate with respect to the Hessian, meaning $d_i^\top H d_j = 0$ for $i \neq j$
Solves an n-dimensional quadratic problem in at most n iterations (in exact arithmetic), making it optimal for least squares and linear systems $Ax = b$
Memory efficient at O(n) storage, making it ideal for large-scale problems where even L-BFGS's limited memory is too expensive
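A minimal conjugate gradient solver for a symmetric positive-definite linear system; the 50x50 test matrix is an illustrative assumption.

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10):
    """Solve Ax = b for symmetric positive-definite A using conjugate search directions."""
    x = np.zeros_like(b)
    r = b - A @ x                       # residual (negative gradient of 0.5*x^T A x - b^T x)
    d = r.copy()                        # first search direction
    rs_old = r @ r
    for _ in range(len(b)):             # at most n iterations in exact arithmetic
        Ad = A @ d
        alpha = rs_old / (d @ Ad)       # exact line search along d
        x += alpha * d
        r -= alpha * Ad
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d   # next direction, conjugate to all previous ones
        rs_old = rs_new
    return x

# Illustrative symmetric positive-definite system
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T + 50 * np.eye(50)
b = rng.normal(size=50)
print(np.allclose(A @ conjugate_gradient(A, b), b))  # True
```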
Compare: L-BFGS vs. Conjugate Gradient. Both exploit curvature without storing the full Hessian, but L-BFGS approximates it from gradient history while CG generates conjugate directions iteratively. L-BFGS is more general; CG is optimal for quadratic/linear problems. Neither handles stochastic gradients well, limiting their use in deep learning.
Quick Reference Table
| Concept | Best Examples |
| --- | --- |
| Deterministic vs. Stochastic Updates | Batch GD, SGD, Mini-batch |
| Momentum Acceleration | Momentum, NAG |
| Adaptive Learning Rates | Adagrad, RMSprop, Adam |
| Sparse Data Handling | Adagrad, Adam |
| Second-Order Approximation | L-BFGS, Conjugate Gradient |
| Deep Learning Default | Adam, SGD with Momentum |
| Convex Optimization | L-BFGS, Conjugate Gradient, Batch GD |
| Non-Stationary Objectives | RMSprop, Adam |
Self-Check Questions
Which two algorithms both use momentum, and how does NAG's "lookahead" modification improve upon standard momentum?
Compare Adagrad and RMSprop: what problem does RMSprop solve, and why does this matter for training deep networks over many epochs?
You're training a model on a dataset with 10 million samples. Rank Batch GD, SGD, and Mini-batch GD in terms of (a) per-iteration cost and (b) convergence stability. Which would you choose and why?
Adam combines ideas from which two algorithms? Explain why bias correction is necessary in the early iterations of training.
An FRQ asks you to select an optimizer for fitting a logistic regression model to a moderately sized convex problem where you need high precision. Would you choose Adam or L-BFGS? Justify your answer using convergence properties.