Gradient descent is the engine behind nearly every machine learning model you'll encounter—from simple linear regression to deep neural networks with millions of parameters. Understanding how these algorithms navigate the loss landscape isn't just academic; it's the difference between a model that converges efficiently and one that stalls, oscillates, or settles into a poor local minimum. You're being tested on your ability to analyze convergence behavior, compare computational trade-offs, and select appropriate optimizers for different problem structures.
The algorithms in this guide demonstrate core numerical principles: iterative optimization, adaptive step sizes, momentum dynamics, and the trade-off between computational cost and convergence stability. Don't just memorize which algorithm uses which update rule—know why each modification improves performance and when each approach is most appropriate. FRQs often ask you to justify optimizer selection or analyze convergence properties, so focus on the underlying mechanisms.
These foundational algorithms compute gradients directly from data and update parameters proportionally. The key trade-off is between gradient accuracy (using more data) and computational efficiency (using less).
Compare: Batch vs. SGD—both compute gradients from data, but Batch uses all points (low variance, high cost) while SGD uses one (high variance, low cost). Mini-batch interpolates between them. If an FRQ asks about scalability vs. stability trade-offs, this comparison is your go-to example.
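The three variants differ only in how many samples feed each gradient estimate. Here is a minimal NumPy sketch on a synthetic linear-regression problem; the dataset, function names, and hyperparameters are illustrative assumptions, not part of the guide:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)

def mse_gradient(w, Xb, yb):
    """Gradient of the mean squared error (1/2n)*||Xb @ w - yb||^2 w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

def train(w, lr, n_steps, batch_size):
    """batch_size == len(X) -> batch GD; 1 -> SGD; anything else -> mini-batch."""
    for _ in range(n_steps):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        w = w - lr * mse_gradient(w, X[idx], y[idx])
    return w

# Same update rule, different sample counts per step:
w_batch = train(np.zeros(3), lr=0.1,  n_steps=500,  batch_size=len(X))
w_sgd   = train(np.zeros(3), lr=0.05, n_steps=5000, batch_size=1)
w_mini  = train(np.zeros(3), lr=0.1,  n_steps=1000, batch_size=32)
```

The cost/variance trade-off shows up directly: batch GD pays 200 gradient evaluations per step for a smooth trajectory, SGD pays one per step but wanders, and the mini-batch run sits in between.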
Momentum-based approaches accumulate velocity from past gradients to smooth updates and accelerate progress. The core insight is that gradient components pointing in a consistent direction reinforce one another in the velocity term, while oscillating components cancel out, which damps bouncing along high-curvature directions of the loss surface.
Compare: Momentum vs. NAG—both use velocity accumulation, but NAG computes gradients at the projected position rather than the current one. This "lookahead" reduces oscillations near minima. NAG is particularly effective when you need precise convergence in the final stages of optimization.
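The two update rules differ only in where the gradient is evaluated. A minimal sketch on an ill-conditioned quadratic makes the "lookahead" concrete; the test function and hyperparameters are illustrative assumptions:

```python
import numpy as np

def grad(w):
    # Gradient of the ill-conditioned quadratic f(w) = 0.5 * (w1^2 + 25 * w2^2)
    return np.array([1.0, 25.0]) * w

def momentum_step(w, v, lr=0.01, beta=0.9):
    # Classical momentum: gradient evaluated at the CURRENT position w
    v = beta * v - lr * grad(w)
    return w + v, v

def nag_step(w, v, lr=0.01, beta=0.9):
    # NAG: gradient evaluated at the PROJECTED position w + beta*v
    v = beta * v - lr * grad(w + beta * v)
    return w + v, v

w_m, v_m = np.array([1.0, 1.0]), np.zeros(2)
w_n, v_n = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(200):
    w_m, v_m = momentum_step(w_m, v_m)
    w_n, v_n = nag_step(w_n, v_n)
```

The only code difference is `grad(w)` versus `grad(w + beta * v)`; evaluating at the lookahead point lets NAG correct the velocity before overshooting.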
These algorithms automatically adjust learning rates for each parameter based on gradient history. The key principle is that parameters with large historical gradients should take smaller steps, and vice versa.
Compare: Adagrad vs. RMSprop vs. Adam—all adapt learning rates per-parameter, but Adagrad accumulates squared gradients indefinitely (good for sparse or convex problems, though its effective learning rate decays toward zero), RMSprop replaces the running sum with an exponential moving average (good for non-stationary objectives), and Adam adds momentum on top (the best general-purpose default). Know that Adam can sometimes generalize worse than SGD with momentum on certain problems.
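The three update rules can be lined up side by side; each differs only in how the squared-gradient statistic is maintained. A minimal sketch on a simple convex objective (the test function and hyperparameters are illustrative assumptions):

```python
import numpy as np

def grad(w):
    return 2.0 * w  # gradient of f(w) = ||w||^2

def adagrad(w, steps=500, lr=0.5, eps=1e-8):
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        G = G + g**2                     # accumulates forever -> step sizes shrink monotonically
        w = w - lr * g / (np.sqrt(G) + eps)
    return w

def rmsprop(w, steps=500, lr=0.05, rho=0.9, eps=1e-8):
    G = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        G = rho * G + (1 - rho) * g**2   # exponential decay forgets stale gradients
        w = w - lr * g / (np.sqrt(G) + eps)
    return w

def adam(w, steps=500, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g        # first moment (momentum)
        v = b2 * v + (1 - b2) * g**2     # second moment (RMSprop-style)
        m_hat = m / (1 - b1**t)          # bias correction: m, v start at zero,
        v_hat = v / (1 - b2**t)          # so early estimates are biased low
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

w0 = np.array([1.0, 1.0])
w_ada, w_rms, w_adam = adagrad(w0), rmsprop(w0), adam(w0)
```

The bias-correction lines show why Adam's early iterations need the `1 - b**t` factors: without them, the zero-initialized moment estimates would understate the true gradient statistics and the first steps would be far too small.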
These methods incorporate or approximate second-derivative information to achieve faster convergence. The trade-off is between a superior convergence rate and the cost of computing or storing curvature information.
Compare: L-BFGS vs. Conjugate Gradient—both exploit curvature without storing the full Hessian, but L-BFGS approximates it from gradient history while CG generates conjugate directions iteratively. L-BFGS is more general; CG is optimal for quadratic/linear problems. Neither handles stochastic gradients well, limiting their use in deep learning.
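The conjugate-direction idea is easiest to see in the quadratic case, where CG is exact in at most n steps. A textbook sketch for minimizing f(x) = ½xᵀAx − bᵀx with symmetric positive definite A (the matrix and tolerances are illustrative assumptions):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 * x @ A @ x - b @ x for symmetric positive definite A.
    In exact arithmetic this converges in at most len(b) iterations."""
    x = x0.copy()
    r = b - A @ x                       # residual = negative gradient
    p = r.copy()                        # first search direction
    rs_old = r @ r
    for _ in range(len(b)):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)       # exact line search along p
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # new direction is A-conjugate to the old ones
        rs_old = rs_new
    return x

rng = np.random.default_rng(1)
M = rng.normal(size=(5, 5))
A = M @ M.T + 5 * np.eye(5)             # symmetric positive definite by construction
b = rng.normal(size=5)
x = conjugate_gradient(A, b, np.zeros(5))
```

Note that no Hessian is ever stored: each iteration touches A only through the product `A @ p`, which is the same curvature-without-storage property that makes L-BFGS attractive at scale.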
| Concept | Best Examples |
|---|---|
| Deterministic vs. Stochastic Updates | Batch GD, SGD, Mini-batch |
| Momentum Acceleration | Momentum, NAG |
| Adaptive Learning Rates | Adagrad, RMSprop, Adam |
| Sparse Data Handling | Adagrad, Adam |
| Second-Order Approximation | L-BFGS, Conjugate Gradient |
| Deep Learning Default | Adam, SGD with Momentum |
| Convex Optimization | L-BFGS, Conjugate Gradient, Batch GD |
| Non-Stationary Objectives | RMSprop, Adam |
1. Which two algorithms both use momentum, and how does NAG's "lookahead" modification improve upon standard momentum?
2. Compare Adagrad and RMSprop: what problem does RMSprop solve, and why does this matter for training deep networks over many epochs?
3. You're training a model on a dataset with 10 million samples. Rank Batch GD, SGD, and Mini-batch GD in terms of (a) per-iteration cost and (b) convergence stability. Which would you choose and why?
4. Adam combines ideas from which two algorithms? Explain why bias correction is necessary in the early iterations of training.
5. An FRQ asks you to select an optimizer for fitting a logistic regression model to a moderately-sized convex problem where you need high precision. Would you choose Adam or L-BFGS? Justify your answer using convergence properties.