Gradient descent is the workhorse of nonlinear optimization: it's how we actually solve the problems you've been setting up all semester. When you're minimizing a loss function, training a model, or finding optimal parameters, you're almost certainly using some variant of gradient descent. The exam will test whether you understand not just what these methods do, but why certain variants work better in specific situations: convergence speed, stability, computational cost, and adaptivity are the principles that distinguish one method from another.
Here's the key insight: these methods didn't evolve randomly. Each new technique was designed to fix a specific weakness in earlier approaches. Basic gradient descent is slow? Add momentum. Learning rates too rigid? Make them adaptive. Understanding this evolutionary logic means you can predict which method fits which problem, and that's exactly what FRQs will ask you to do. Don't just memorize update rules; know what problem each method solves and when it fails.
These are the core algorithms that establish the fundamental trade-off in gradient descent: accuracy of gradient estimates versus computational efficiency. Every advanced method builds on these foundations.
Compare: Basic Gradient Descent vs. SGD. Both follow the negative gradient, but SGD trades accuracy for speed by using noisy estimates. If an FRQ asks about scalability to large datasets, SGD (or mini-batch) is your answer; if it asks about guaranteed descent, basic gradient descent wins.
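To make the trade-off concrete, here is a minimal sketch of the two update rules. The names (`grad_fn`, `X`, `y`, `lr`) are placeholders for illustration, not any particular library's API; the only real difference between the two steps is how much data feeds the gradient estimate.

```python
import numpy as np

def full_batch_step(w, grad_fn, X, y, lr=0.1):
    """Basic gradient descent: exact gradient over the whole dataset (accurate but costly per step)."""
    return w - lr * grad_fn(w, X, y)

def sgd_step(w, grad_fn, X, y, lr=0.1, rng=None):
    """SGD: noisy gradient from one randomly chosen example (cheap per step, noisy direction)."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(X))
    return w - lr * grad_fn(w, X[i:i+1], y[i:i+1])
```

Mini-batch gradient descent sits between the two: slice out a small random batch instead of a single example, trading a little extra computation per step for a much less noisy gradient estimate.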
These methods address a critical weakness of vanilla gradient descent: slow progress in directions with small but consistent gradients, and oscillation in directions with large but inconsistent gradients. The key mechanism is accumulating velocity over time.
Compare: Momentum vs. Nesterov. Both accumulate velocity, but Nesterov evaluates the gradient at the anticipated position rather than the current one. This look-ahead mechanism gives NAG better theoretical guarantees and often faster practical convergence.
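A minimal sketch of the two updates, assuming a `grad_fn` that returns the gradient at a given point; NAG appears in several equivalent parameterizations in textbooks, and this is one common form:

```python
def momentum_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Classical momentum: blend the current gradient into a running velocity, then step with it."""
    v = beta * v + grad_fn(w)
    return w - lr * v, v

def nesterov_step(w, v, grad_fn, lr=0.01, beta=0.9):
    """Nesterov (NAG): same velocity update, but the gradient is taken at the look-ahead point."""
    v = beta * v + grad_fn(w - lr * beta * v)  # gradient at the anticipated position, not the current one
    return w - lr * v, v
```

The single changed line, where `grad_fn` is called, is the entire difference between the two methods.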
The core insight here is that different parameters may need different learning rates. Parameters associated with rare features need larger updates; those with frequent features need smaller ones. These methods automatically adjust learning rates based on gradient history.
Compare: Adagrad vs. RMSprop vs. Adam. All three adapt learning rates per parameter, but Adagrad accumulates squared gradients forever (problematic for long training), RMSprop uses exponential decay (fixing the accumulation issue), and Adam adds momentum on top of RMSprop's adaptation. For exam purposes: Adagrad for sparse data, RMSprop for non-stationary problems, Adam as a reliable default.
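The three update rules below are a compact sketch of the standard formulas; the state variables (`cache`, `m`, `v`, `t`) would be carried from step to step by a training loop, which is omitted here.

```python
import numpy as np

EPS = 1e-8  # small constant to avoid division by zero

def adagrad_step(w, g, cache, lr=0.01):
    """Adagrad: squared gradients accumulate forever, so the effective rate only shrinks."""
    cache = cache + g ** 2
    return w - lr * g / (np.sqrt(cache) + EPS), cache

def rmsprop_step(w, g, cache, lr=0.001, rho=0.9):
    """RMSprop: exponentially decaying average of squared gradients; old history fades out."""
    cache = rho * cache + (1 - rho) * g ** 2
    return w - lr * g / (np.sqrt(cache) + EPS), cache

def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999):
    """Adam: momentum-style first moment plus RMSprop-style second moment, with bias correction (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)  # correct the bias toward zero in early iterations
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + EPS), m, v
```

Notice that Adagrad's `cache` can only grow, which is exactly why its effective learning rate can decay until training stalls; RMSprop's decaying average and Adam's bias-corrected moments are the fixes.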
Rather than choosing a fixed learning rate in advance, these methods compute the optimal (or near-optimal) step size at each iteration. The trade-off is additional computation per iteration versus faster overall convergence.
Compare: Line Search vs. Trust Regionโline search picks a direction first then finds the step size, while trust region picks a region first then finds the best step within it. Trust region methods are generally more robust for ill-conditioned problems, but line search is simpler to implement and often sufficient for well-behaved objectives.
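As an illustration of the line-search side, here is a minimal backtracking (Armijo) line search sketch; `f`, `grad`, and `direction` are placeholders, and `direction` is assumed to be a descent direction such as the negative gradient.

```python
import numpy as np

def backtracking_line_search(f, grad, w, direction, alpha=1.0, shrink=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition is satisfied."""
    slope = grad(w) @ direction  # directional derivative; negative for a descent direction
    while f(w + alpha * direction) > f(w) + c * alpha * slope:
        alpha *= shrink  # step too ambitious: cut it back and try again
    return alpha
```

A trust-region method inverts this logic: it fixes a maximum step radius, minimizes a local model of the objective inside that radius, then grows or shrinks the radius depending on how well the model's prediction matched the actual decrease.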
| Concept | Best Examples |
|---|---|
| Computational efficiency | SGD, Mini-batch Gradient Descent |
| Escaping local minima | SGD (via noise), Momentum |
| Accelerated convergence | Momentum, Nesterov Accelerated Gradient |
| Sparse data handling | Adagrad |
| Adaptive learning rates | Adagrad, RMSprop, Adam |
| Non-stationary objectives | RMSprop, Adam |
| Robust step size selection | Line Search, Trust Region |
| General-purpose deep learning | Adam, Mini-batch with Momentum |
Which two methods both use adaptive per-parameter learning rates but handle gradient accumulation differently? What problem does the second method solve that the first creates?
If you're optimizing a model with highly sparse features (most gradients are zero most of the time), which method would you choose and why?
Compare and contrast Momentum and Nesterov Accelerated Gradient. Both accumulate velocityโwhat specific modification gives NAG its theoretical advantage?
An FRQ describes a non-convex loss landscape where the optimizer keeps getting stuck. Which property of SGD might actually help here, and which momentum-based method could accelerate escape from shallow minima?
You're given a problem where the learning rate decays too quickly and training stalls. Which method is likely being used, and what modification (embodied in which alternative method) would fix this?