📈 Nonlinear Optimization

Key Concepts of Gradient Descent Methods

Why This Matters

Gradient descent is the workhorse of nonlinear optimization: it's how we actually solve the problems you've been setting up all semester. When you're minimizing a loss function, training a model, or finding optimal parameters, you're almost certainly using some variant of gradient descent. The exam will test whether you understand not just what these methods do, but why certain variants work better in specific situations: convergence speed, stability, computational cost, and adaptivity are the principles that distinguish one method from another.

Here's the key insight: these methods didn't evolve randomly. Each new technique was designed to fix a specific weakness in earlier approaches. Basic gradient descent is slow? Add momentum. Learning rates too rigid? Make them adaptive. Understanding this evolutionary logic means you can predict which method fits which problem, and that's exactly what FRQs will ask you to do. Don't just memorize update rules; know what problem each method solves and when it fails.


Foundational Methods: The Building Blocks

These are the core algorithms that establish the fundamental trade-off in gradient descent: accuracy of gradient estimates versus computational efficiency. Every advanced method builds on these foundations.

Basic Gradient Descent

  • Full-batch gradient computation: uses the entire dataset to compute the exact gradient at each iteration, giving the most accurate descent direction
  • Deterministic updates ensure consistent, predictable behavior but require $O(n)$ operations per step, where $n$ is the dataset size
  • Slow convergence on high-dimensional or ill-conditioned problems makes this impractical for large-scale optimization
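
To make the per-step cost concrete, here is a minimal NumPy sketch of the full-batch update on a toy least-squares objective; the data, learning rate, and iteration count are illustrative choices, not prescribed values.

```python
import numpy as np

# Toy least-squares objective: f(w) = (1/2n) * ||Xw - y||^2  (illustrative data)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

def full_gradient(w):
    # Uses every row of X: O(n) work per step, but the exact descent direction.
    return X.T @ (X @ w - y) / len(y)

w, eta = np.zeros(5), 0.1          # eta is a hypothetical fixed learning rate
for t in range(500):
    w -= eta * full_gradient(w)    # deterministic: identical run every time
```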

Stochastic Gradient Descent (SGD)

  • Single-sample gradient estimates: updates parameters using just one randomly selected data point, enabling $O(1)$ cost per iteration
  • Inherent noise in updates can help escape shallow local minima and saddle points, acting as implicit regularization
  • Oscillatory behavior near the optimum requires learning rate schedules or decay to achieve convergence
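
A matching sketch of the stochastic variant on the same toy data: each step uses one randomly drawn row, and a decaying step size (an illustrative $1/\sqrt{t}$ schedule) tames the oscillation near the optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)   # same toy data as above

w, eta0 = np.zeros(5), 0.2
for t in range(1, 5001):
    i = rng.integers(len(y))             # one random sample: O(1) work per step
    g = X[i] * (X[i] @ w - y[i])         # noisy, unbiased gradient estimate
    w -= (eta0 / np.sqrt(t)) * g         # decaying step size for convergence
```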

Mini-batch Gradient Descent

  • Batch size $b$ balances the variance-computation trade-off, typically ranging from 32 to 512 in practice
  • Vectorized operations exploit GPU parallelism, making this faster than pure SGD despite using more data per update
  • Reduced variance compared to SGD leads to smoother convergence while maintaining computational efficiency
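
A sketch of the mini-batch compromise, again assuming the same toy data; the batch size of 32 is just one common choice from the 32 to 512 range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

w, eta, b = np.zeros(5), 0.1, 32
for t in range(2000):
    idx = rng.choice(len(y), size=b, replace=False)    # random mini-batch
    Xb, yb = X[idx], y[idx]
    w -= eta * (Xb.T @ (Xb @ w - yb) / b)              # vectorized batch gradient
```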

Compare: Basic Gradient Descent vs. SGD. Both follow the negative gradient, but SGD trades accuracy for speed by using noisy estimates. If an FRQ asks about scalability to large datasets, SGD (or mini-batch) is your answer; if it asks about guaranteed descent, basic gradient descent wins.


Momentum-Based Methods: Accelerating Convergence

These methods address a critical weakness of vanilla gradient descent: slow progress in directions with small but consistent gradients, and oscillation in directions with large but inconsistent gradients. The key mechanism is accumulating velocity over time.

Momentum

  • Velocity accumulation adds a fraction $\beta$ (typically 0.9) of the previous update to the current one: $v_t = \beta v_{t-1} + \nabla f(x_t)$, followed by the step $x_{t+1} = x_t - \eta v_t$
  • Dampens oscillations in high-curvature directions while accelerating movement along consistent gradient directions
  • Overshooting risk means the algorithm may move past the minimum, though this often self-corrects
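
A minimal sketch of this velocity update on an ill-conditioned quadratic; the diagonal matrix and hyperparameters below are illustrative choices.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x @ A @ x: plain gradient descent
# crawls along the flat axis and oscillates across the steep one.
A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.03, 0.9
for t in range(300):
    v = beta * v + grad(x)   # v_t = beta * v_{t-1} + grad f(x_t)
    x -= eta * v             # step along the accumulated velocity
```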

Nesterov Accelerated Gradient (NAG)

  • Look-ahead gradient: computes the gradient at $x_t - \beta v_{t-1}$ rather than at $x_t$, anticipating where momentum will carry the iterate
  • Corrective behavior allows the method to "brake" before overshooting, improving convergence near the optimum
  • Provably faster convergence rate of $O(1/t^2)$ versus $O(1/t)$ for standard gradient descent on convex problems
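
The same toy quadratic with the look-ahead evaluation; in this sketch the learning rate is folded into the look-ahead step, which is one of several equivalent ways to write NAG.

```python
import numpy as np

A = np.diag([1.0, 25.0])        # same illustrative quadratic as the momentum sketch
grad = lambda x: A @ x

x, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.03, 0.9
for t in range(300):
    lookahead = x - eta * beta * v   # where the momentum step alone would land
    v = beta * v + grad(lookahead)   # gradient at the anticipated point, not at x
    x -= eta * v
```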

Compare: Momentum vs. Nesterov. Both accumulate velocity, but Nesterov evaluates the gradient at the anticipated position rather than the current one. This look-ahead mechanism gives NAG better theoretical guarantees and often faster practical convergence.


Adaptive Learning Rate Methods: Per-Parameter Tuning

The core insight here is that different parameters may need different learning rates. Parameters associated with rare features need larger updates; those with frequent features need smaller ones. These methods automatically adjust learning rates based on gradient history.

Adagrad

  • Accumulated squared gradients $G_t = \sum_{\tau=1}^{t} g_\tau^2$ scale each parameter's learning rate inversely: $\eta / \sqrt{G_t + \epsilon}$
  • Automatic adaptation gives larger effective learning rates to parameters tied to infrequent features, ideal for sparse data and NLP tasks
  • Monotonically decreasing learning rates can shrink to near-zero prematurely, halting learning before convergence
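
A sketch of the per-parameter scaling on an illustrative quadratic; note that the accumulator G only grows, which is exactly the weakness named in the last bullet.

```python
import numpy as np

A = np.diag([1.0, 25.0])             # two parameters with very different curvature
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
G = np.zeros(2)                      # per-parameter sum of squared gradients
eta, eps = 0.5, 1e-8
for t in range(500):
    g = grad(x)
    G += g ** 2                          # only ever grows
    x -= eta / np.sqrt(G + eps) * g      # effective step shrinks as G accumulates
```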

RMSprop

  • Exponential moving average replaces Adagrad's sum with $G_t = \gamma G_{t-1} + (1-\gamma) g_t^2$, typically with $\gamma = 0.9$
  • Prevents learning rate collapse by "forgetting" old gradients, maintaining useful learning rates throughout training
  • Non-stationary objectives are handled well because recent gradients matter more than ancient history
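
The same loop as the Adagrad sketch with the sum replaced by an exponential moving average, the one-line change that keeps the effective learning rate from collapsing; hyperparameters remain illustrative.

```python
import numpy as np

A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
G = np.zeros(2)
eta, gamma, eps = 0.05, 0.9, 1e-8
for t in range(500):
    g = grad(x)
    G = gamma * G + (1 - gamma) * g ** 2   # recent gradients dominate; old ones decay
    x -= eta / np.sqrt(G + eps) * g
```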

Adam

  • Dual moment estimation tracks both the first moment (mean) $m_t$ and the second moment (uncentered variance) $v_t$ of the gradients
  • Bias correction divides each moment by $(1 - \beta_1^t)$ or $(1 - \beta_2^t)$ to account for initialization at zero, crucial in early iterations
  • Default optimizer in deep learning due to robust performance across problem types with minimal hyperparameter tuning
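
A sketch combining both moment estimates with bias correction; the hyperparameters are the commonly quoted defaults, used here purely for illustration on the same toy quadratic.

```python
import numpy as np

A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g          # first moment (running mean)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= eta * m_hat / (np.sqrt(v_hat) + eps)
```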

Compare: Adagrad vs. RMSprop vs. Adam. All adapt learning rates per-parameter, but Adagrad accumulates forever (problematic for long training), RMSprop uses exponential decay (fixes the accumulation issue), and Adam adds momentum on top of RMSprop's adaptation. For exam purposes: Adagrad for sparse data, RMSprop for non-stationary problems, Adam as a reliable default.


Step Size Selection: Beyond Fixed Learning Rates

Rather than choosing a fixed learning rate $\eta$, these methods compute the optimal (or near-optimal) step size at each iteration. The trade-off is additional computation per iteration versus faster overall convergence.

Line Search Methods

  • Optimal step size along the descent direction $d_t$ solves $\min_{\alpha > 0} f(x_t + \alpha d_t)$, a one-dimensional subproblem
  • Backtracking (Armijo) efficiently finds acceptable step sizes by starting large and reducing until sufficient decrease is achieved
  • Wolfe conditions add a curvature requirement on top of sufficient decrease, guaranteeing convergence for quasi-Newton methods
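
A sketch of backtracking under the Armijo condition; the constants alpha0, rho, and c are conventional illustrative values, and the Wolfe curvature check is omitted to keep the example short.

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f(x + alpha*d) <= f(x) + c*alpha*grad_f(x)@d holds."""
    alpha, fx, slope = alpha0, f(x), grad_f(x) @ d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho                 # sufficient decrease not met: halve the step
    return alpha

# One steepest-descent step on a simple quadratic
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -2.0])
d = -grad_f(x)                       # descent direction
x = x + backtracking(f, grad_f, x, d) * d
```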

Trust Region Methods

  • Region-constrained optimization solves $\min_{\|p\| \leq \Delta} m_k(p)$, where $m_k$ is a local quadratic model of the objective
  • Adaptive region size $\Delta$ grows when the model predicts well and shrinks when predictions are poor
  • Global convergence guarantees even for non-convex problems, making these methods robust for challenging optimization landscapes
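
A minimal trust-region sketch using the Cauchy point as an approximate subproblem solution and the standard ratio test to adapt $\Delta$; real implementations use stronger subproblem solvers (e.g., dogleg), and every constant below is illustrative.

```python
import numpy as np

def cauchy_point(g, B, delta):
    # Minimize the quadratic model m(p) = g @ p + 0.5 * p @ B @ p
    # along -g, subject to ||p|| <= delta.
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g) ** 3 / (delta * gBg), 1.0)
    return -tau * (delta / np.linalg.norm(g)) * g

def trust_region_step(f, g, B, x, delta):
    p = cauchy_point(g(x), B(x), delta)
    actual = f(x) - f(x + p)                         # actual reduction
    predicted = -(g(x) @ p + 0.5 * p @ B(x) @ p)     # reduction the model promised
    ratio = actual / predicted
    if ratio < 0.25:
        delta *= 0.25                                # model was poor: shrink the region
    elif ratio > 0.75:
        delta *= 2.0                                 # model was good: expand the region
    return (x + p if ratio > 0.1 else x), delta      # accept only if the step helped

# Example on a simple quadratic bowl
f = lambda x: 0.5 * x @ x
g = lambda x: x
B = lambda x: np.eye(len(x))
x, delta = np.array([3.0, -2.0]), 1.0
for _ in range(15):
    if np.linalg.norm(g(x)) < 1e-10:   # stop once (numerically) at the minimum
        break
    x, delta = trust_region_step(f, g, B, x, delta)
```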

Compare: Line Search vs. Trust Regionโ€”line search picks a direction first then finds the step size, while trust region picks a region first then finds the best step within it. Trust region methods are generally more robust for ill-conditioned problems, but line search is simpler to implement and often sufficient for well-behaved objectives.


Quick Reference Table

Concept | Best Examples
Computational efficiency | SGD, Mini-batch Gradient Descent
Escaping local minima | SGD (via noise), Momentum
Accelerated convergence | Momentum, Nesterov Accelerated Gradient
Sparse data handling | Adagrad
Adaptive learning rates | Adagrad, RMSprop, Adam
Non-stationary objectives | RMSprop, Adam
Robust step size selection | Line Search, Trust Region
General-purpose deep learning | Adam, Mini-batch with Momentum

Self-Check Questions

  1. Which two methods both use adaptive per-parameter learning rates but handle gradient accumulation differently? What problem does the second method solve that the first creates?

  2. If you're optimizing a model with highly sparse features (most gradients are zero most of the time), which method would you choose and why?

  3. Compare and contrast Momentum and Nesterov Accelerated Gradient. Both accumulate velocityโ€”what specific modification gives NAG its theoretical advantage?

  4. An FRQ describes a non-convex loss landscape where the optimizer keeps getting stuck. Which property of SGD might actually help here, and which momentum-based method could accelerate escape from shallow minima?

  5. You're given a problem where the learning rate decays too quickly and training stalls. Which method is likely being used, and what modification (embodied in which alternative method) would fix this?