📈 Nonlinear Optimization

Key Concepts of Gradient Descent Methods

Why This Matters

Gradient descent is the workhorse of nonlinear optimization: it's how we actually solve the problems you've been setting up all semester. When you're minimizing a loss function, training a model, or finding optimal parameters, you're almost certainly using some variant of gradient descent. The exam will test whether you understand not just what these methods do, but why certain variants work better in specific situations: convergence speed, stability, computational cost, and adaptivity are the principles that distinguish one method from another.

Here's the key insight: these methods didn't evolve randomly. Each new technique was designed to fix a specific weakness in earlier approaches. Basic gradient descent is slow? Add momentum. Learning rates too rigid? Make them adaptive. Understanding this evolutionary logic means you can predict which method fits which problem, and that's exactly what FRQs will ask you to do. Don't just memorize update rules; know what problem each method solves and when it fails.


Foundational Methods: The Building Blocks

These are the core algorithms that establish the fundamental trade-off in gradient descent: accuracy of gradient estimates versus computational efficiency. Every advanced method builds on these foundations.

Basic Gradient Descent

  • Full-batch gradient computation: uses the entire dataset to compute the exact gradient at each iteration, giving the most accurate descent direction
  • Deterministic updates ensure consistent, predictable behavior but require $O(n)$ operations per step, where $n$ is the dataset size
  • Slow convergence on high-dimensional or ill-conditioned problems makes this impractical for large-scale optimization
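
To make the per-step cost concrete, here is a minimal NumPy sketch of the full-batch update on a toy least-squares objective; the data, learning rate, and iteration count are illustrative choices, not prescribed values.

```python
import numpy as np

# Toy least-squares objective: f(w) = (1/2n) * ||Xw - y||^2  (illustrative data)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

def full_gradient(w):
    # Uses every row of X: O(n) work per step, but the exact descent direction.
    return X.T @ (X @ w - y) / len(y)

w, eta = np.zeros(5), 0.1          # eta is a hypothetical fixed learning rate
for t in range(500):
    w -= eta * full_gradient(w)    # deterministic: identical run every time
```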

Stochastic Gradient Descent (SGD)

  • Single-sample gradient estimates: updates parameters using just one randomly selected data point, enabling $O(1)$ cost per iteration
  • Inherent noise in updates can help escape shallow local minima and saddle points, acting as implicit regularization
  • Oscillatory behavior near the optimum requires learning rate schedules or decay to achieve convergence
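
A matching sketch of the stochastic variant on the same toy data: each step uses one randomly drawn row, and a decaying step size (an illustrative $1/\sqrt{t}$ schedule) tames the oscillation near the optimum.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)   # same toy data as above

w, eta0 = np.zeros(5), 0.2
for t in range(1, 5001):
    i = rng.integers(len(y))             # one random sample: O(1) work per step
    g = X[i] * (X[i] @ w - y[i])         # noisy, unbiased gradient estimate
    w -= (eta0 / np.sqrt(t)) * g         # decaying step size for convergence
```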

Mini-batch Gradient Descent

  • Batch size $b$ balances the variance-computation trade-off, typically ranging from 32 to 512 in practice
  • Vectorized operations exploit GPU parallelism, making this faster than pure SGD despite using more data per update
  • Reduced variance compared to SGD leads to smoother convergence while maintaining computational efficiency
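
A sketch of the mini-batch compromise, again assuming the same toy data; the batch size of 32 is just one common choice from the 32 to 512 range mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 5)), rng.normal(size=200)

w, eta, b = np.zeros(5), 0.1, 32
for t in range(2000):
    idx = rng.choice(len(y), size=b, replace=False)    # random mini-batch
    Xb, yb = X[idx], y[idx]
    w -= eta * (Xb.T @ (Xb @ w - yb) / b)              # vectorized batch gradient
```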

Compare: Basic Gradient Descent vs. SGD. Both follow the negative gradient, but SGD trades accuracy for speed by using noisy estimates. If an FRQ asks about scalability to large datasets, SGD (or mini-batch) is your answer; if it asks about guaranteed descent, basic gradient descent wins.


Momentum-Based Methods: Accelerating Convergence

These methods address a critical weakness of vanilla gradient descent: slow progress in directions with small but consistent gradients, and oscillation in directions with large but inconsistent gradients. The key mechanism is accumulating velocity over time.

Momentum

  • Velocity accumulation adds a fraction $\beta$ (typically 0.9) of the previous update to the current one: $v_t = \beta v_{t-1} + \nabla f(x_t)$, followed by the step $x_{t+1} = x_t - \eta v_t$
  • Dampens oscillations in high-curvature directions while accelerating movement along consistent gradient directions
  • Overshooting risk means the algorithm may move past the minimum, though this often self-corrects
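
A minimal sketch of this velocity update on an ill-conditioned quadratic; the diagonal matrix and hyperparameters below are illustrative choices.

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x @ A @ x: plain gradient descent
# crawls along the flat axis and oscillates across the steep one.
A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.03, 0.9
for t in range(300):
    v = beta * v + grad(x)   # v_t = beta * v_{t-1} + grad f(x_t)
    x -= eta * v             # step along the accumulated velocity
```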

Nesterov Accelerated Gradient (NAG)

  • Look-ahead gradient: computes the gradient at $x_t - \beta v_{t-1}$ rather than at $x_t$, anticipating where momentum will carry the iterate
  • Corrective behavior allows the method to "brake" before overshooting, improving convergence near the optimum
  • Provably faster convergence rate of $O(1/t^2)$ versus $O(1/t)$ for standard gradient descent on convex problems
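
The same toy quadratic with the look-ahead evaluation; in this sketch the learning rate is folded into the look-ahead step, which is one of several equivalent ways to write NAG.

```python
import numpy as np

A = np.diag([1.0, 25.0])        # same illustrative quadratic as the momentum sketch
grad = lambda x: A @ x

x, v = np.array([1.0, 1.0]), np.zeros(2)
eta, beta = 0.03, 0.9
for t in range(300):
    lookahead = x - eta * beta * v   # where the momentum step alone would land
    v = beta * v + grad(lookahead)   # gradient at the anticipated point, not at x
    x -= eta * v
```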

Compare: Momentum vs. Nesterov. Both accumulate velocity, but Nesterov evaluates the gradient at the anticipated position rather than the current one. This look-ahead mechanism gives NAG better theoretical guarantees and often faster practical convergence.


Adaptive Learning Rate Methods: Per-Parameter Tuning

The core insight here is that different parameters may need different learning rates. Parameters associated with rare features need larger updates; those with frequent features need smaller ones. These methods automatically adjust learning rates based on gradient history.

Adagrad

  • Accumulated squared gradients $G_t = \sum_{\tau=1}^{t} g_\tau^2$ scale each parameter's learning rate inversely: $\eta / \sqrt{G_t + \epsilon}$
  • Automatic adaptation gives larger effective learning rates to parameters tied to infrequent features, ideal for sparse data and NLP tasks
  • Monotonically decreasing learning rates can shrink to near-zero prematurely, halting learning before convergence
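
A sketch of the per-parameter scaling on an illustrative quadratic; note that the accumulator G only grows, which is exactly the weakness named in the last bullet.

```python
import numpy as np

A = np.diag([1.0, 25.0])             # two parameters with very different curvature
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
G = np.zeros(2)                      # per-parameter sum of squared gradients
eta, eps = 0.5, 1e-8
for t in range(500):
    g = grad(x)
    G += g ** 2                          # only ever grows
    x -= eta / np.sqrt(G + eps) * g      # effective step shrinks as G accumulates
```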

RMSprop

  • Exponential moving average replaces Adagrad's sum with $G_t = \gamma G_{t-1} + (1-\gamma) g_t^2$, typically with $\gamma = 0.9$
  • Prevents learning rate collapse by "forgetting" old gradients, maintaining useful learning rates throughout training
  • Non-stationary objectives are handled well because recent gradients matter more than ancient history
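
The same loop as the Adagrad sketch with the sum replaced by an exponential moving average, the one-line change that keeps the effective learning rate from collapsing; hyperparameters remain illustrative.

```python
import numpy as np

A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
G = np.zeros(2)
eta, gamma, eps = 0.05, 0.9, 1e-8
for t in range(500):
    g = grad(x)
    G = gamma * G + (1 - gamma) * g ** 2   # recent gradients dominate; old ones decay
    x -= eta / np.sqrt(G + eps) * g
```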

Adam

  • Dual moment estimation tracks both the first moment (mean) $m_t$ and the second moment (uncentered variance) $v_t$ of the gradients
  • Bias correction divides each moment by $(1 - \beta_1^t)$ or $(1 - \beta_2^t)$ to account for initialization at zero, crucial in early iterations
  • Default optimizer in deep learning due to robust performance across problem types with minimal hyperparameter tuning
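
A sketch combining both moment estimates with bias correction; the hyperparameters are the commonly quoted defaults, used here purely for illustration on the same toy quadratic.

```python
import numpy as np

A = np.diag([1.0, 25.0])
grad = lambda x: A @ x

x = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(x)
    m = beta1 * m + (1 - beta1) * g          # first moment (running mean)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)             # bias correction: moments start at zero
    v_hat = v / (1 - beta2 ** t)
    x -= eta * m_hat / (np.sqrt(v_hat) + eps)
```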

Compare: Adagrad vs. RMSprop vs. Adam. All adapt learning rates per-parameter, but Adagrad accumulates forever (problematic for long training), RMSprop uses exponential decay (fixes the accumulation issue), and Adam adds momentum on top of RMSprop's adaptation. For exam purposes: Adagrad for sparse data, RMSprop for non-stationary problems, Adam as a reliable default.


Step Size Selection: Beyond Fixed Learning Rates

Rather than choosing a fixed learning rate $\eta$, these methods compute the optimal (or near-optimal) step size at each iteration. The trade-off is additional computation per iteration versus faster overall convergence.

Line Search Methods

  • Optimal step size along the descent direction $d_t$ solves $\min_{\alpha > 0} f(x_t + \alpha d_t)$, a one-dimensional subproblem
  • Backtracking (Armijo) efficiently finds acceptable step sizes by starting large and reducing until sufficient decrease is achieved
  • Wolfe conditions add a curvature requirement on top of sufficient decrease, guaranteeing convergence for quasi-Newton methods
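
A sketch of backtracking under the Armijo condition; the constants alpha0, rho, and c are conventional illustrative values, and the Wolfe curvature check is omitted to keep the example short.

```python
import numpy as np

def backtracking(f, grad_f, x, d, alpha0=1.0, rho=0.5, c=1e-4):
    """Shrink alpha until f(x + alpha*d) <= f(x) + c*alpha*grad_f(x)@d holds."""
    alpha, fx, slope = alpha0, f(x), grad_f(x) @ d
    while f(x + alpha * d) > fx + c * alpha * slope:
        alpha *= rho                 # sufficient decrease not met: halve the step
    return alpha

# One steepest-descent step on a simple quadratic
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([3.0, -2.0])
d = -grad_f(x)                       # descent direction
x = x + backtracking(f, grad_f, x, d) * d
```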

Trust Region Methods

  • Region-constrained optimization solves $\min_{\|p\| \leq \Delta} m_k(p)$, where $m_k$ is a local quadratic model of the objective
  • Adaptive region size $\Delta$ grows when the model predicts well and shrinks when predictions are poor
  • Global convergence guarantees even for non-convex problems, making these methods robust for challenging optimization landscapes
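
A minimal trust-region sketch using the Cauchy point as an approximate subproblem solution and the standard ratio test to adapt $\Delta$; real implementations use stronger subproblem solvers (e.g., dogleg), and every constant below is illustrative.

```python
import numpy as np

def cauchy_point(g, B, delta):
    # Minimize the quadratic model m(p) = g @ p + 0.5 * p @ B @ p
    # along -g, subject to ||p|| <= delta.
    gBg = g @ B @ g
    tau = 1.0 if gBg <= 0 else min(np.linalg.norm(g) ** 3 / (delta * gBg), 1.0)
    return -tau * (delta / np.linalg.norm(g)) * g

def trust_region_step(f, g, B, x, delta):
    p = cauchy_point(g(x), B(x), delta)
    actual = f(x) - f(x + p)                         # actual reduction
    predicted = -(g(x) @ p + 0.5 * p @ B(x) @ p)     # reduction the model promised
    ratio = actual / predicted
    if ratio < 0.25:
        delta *= 0.25                                # model was poor: shrink the region
    elif ratio > 0.75:
        delta *= 2.0                                 # model was good: expand the region
    return (x + p if ratio > 0.1 else x), delta      # accept only if the step helped

# Example on a simple quadratic bowl
f = lambda x: 0.5 * x @ x
g = lambda x: x
B = lambda x: np.eye(len(x))
x, delta = np.array([3.0, -2.0]), 1.0
for _ in range(15):
    if np.linalg.norm(g(x)) < 1e-10:   # stop once (numerically) at the minimum
        break
    x, delta = trust_region_step(f, g, B, x, delta)
```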

Compare: Line Search vs. Trust Regionโ€”line search picks a direction first then finds the step size, while trust region picks a region first then finds the best step within it. Trust region methods are generally more robust for ill-conditioned problems, but line search is simpler to implement and often sufficient for well-behaved objectives.


Quick Reference Table

Concept | Best Examples
Computational efficiency | SGD, Mini-batch Gradient Descent
Escaping local minima | SGD (via noise), Momentum
Accelerated convergence | Momentum, Nesterov Accelerated Gradient
Sparse data handling | Adagrad
Adaptive learning rates | Adagrad, RMSprop, Adam
Non-stationary objectives | RMSprop, Adam
Robust step size selection | Line Search, Trust Region
General-purpose deep learning | Adam, Mini-batch with Momentum

Self-Check Questions

  1. Which two methods both use adaptive per-parameter learning rates but handle gradient accumulation differently? What problem does the second method solve that the first creates?

  2. If you're optimizing a model with highly sparse features (most gradients are zero most of the time), which method would you choose and why?

  3. Compare and contrast Momentum and Nesterov Accelerated Gradient. Both accumulate velocityโ€”what specific modification gives NAG its theoretical advantage?

  4. An FRQ describes a non-convex loss landscape where the optimizer keeps getting stuck. Which property of SGD might actually help here, and which momentum-based method could accelerate escape from shallow minima?

  5. You're given a problem where the learning rate decays too quickly and training stalls. Which method is likely being used, and what modification (embodied in which alternative method) would fix this?