Optimization sits at the heart of scientific computing—whether you're training a neural network, fitting a model to experimental data, or solving engineering design problems, you're asking the same fundamental question: how do we find the best solution? The algorithms in this guide represent different strategies for navigating complex mathematical landscapes, each with trade-offs between speed, accuracy, and computational cost. Understanding these trade-offs is what separates someone who can run code from someone who can solve real problems.
You're being tested on more than just definitions here. Exam questions will ask you to select the right algorithm for a given problem, explain why one method converges faster than another, or analyze when an algorithm might fail. The key concepts are gradient-based vs. derivative-free methods, local vs. global optimization, and computational complexity trade-offs. Don't just memorize what each algorithm does—know what makes it the right (or wrong) choice for a specific scenario.
These algorithms use derivative information to determine which direction leads "downhill" toward a minimum. The gradient points in the direction of steepest ascent, so we step in the opposite direction. The key differences come down to how much derivative information they use and how they handle step sizes.
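To make "step opposite the gradient" concrete, here is a minimal gradient descent sketch; the quadratic test function, learning rate, and stopping tolerance are illustrative choices, not from any particular library.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=1000):
    """Minimal gradient descent: step opposite the gradient until it vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # gradient ~ 0 means a stationary point
            break
        x = x - lr * g               # move downhill by a fixed step size
    return x

# Example: minimize f(x, y) = (x - 3)^2 + 2*(y + 1)^2, so grad = (2(x-3), 4(y+1))
grad_f = lambda x: np.array([2 * (x[0] - 3), 4 * (x[1] + 1)])
print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # converges near (3, -1)
```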
Compare: Gradient Descent vs. Newton's Method—both follow derivatives downhill, but Newton's method uses curvature information for faster convergence at the cost of computing the Hessian. If an FRQ asks about trade-offs between convergence speed and computational cost, this is your go-to comparison.
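A one-iteration sketch of that trade-off, assuming a small quadratic test problem of our own (so the gradient and Hessian are known in closed form): gradient descent takes a cheap fixed-size step, while Newton's method pays for a linear solve with the Hessian and, on a quadratic, lands on the minimizer in a single step.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 * x^T A x - b^T x
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b   # gradient of the quadratic
hess = lambda x: A           # Hessian is constant for a quadratic

x = np.zeros(2)
gd_step = x - 0.1 * grad(x)                           # cheap, small progress
newton_step = x - np.linalg.solve(hess(x), grad(x))   # one O(n^3) solve, exact here
print(gd_step, newton_step)  # newton_step equals the true minimizer A^{-1} b
```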
Compare: Newton's Method vs. BFGS—both aim for fast convergence using curvature information, but BFGS approximates the Hessian rather than computing it exactly. Choose Newton when the problem dimension is small and you need maximum speed; choose BFGS when the Hessian is too expensive to compute.
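In practice you rarely hand-code BFGS. A minimal sketch using SciPy's built-in implementation on the Rosenbrock test function (the test function and starting point are our choices): BFGS builds its Hessian approximation from successive gradient differences, so we supply only first derivatives.

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Only the gradient is provided; BFGS never forms the true Hessian.
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="BFGS")
print(result.x)  # approaches the known minimizer (1, 1, 1, 1, 1)
```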
Gradient-based methods can get stuck in local minima. These algorithms sacrifice guaranteed convergence for the ability to explore broadly and find global optima in rugged landscapes. They're essential when you don't know if your objective function is convex.
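As one concrete example of trading guaranteed convergence for exploration, here is a bare-bones simulated annealing sketch; the cooling schedule, proposal scale, and multimodal test function are illustrative assumptions, not a canonical recipe.

```python
import numpy as np

def simulated_annealing(f, x0, temp=1.0, cooling=0.999, scale=0.5, steps=5000, seed=0):
    """Accept uphill moves with probability exp(-delta/T) to escape local minima."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x, fx
    for _ in range(steps):
        cand = x + rng.normal(scale=scale, size=x.shape)  # random neighbor
        fc = f(cand)
        if fc < fx or rng.random() < np.exp(-(fc - fx) / temp):
            x, fx = cand, fc                              # accept (possibly uphill)
            if fx < best_f:
                best_x, best_f = x, fx
        temp *= cooling                                   # cool: fewer uphill moves over time
    return best_x, best_f

# Rastrigin-style function: many local minima, global minimum at the origin
f = lambda x: 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))
print(simulated_annealing(f, x0=np.array([3.0, -2.5])))  # stochastic; often lands near (0, 0)
```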
Compare: Genetic Algorithms vs. Particle Swarm—both are population-based metaheuristics that don't require gradients, but GAs use discrete operations (crossover, mutation) while PSO uses continuous velocity updates. GAs excel in discrete/combinatorial spaces; PSO is often better for continuous optimization.
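To see what "continuous velocity updates" means in code, here is a minimal PSO sketch; the inertia and attraction coefficients (w, c1, c2) are common textbook defaults, and the test function and search range are our own choices.

```python
import numpy as np

def pso(f, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Each particle's velocity blends inertia, its own best, and the swarm's best."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))        # positions
    v = np.zeros_like(x)                              # velocities
    pbest = x.copy()                                  # per-particle best positions
    pbest_f = np.apply_along_axis(f, 1, x)
    g = pbest[np.argmin(pbest_f)].copy()              # swarm-wide best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)  # continuous velocity update
        x = x + v
        fx = np.apply_along_axis(f, 1, x)
        improved = fx < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], fx[improved]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, f(g)

f = lambda x: 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))  # multimodal test
print(pso(f, dim=2))  # stochastic; often lands near the global minimum at (0, 0)
```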
Many real problems have constraints—budgets, physical limits, or feasibility requirements. These methods find optima while respecting boundaries, either by staying strictly inside the feasible region or by carefully navigating its structure.
Compare: Interior Point vs. Trust Region—both handle constrained optimization but with different philosophies. Interior point methods work globally by staying inside constraints; trust region methods work locally by controlling how far each step can go. Interior point dominates for linear/convex programs; trust region excels for nonlinear problems with complex curvature.
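A sketch of constrained optimization in SciPy, whose trust-constr method internally combines trust-region and interior-point machinery; the toy objective and constraint below are our own illustration.

```python
import numpy as np
from scipy.optimize import minimize, LinearConstraint

# Illustrative problem: minimize (x - 2)^2 + (y - 1)^2 subject to x + y <= 2, x, y >= 0.
obj = lambda x: (x[0] - 2) ** 2 + (x[1] - 1) ** 2
con = LinearConstraint([[1, 1]], -np.inf, 2)  # encodes x + y <= 2

result = minimize(obj, x0=[0.5, 0.5], method="trust-constr",
                  constraints=[con], bounds=[(0, None), (0, None)])
print(result.x)  # optimum sits on the constraint boundary, near (1.5, 0.5)
```

The unconstrained minimizer (2, 1) violates x + y <= 2, so the solver must settle for the closest feasible point on the boundary, which is exactly the situation these methods are built for.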
| Concept | Best Examples |
|---|---|
| First-order gradient methods | Gradient Descent, SGD, Conjugate Gradient |
| Second-order / curvature methods | Newton's Method, BFGS, Trust Region |
| Large-scale / memory-limited | Conjugate Gradient, BFGS, SGD |
| Global optimization | Simulated Annealing, Genetic Algorithms, Particle Swarm |
| Population-based search | Genetic Algorithms, Particle Swarm Optimization |
| Constrained optimization | Interior Point Methods, Trust Region Methods |
| Deep learning training | SGD (and variants like Adam, RMSprop) |
| Black-box / derivative-free | Genetic Algorithms, Particle Swarm, Simulated Annealing |
Which two algorithms both use second-order derivative information (or approximations of it), and what's the key computational trade-off between them?
You're optimizing a function with many local minima and no access to gradient information. Which category of algorithms should you consider, and name two specific methods?
Compare and contrast SGD and full-batch Gradient Descent: when would the "noise" in SGD actually be an advantage rather than a drawback?
An FRQ describes a linear programming problem with 10,000 variables and 50,000 constraints. Which algorithm would likely outperform the simplex method, and why?
Both Conjugate Gradient and BFGS avoid computing the full Hessian matrix. What problem type is Conjugate Gradient specifically designed for, and how does this differ from BFGS's typical use case?