
💻 Applications of Scientific Computing

Common Optimization Algorithms


Why This Matters

Optimization sits at the heart of scientific computing—whether you're training a neural network, fitting a model to experimental data, or solving engineering design problems, you're asking the same fundamental question: how do we find the best solution? The algorithms in this guide represent different strategies for navigating complex mathematical landscapes, each with trade-offs between speed, accuracy, and computational cost. Understanding these trade-offs is what separates someone who can run code from someone who can solve real problems.

You're being tested on more than just definitions here. Exam questions will ask you to select the right algorithm for a given problem, explain why one method converges faster than another, or analyze when an algorithm might fail. The key concepts are gradient-based vs. derivative-free methods, local vs. global optimization, and computational complexity trade-offs. Don't just memorize what each algorithm does—know what makes it the right (or wrong) choice for a specific scenario.


Gradient-Based Methods: Following the Slope

These algorithms use derivative information to determine which direction leads "downhill" toward a minimum. The gradient points in the direction of steepest ascent, so we move in the opposite direction. The key differences come down to how much derivative information they use and how they handle step sizes.

Gradient Descent

  • Updates parameters iteratively by moving in the direction of $-\nabla f(x)$, the negative gradient of the objective function (sketched below)
  • Learning rate $\alpha$ controls step size—too large causes overshooting, too small means painfully slow convergence
  • Best for convex functions where there's a single global minimum; foundational to nearly all machine learning optimization
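
Below is a minimal NumPy sketch of the update loop. The step size, stopping rule, and toy objective are our own illustrative choices, not part of any standard library:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Minimize a function given its gradient, starting from x0."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop once the slope is essentially flat
            break
        x = x - alpha * g            # step opposite the gradient (downhill)
    return x

# Toy example: f(x) = (x - 3)^2 has gradient 2(x - 3) and minimum at x = 3
print(gradient_descent(lambda x: 2 * (x - 3), x0=[0.0]))  # ≈ [3.]
```

Try `alpha=1.1` on this toy problem to see the overshooting failure mode from the second bullet: each step lands farther from the minimum than the last.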

Stochastic Gradient Descent (SGD)

  • Computes gradients using random subsets of data rather than the full dataset, dramatically reducing computation per iteration (sketched below)
  • Noise from sampling can actually help escape shallow local minima and saddle points—a feature, not a bug
  • Dominates deep learning because full-batch gradient descent is computationally infeasible for millions of training examples
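
A toy sketch of mini-batch SGD for linear regression. The synthetic data, batch size, and learning rate are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise
X = rng.normal(size=(1000, 1))
y = 2 * X[:, 0] + 1 + 0.1 * rng.normal(size=1000)

w, b, alpha, batch_size = 0.0, 0.0, 0.05, 32
for epoch in range(20):
    order = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]  # random subset, not the full set
        err = w * X[batch, 0] + b - y[batch]
        w -= alpha * 2 * np.mean(err * X[batch, 0])  # noisy gradient estimate
        b -= alpha * 2 * np.mean(err)
print(w, b)  # ≈ 2 and 1
```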

Newton's Method

  • Uses second-order information via the Hessian matrix $H$ to capture curvature: $x_{n+1} = x_n - H^{-1}\nabla f(x_n)$ (sketched below)
  • Converges quadratically near the optimum—much faster than gradient descent when you're close to the solution
  • Computationally expensive because computing and inverting the $n \times n$ Hessian scales as $O(n^3)$
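
A minimal sketch of one Newton step. In practice you solve the linear system $Hd = \nabla f(x)$ rather than forming $H^{-1}$ explicitly; the quadratic test function is our own:

```python
import numpy as np

def newton_step(grad, hess, x):
    d = np.linalg.solve(hess(x), grad(x))  # solve H d = grad; avoid explicit inverse
    return x - d

# f(x, y) = x^2 + 3y^2 is exactly quadratic, so one step lands on the minimum
grad = lambda x: np.array([2 * x[0], 6 * x[1]])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 6.0]])
print(newton_step(grad, hess, np.array([5.0, -4.0])))  # [0. 0.]
```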

Compare: Gradient Descent vs. Newton's Method—both follow derivatives downhill, but Newton's method uses curvature information for faster convergence at the cost of computing the Hessian. If an FRQ asks about trade-offs between convergence speed and computational cost, this is your go-to comparison.

Conjugate Gradient Method

  • Designed for large sparse systems where storing the full Hessian is impossible—common in physics simulations and PDE solvers
  • Generates conjugate directions that are orthogonal with respect to the matrix $A$, guaranteeing convergence in at most $n$ steps for quadratic problems (sketched below)
  • Memory efficient because it only stores a few vectors rather than an $n \times n$ matrix
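
SciPy ships a matrix-free CG solver for exactly this setting. A short usage sketch on a sparse 1-D Laplacian system (the problem size is arbitrary):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import cg

n = 1000
# Tridiagonal SPD matrix (a 1-D Laplacian), stored sparsely: ~3n nonzeros, not n^2
A = diags([-1, 2, -1], offsets=[-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

x, info = cg(A, b)                      # only a handful of vectors kept in memory
print(info, np.linalg.norm(A @ x - b))  # info == 0 signals convergence
```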

Quasi-Newton Methods (BFGS)

  • Approximates the Hessian iteratively from gradient information, avoiding expensive second-derivative calculations (see the usage sketch below)
  • BFGS (Broyden-Fletcher-Goldfarb-Shanno) is the gold standard, updating a positive-definite approximation each iteration
  • Balances speed and cost—near-Newton convergence rates without Newton's computational burden
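
In practice you rarely hand-roll BFGS; SciPy's implementation needs only the objective (a gradient is optional). A usage sketch on the classic Rosenbrock test function:

```python
import numpy as np
from scipy.optimize import minimize, rosen

# BFGS builds its Hessian approximation internally from successive gradients
result = minimize(rosen, x0=np.zeros(5), method="BFGS")
print(result.x)    # ≈ [1, 1, 1, 1, 1], the known Rosenbrock minimum
print(result.nit)  # iteration count
```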

Compare: Newton's Method vs. BFGS—both aim for fast convergence using curvature information, but BFGS approximates the Hessian rather than computing it exactly. Choose Newton when $n$ is small and you need maximum speed; choose BFGS when the Hessian is too expensive to compute.


Global Optimization: Escaping Local Traps

Gradient-based methods can get stuck in local minima. These algorithms sacrifice guaranteed convergence for the ability to explore broadly and find global optima in rugged landscapes. They're essential when you don't know if your objective function is convex.

Simulated Annealing

  • Allows uphill moves probabilistically, with acceptance probability $P = e^{-\Delta E / T}$ controlled by a "temperature" parameter (sketched below)
  • Temperature decreases over time—early exploration gives way to late-stage refinement as the algorithm "cools"
  • Effective for combinatorial problems like traveling salesman, where the landscape has many local minima
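
A bare-bones sketch of the accept/reject loop. The objective, proposal width, and cooling schedule are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                              # rugged 1-D objective with many local minima
    return x**2 + 10 * np.sin(3 * x)

x, T = 4.0, 5.0
for _ in range(5000):
    x_new = x + rng.normal(scale=0.5)  # propose a random neighbor
    dE = f(x_new) - f(x)
    # Always accept downhill moves; accept uphill moves with probability e^(-dE/T)
    if dE < 0 or rng.random() < np.exp(-dE / T):
        x = x_new
    T *= 0.999                         # geometric cooling
print(x, f(x))
```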

Genetic Algorithms

  • Evolves a population of solutions using selection, crossover, and mutation operations inspired by natural selection (sketched below)
  • Crossover combines good solutions to explore promising regions; mutation maintains diversity to avoid premature convergence
  • Ideal for black-box optimization where you can evaluate fitness but can't compute gradients
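
A toy generational loop showing all three operators. The population size, uniform-crossover scheme, and mutation scale are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(pop):                    # maximize -||x||^2: the optimum is the origin
    return -(pop**2).sum(axis=1)

pop = rng.uniform(-5, 5, size=(50, 3))
for _ in range(100):
    parents = pop[np.argsort(fitness(pop))[-25:]]        # selection: fittest half
    pairs = parents[rng.integers(0, 25, size=(50, 2))]   # random mating pairs
    mask = rng.random((50, 3)) < 0.5
    pop = np.where(mask, pairs[:, 0], pairs[:, 1])       # uniform crossover
    pop = pop + rng.normal(scale=0.1, size=pop.shape)    # mutation keeps diversity
print(pop[np.argmax(fitness(pop))])  # ≈ [0, 0, 0]
```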

Particle Swarm Optimization

  • Models collective behavior where each particle adjusts its trajectory based on its personal best and the swarm's global best
  • Update rule combines inertia, cognitive, and social terms—balancing exploration of new regions with exploitation of known good areas (sketched below)
  • Handles multi-modal functions well and requires no gradient information, making it popular for engineering design
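
A compact sketch of the velocity update. The coefficient values are common textbook defaults, not mandated by the method:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                                  # sphere function; minimum at the origin
    return (x**2).sum(axis=1)

n, dim = 30, 2
x = rng.uniform(-5, 5, (n, dim))           # positions
v = np.zeros((n, dim))                     # velocities
pbest, pbest_val = x.copy(), f(x)          # each particle's personal best
gbest = x[np.argmin(pbest_val)]            # swarm's global best

w, c1, c2 = 0.7, 1.5, 1.5                  # inertia, cognitive, social weights
for _ in range(200):
    r1, r2 = rng.random((n, dim)), rng.random((n, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    vals = f(x)
    better = vals < pbest_val
    pbest[better], pbest_val[better] = x[better], vals[better]
    gbest = pbest[np.argmin(pbest_val)]
print(gbest)                               # ≈ [0, 0]
```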

Compare: Genetic Algorithms vs. Particle Swarm—both are population-based metaheuristics that don't require gradients, but GAs use discrete operations (crossover, mutation) while PSO uses continuous velocity updates. GAs excel in discrete/combinatorial spaces; PSO is often better for continuous optimization.


Constrained Optimization: Staying in Bounds

Many real problems have constraints—budgets, physical limits, or feasibility requirements. These methods find optima while respecting boundaries, either by staying strictly inside the feasible region or by carefully navigating its structure.

Interior Point Methods

  • Traverses the interior of the feasible region using barrier functions that penalize approaching constraint boundaries
  • Polynomial-time complexity for linear programs—theoretically and practically superior to simplex for large-scale problems
  • Handles inequality constraints naturally by converting them to logarithmic barrier terms in the objective (sketched below)
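
A sketch of the log-barrier idea on a one-variable toy problem of our own. The crude grid-search inner solver is purely for illustration; real interior point solvers take Newton steps on the barrier objective:

```python
import numpy as np

# minimize (x - 2)^2 subject to x <= 1; the constrained optimum sits at x = 1.
# Barrier reformulation: minimize (x - 2)^2 - mu * log(1 - x), driving mu -> 0.
def barrier_objective(x, mu):
    return (x - 2)**2 - mu * np.log(1 - x)  # blows up as x approaches the boundary

grid = np.linspace(-3, 1 - 1e-9, 200_001)   # stay strictly inside the interior
for mu in [1.0, 0.1, 0.01, 0.001]:          # shrink the barrier weight
    x_star = grid[np.argmin(barrier_objective(grid, mu))]
    print(mu, x_star)                        # iterates approach the boundary x = 1
```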

Trust Region Methods

  • Defines a local region where a quadratic model of the objective is trusted to be accurate (see the usage sketch below)
  • Adapts region size dynamically—shrinks when the model predicts poorly, expands when predictions are accurate
  • More robust than line search methods for ill-conditioned problems where the objective function behaves unpredictably
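
SciPy exposes a trust-region algorithm through minimize. A usage sketch on the Rosenbrock function (unconstrained here, though the method also accepts constraints):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# 'trust-constr' fits a local quadratic model and grows/shrinks the trust radius
# depending on how well the model's predicted decrease matches reality
result = minimize(rosen, x0=np.zeros(5), jac=rosen_der, method="trust-constr")
print(result.x)  # ≈ [1, 1, 1, 1, 1]
```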

Compare: Interior Point vs. Trust Region—both handle constrained optimization but with different philosophies. Interior point methods work globally by staying inside constraints; trust region methods work locally by controlling how far each step can go. Interior point dominates for linear/convex programs; trust region excels for nonlinear problems with complex curvature.


Quick Reference Table

Concept | Best Examples
--- | ---
First-order gradient methods | Gradient Descent, SGD, Conjugate Gradient
Second-order / curvature methods | Newton's Method, BFGS, Trust Region
Large-scale / memory-limited | Conjugate Gradient, BFGS, SGD
Global optimization | Simulated Annealing, Genetic Algorithms, Particle Swarm
Population-based search | Genetic Algorithms, Particle Swarm Optimization
Constrained optimization | Interior Point Methods, Trust Region Methods
Deep learning training | SGD (and variants like Adam, RMSprop)
Black-box / derivative-free | Genetic Algorithms, Particle Swarm, Simulated Annealing

Self-Check Questions

  1. Which two algorithms both use second-order derivative information (or approximations of it), and what's the key computational trade-off between them?

  2. You're optimizing a function with many local minima and no access to gradient information. Which category of algorithms should you consider, and name two specific methods?

  3. Compare and contrast SGD and full-batch Gradient Descent: when would the "noise" in SGD actually be an advantage rather than a drawback?

  4. An FRQ describes a linear programming problem with 10,000 variables and 50,000 constraints. Which algorithm would likely outperform the simplex method, and why?

  5. Both Conjugate Gradient and BFGS avoid computing the full Hessian matrix. What problem type is Conjugate Gradient specifically designed for, and how does this differ from BFGS's typical use case?