Fundamentals of Newton's Method
Newton's method is an iterative algorithm for finding roots of nonlinear equations. It exploits the derivative of a function to steer each successive approximation toward a root, often achieving quadratic convergence. For a Numerical Analysis II course, this method is foundational because it connects Taylor series theory, convergence analysis, and practical computation in one clean framework.
Concept and Motivation
The core idea: if you have a guess $x_n$ that's near a root of $f(x) = 0$, you can get a better guess by linearizing $f$ at $x_n$ and solving the resulting linear equation. The derivative tells you which direction to move and how far, which is why the method converges so quickly when it works.
Newton's method is motivated by the need for fast, accurate root-finding in fields like engineering, physics, and economics, where nonlinear equations arise constantly and closed-form solutions rarely exist.
Derivation from Taylor Series
The derivation starts with the Taylor expansion of $f$ about the current iterate $x_n$:

1. Expand around $x_n$:

   $$f(x) = f(x_n) + f'(x_n)(x - x_n) + \frac{f''(x_n)}{2}(x - x_n)^2 + \cdots$$

2. Truncate after the first-order term, giving the tangent line approximation:

   $$f(x) \approx f(x_n) + f'(x_n)(x - x_n)$$

3. Set this linear approximation to zero and solve for $x$:

   $$f(x_n) + f'(x_n)(x - x_n) = 0$$

4. Rearrange to obtain the iteration formula:

   $$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
The higher-order terms you dropped in step 2 are what produce the approximation error at each step. Those terms also explain why the method converges quadratically: the error at step $n+1$ is proportional to the square of the error at step $n$.
Geometric Interpretation
Geometrically, each Newton step draws the tangent line to $f$ at the point $(x_n, f(x_n))$, then takes the x-intercept of that tangent line as the next guess $x_{n+1}$.
- When the function is smooth and the guess is close to the root, the tangent line is a good local model, so the x-intercept lands much closer to the true root.
- The method can fail when the tangent line is nearly horizontal ($f'(x_n) \approx 0$), because the x-intercept shoots off to a distant point.
- For functions with multiple roots, the tangent line may direct the iteration toward an unintended root depending on the starting point.
Sketching a few iterations by hand is one of the best ways to build intuition for when and why the method succeeds or fails.
Algorithm Implementation
Basic Iterative Formula
The algorithm repeatedly applies:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$
In pseudocode:
    Input: f, f', x_0, tol, max_iter
    for n = 0, 1, 2, ..., max_iter:
        x_{n+1} = x_n - f(x_n) / f'(x_n)
        if stopping criterion met:
            return x_{n+1}
    return failure (did not converge)
You need both $f$ and $f'$ as callable functions. The derivative must be available either analytically, via numerical differentiation, or through automatic differentiation.
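The pseudocode above translates directly into Python. This is a minimal sketch under the formula's own assumptions (the function and parameter names are my own, not from any library); it uses a relative step-size test plus an iteration cap:

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton's method for a scalar equation f(x) = 0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        dfx = fprime(x)
        if dfx == 0.0:
            raise ZeroDivisionError(f"f'({x}) is zero; cannot take a Newton step")
        x_new = x - fx / dfx
        # relative step-size test: stop when the update is negligible
        if abs(x_new - x) <= tol * max(1.0, abs(x_new)):
            return x_new
        x = x_new
    raise RuntimeError(f"no convergence within {max_iter} iterations")

# Example: the positive root of x^2 - 2 starting from x0 = 1
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.0)
```

Both `f` and `fprime` are passed in as callables, matching the requirement stated above.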
Stopping Criteria
You need to decide when the iteration is "close enough." Common criteria include:
- Absolute step size: $|x_{n+1} - x_n| < \varepsilon$
- Relative step size: $\dfrac{|x_{n+1} - x_n|}{|x_{n+1}|} < \varepsilon$
- Residual test: $|f(x_{n+1})| < \varepsilon$
No single criterion is foolproof. A small step size doesn't guarantee a small residual (think of a nearly flat function), and a small residual doesn't guarantee you're close to the root (think of a steep function). In practice, combine at least two of these. Always include a maximum iteration count to prevent infinite loops if the method diverges.
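Combining criteria can be as simple as the following sketch (the function name and default tolerances are illustrative, not from any particular library):

```python
def converged(x_new, x_old, f_new, step_tol=1e-12, resid_tol=1e-10):
    """Accept only when BOTH the step test and the residual test pass."""
    small_step = abs(x_new - x_old) <= step_tol * max(1.0, abs(x_new))
    small_residual = abs(f_new) <= resid_tol
    return small_step and small_residual
```

The `max(1.0, abs(x_new))` factor blends the absolute and relative step tests, so the criterion stays meaningful when the root is near zero.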
Initial Guess Selection
The choice of $x_0$ can make or break Newton's method. Strategies include:
- Domain knowledge: Physical constraints or problem context often suggest a reasonable range.
- Graphical analysis: Plot $f$ and pick $x_0$ near where the curve crosses zero.
- Bracketing first: Run a few bisection steps to narrow the interval, then switch to Newton's method for fast finishing.
If the function has multiple roots, you may need to run the method with several different initial guesses to find all of them.
Convergence Analysis
Local vs. Global Convergence
Local convergence means the method converges when $x_0$ is already close to a root. Newton's method has excellent local convergence properties under mild smoothness assumptions.
Global convergence asks whether the method converges from any starting point. Newton's method does not guarantee global convergence. For arbitrary $x_0$, the iterates can oscillate, diverge, or converge to an unintended root. Globalization strategies (line search, trust regions) are modifications that improve the chance of convergence from distant starting points.
Rate of Convergence
The order of convergence $p$ is defined by:

$$\lim_{n \to \infty} \frac{|e_{n+1}|}{|e_n|^p} = C$$

where $e_n = x_n - r$ is the error at iteration $n$ and $C$ is a nonzero constant.
For a simple root (multiplicity 1), Newton's method achieves quadratic convergence ($p = 2$). This means the number of correct digits roughly doubles with each iteration. For example, if you have 3 correct digits at step $n$, you'll have about 6 at step $n+1$ and about 12 at step $n+2$.
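The digit doubling is easy to observe numerically. This sketch uses $f(x) = x^2 - 2$ (my own choice of test function) and tracks the error against $\sqrt{2}$:

```python
import math

# Newton on f(x) = x^2 - 2; the error e_n = |x_n - sqrt(2)| should
# roughly square at each step, i.e. correct digits roughly double.
x = 1.0
errors = []
for _ in range(5):
    errors.append(abs(x - math.sqrt(2)))
    x = x - (x * x - 2.0) / (2.0 * x)
# The ratio errors[n+1] / errors[n]**2 settles near a constant,
# here |f''(r) / (2 f'(r))| = 1 / (2 * sqrt(2)).
```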
For roots of multiplicity $m > 1$, convergence degrades to linear ($p = 1$) unless you use a modified formula (see Limitations section).
Conditions for Convergence
Sufficient conditions for Newton's method to converge to a root $r$:
- $f$ is continuously differentiable in a neighborhood of $r$
- $f'(r) \neq 0$ (the root is simple)
- $x_0$ is sufficiently close to $r$
- $f'$ satisfies a Lipschitz condition near $r$, which bounds how fast the derivative changes
The Kantorovich theorem provides a more rigorous, verifiable set of conditions: given bounds on $1/|f'(x_0)|$, on the size of the first Newton step $|f(x_0)/f'(x_0)|$, and on the Lipschitz constant of $f'$, it guarantees convergence and gives an error bound without requiring you to already know where the root is.

Variations and Extensions
Secant Method
When computing $f'$ is expensive or the derivative isn't available in closed form, the secant method replaces the exact derivative with a finite difference approximation:

$$f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}, \qquad x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}$$

This requires two initial points ($x_0$ and $x_1$) but only one function evaluation per step (versus one $f$ and one $f'$ evaluation for Newton). The convergence order is superlinear at $p = (1 + \sqrt{5})/2 \approx 1.618$ (the golden ratio), which is slower than quadratic but often faster in wall-clock time because each step is cheaper.
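A minimal sketch of the secant iteration (names are my own), reusing the square-root example:

```python
def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Secant method: Newton with f' replaced by a finite difference."""
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if f1 == f0:
            raise ZeroDivisionError("secant slope is zero")
        x2 = x1 - f1 * (x1 - x0) / (f1 - f0)
        if abs(x2 - x1) <= tol * max(1.0, abs(x2)):
            return x2
        # shift the two-point window forward by one step
        x0, f0, x1, f1 = x1, f1, x2, f(x2)
    raise RuntimeError("no convergence")

# Root of x^2 - 2 bracketed loosely by the two starting points 1 and 2
root = secant(lambda x: x * x - 2.0, 1.0, 2.0)
```

Note that each pass evaluates `f` once; the previous value is carried over rather than recomputed.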
Quasi-Newton Methods
In higher dimensions, computing and inverting the full Jacobian (or Hessian for optimization) at every step is expensive. Quasi-Newton methods build and update an approximation to the Jacobian (or its inverse) using information gathered from successive iterates.
- BFGS (Broyden-Fletcher-Goldfarb-Shanno): the most widely used quasi-Newton method for optimization; updates an approximation to the inverse Hessian.
- Broyden's method: a rank-one update to the Jacobian approximation for systems of equations.
These methods trade some convergence speed for a large reduction in per-iteration cost, and they typically achieve superlinear convergence.
Multidimensional Newton's Method
For a system of nonlinear equations $F(\mathbf{x}) = \mathbf{0}$, the iteration generalizes to:

$$\mathbf{x}_{k+1} = \mathbf{x}_k - J_F(\mathbf{x}_k)^{-1} F(\mathbf{x}_k)$$

where $J_F$ is the Jacobian matrix with entries $(J_F)_{ij} = \partial F_i / \partial x_j$. In practice, you never explicitly invert $J_F$. Instead, you solve the linear system:

$$J_F(\mathbf{x}_k)\,\boldsymbol{\delta}_k = -F(\mathbf{x}_k)$$

then update $\mathbf{x}_{k+1} = \mathbf{x}_k + \boldsymbol{\delta}_k$. This is more efficient and numerically stable. The main challenges are the cost of forming $J_F$ (which has $n^2$ entries) and the possibility that $J_F$ becomes singular or nearly singular during iteration.
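A sketch for a $2 \times 2$ system; the system itself is a made-up demo, $x^2 + y^2 = 5$ and $xy = 2$, which has $(1, 2)$ among its exact solutions. Because the Jacobian is only $2 \times 2$, the linear solve $J\,\boldsymbol{\delta} = -F$ is done by Cramer's rule rather than a general solver:

```python
def newton_2x2(x, y, tol=1e-12, max_iter=50):
    """Newton for F(x, y) = (x^2 + y^2 - 5, x*y - 2) = (0, 0)."""
    for _ in range(max_iter):
        F1 = x * x + y * y - 5.0
        F2 = x * y - 2.0
        # Jacobian J = [[dF1/dx, dF1/dy], [dF2/dx, dF2/dy]]
        a, b = 2.0 * x, 2.0 * y
        c, d = y, x
        det = a * d - b * c
        # Solve J @ (dx, dy) = -F by Cramer's rule (fine for 2x2)
        dx = (-F1 * d + F2 * b) / det
        dy = (-F2 * a + F1 * c) / det
        x, y = x + dx, y + dy
        if abs(dx) + abs(dy) <= tol:
            return x, y
    raise RuntimeError("no convergence")

x, y = newton_2x2(1.0, 1.5)
```

For larger systems, the Cramer's-rule line is exactly what you would replace with a factorization-based solve.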
Practical Considerations
Computational Cost
Each Newton iteration requires:
- One evaluation of $f(x_n)$
- One evaluation of $f'(x_n)$ (or an approximation)
- For multidimensional problems: forming the Jacobian ($n^2$ partial derivatives) and solving an $n \times n$ linear system ($O(n^3)$ via direct methods)
Compare this to bisection, which needs only one evaluation per step and no derivatives, but converges linearly. Newton's method typically needs far fewer iterations to reach a given tolerance, so the total cost is often lower despite the higher per-iteration expense.
Handling of Derivatives
Three main approaches for obtaining $f'$:
- Analytical derivatives: Exact and fast to evaluate once derived, but can be tedious or impractical for complicated functions.
- Numerical differentiation: Approximate $f'(x) \approx \dfrac{f(x+h) - f(x)}{h}$ for small $h$. Simple to implement, but introduces truncation error (from finite $h$) and round-off error (from small $h$). Choosing $h$ well matters; $h = \sqrt{\varepsilon_{\text{mach}}}\,\max(|x|, 1)$ is a common heuristic.
- Automatic differentiation (AD): Computes exact derivatives algorithmically by applying the chain rule to the program that evaluates $f$. Gives machine-precision derivatives without manual derivation. This is the preferred approach in modern scientific computing.
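The forward-difference approximation with that step-size heuristic looks like the following sketch, checked here against a derivative known in closed form ($\sin' = \cos$; the function name is my own):

```python
import math
import sys

def fd_derivative(f, x):
    """Forward difference with h = sqrt(machine eps) * max(|x|, 1)."""
    h = math.sqrt(sys.float_info.epsilon) * max(abs(x), 1.0)
    return (f(x + h) - f(x)) / h

approx = fd_derivative(math.sin, 1.0)
exact = math.cos(1.0)
# The error is roughly sqrt(eps) ~ 1e-8: far better than a careless h,
# but far worse than the machine-precision result AD would give.
```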
Numerical Stability Issues
Several things can go wrong in floating-point arithmetic:
- Division by near-zero: If $f'(x_n) \approx 0$, the step $f(x_n)/f'(x_n)$ becomes huge, sending the iterate far away. Detect this and fall back to a safer method.
- Overflow/underflow: Very large or very small intermediate values can exceed floating-point range.
- Cancellation error: When two nearly equal quantities are subtracted (as in finite-difference derivative approximations), computing their difference loses significant digits.
Mitigation techniques include scaling variables so that quantities are of moderate size, using regularization (adding a small term to prevent division by zero), and switching to extended precision for critical calculations.
Applications and Examples
Root-Finding Problems
Newton's method solves $f(x) = 0$ across many domains. A concrete example: find a root of $f(x) = x^2 - 2$. Here $f'(x) = 2x$. Starting from $x_0 = 1$:

$$x_1 = 1.5, \quad x_2 \approx 1.41667, \quad x_3 \approx 1.4142157$$

After just a few iterations, you converge to the root $\sqrt{2} \approx 1.4142136$. Transcendental equations like $\cos x = x$ and implicit curve intersections are handled the same way.
Optimization Techniques
To find a minimum or maximum of $g(x)$, set $g'(x) = 0$ and apply Newton's method to $g'$. The iteration becomes:

$$x_{n+1} = x_n - \frac{g'(x_n)}{g''(x_n)}$$
This is exactly Newton's method applied to the first-order optimality condition. In higher dimensions, this generalizes to the Newton step using the Hessian matrix, and it's often combined with line search or trust region strategies to ensure the step actually decreases the objective.
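As a sketch, take the made-up objective $g(x) = x - \ln x$, whose minimum is at $x = 1$, with $g'(x) = 1 - 1/x$ and $g''(x) = 1/x^2$:

```python
# Newton's method applied to g'(x) = 0 for g(x) = x - ln(x):
# the update x <- x - g'(x)/g''(x) simplifies here to x <- 2x - x^2.
x = 0.5
for _ in range(30):
    x = x - (1.0 - 1.0 / x) / (1.0 / (x * x))
# x now sits at the minimizer x = 1 to machine precision
```

Note that this stationary-point iteration cannot by itself tell a minimum from a maximum; that is one reason the line-search and trust-region safeguards mentioned above matter in practice.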

Nonlinear Systems of Equations
Systems of coupled nonlinear equations arise in chemical equilibrium calculations, nonlinear circuit analysis, and robotics (inverse kinematics). These require the multidimensional Newton's method with the Jacobian. The main practical challenges are forming the Jacobian efficiently and dealing with the possibility of multiple solutions, which means different initial guesses may lead to different roots.
Limitations and Challenges
Multiple Roots
For a root of multiplicity $m$ (where $m > 1$), standard Newton's method degrades to linear convergence. Two common fixes:
- Modified Newton's method: Use $x_{n+1} = x_n - m\,\dfrac{f(x_n)}{f'(x_n)}$ if you know the multiplicity $m$.
- Deflation: After finding one root $r$, divide it out (replace $f(x)$ with $f(x)/(x - r)$) and apply Newton's method again to find remaining roots.
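The contrast is easy to see numerically. This sketch uses the made-up function $f(x) = (x-1)^2(x+1)$, which has a double root at $x = 1$ (so $m = 2$):

```python
def f(x):
    return (x - 1.0) ** 2 * (x + 1.0)

def fp(x):
    return (x - 1.0) * (3.0 * x + 1.0)   # derivative of f, factored

def run(m, x0, iters):
    """m = 1 is standard Newton; m = 2 is the modified step for this root."""
    x = x0
    for _ in range(iters):
        d = fp(x)
        if d == 0.0:          # landed exactly on the root
            return x
        x = x - m * f(x) / d
    return x

plain = run(1.0, 2.0, 6)      # linear convergence: error still ~1e-2
modified = run(2.0, 2.0, 6)   # quadratic convergence: at the root
```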
The method can also converge to an unexpected root if the initial guess lies in a different root's basin of attraction.
Slow Convergence Cases
Convergence slows down when:
- The root has multiplicity greater than 1 (as above)
- $|f'(x)|$ is very small near the root, making the tangent line nearly horizontal
- The problem is poorly scaled, so that variables differ by many orders of magnitude
Acceleration techniques like Aitken's $\Delta^2$ method can speed up a linearly converging sequence. Problem reformulation or variable scaling can also help restore fast convergence.
Divergence Scenarios
Newton's method can diverge entirely when:
- The initial guess is too far from any root
- The iterates enter an attracting cycle (e.g., bouncing between two points)
- The function has vertical asymptotes or discontinuities that the iterates cross
Damped Newton's method addresses this by taking a fraction of the full Newton step:

$$x_{n+1} = x_n - \lambda_n \frac{f(x_n)}{f'(x_n)}, \qquad 0 < \lambda_n \le 1,$$

where $\lambda_n$ is chosen (often via a line search) to ensure sufficient decrease in $|f|$. Trust region methods provide another framework for controlling step size.
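A sketch of damping with a simple backtracking line search, on my own choice of test problem, $f(x) = \arctan x$, a classic case where the undamped step overshoots:

```python
import math

def damped_newton(f, fp, x, tol=1e-12, max_iter=100):
    """Newton with backtracking: halve the step until |f| decreases."""
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        step = -fx / fp(x)            # full Newton step
        lam = 1.0
        while abs(f(x + lam * step)) >= abs(fx):
            lam *= 0.5                # backtrack
            if lam < 1e-12:
                raise RuntimeError("line search failed")
        x += lam * step
    raise RuntimeError("no convergence")

# Plain Newton on arctan from x0 = 10 overshoots to about -139 on its
# first step and diverges; the damped version converges to the root 0.
root = damped_newton(math.atan, lambda x: 1.0 / (1.0 + x * x), 10.0)
```

This accepts any step that strictly reduces $|f|$; production codes usually demand a stronger "sufficient decrease" (Armijo-type) condition.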
Newton's Method vs. Alternatives
Fixed-Point Iteration Comparison
Fixed-point iteration rewrites $f(x) = 0$ as $x = g(x)$ and iterates $x_{n+1} = g(x_n)$. It's simpler (no derivatives needed) but converges only linearly when $0 < |g'(r)| < 1$, and may diverge when $|g'(r)| > 1$. Newton's method is itself a fixed-point iteration with $g(x) = x - f(x)/f'(x)$, specifically engineered so that $g'(r) = 0$ at a simple root, which is exactly what produces quadratic convergence.
Bisection Method Comparison
Bisection is the most reliable root-finding method: given an interval $[a, b]$ where $f$ changes sign, it's guaranteed to converge. But it only converges linearly, gaining roughly one binary digit per step. Newton's method is far faster near a root but offers no convergence guarantee without a good starting point.
A common practical strategy is to use bisection to get within a reasonable neighborhood of the root, then switch to Newton's method for rapid finishing. This hybrid approach combines the reliability of bisection with the speed of Newton.
Hybrid Methods
Several hybrid strategies exist:
- Newton-bisection: Maintain a bracket and take a Newton step if it stays inside the bracket; otherwise, take a bisection step. This guarantees convergence while usually achieving Newton-like speed.
- Globally convergent Newton with line search: Accept the Newton direction but choose the step length to ensure $|f|$ decreases sufficiently.
- Trust region Newton: Restrict the step to a region where the linear model is trusted, expanding or shrinking the region based on how well the model predicts actual function behavior.
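The first of these strategies can be sketched in a few lines (names are my own; the cubic $x^3 - 2x - 5$ with bracket $[2, 3]$ is just a demo problem):

```python
def newton_bisect(f, fp, a, b, tol=1e-12, max_iter=100):
    """Safeguarded Newton: bisect whenever the Newton step leaves [a, b]."""
    fa, fb = f(a), f(b)
    assert fa * fb < 0, "need a sign change on [a, b]"
    x = 0.5 * (a + b)
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        # shrink the bracket around the sign change
        if fa * fx < 0:
            b, fb = x, fx
        else:
            a, fa = x, fx
        d = fp(x)
        x_newton = x - fx / d if d != 0.0 else None
        if x_newton is not None and a < x_newton < b:
            x = x_newton          # Newton step stays inside the bracket
        else:
            x = 0.5 * (a + b)     # fall back to a bisection step
    return x

root = newton_bisect(lambda x: x**3 - 2*x - 5, lambda x: 3*x*x - 2, 2.0, 3.0)
```

Because the bracket never grows, convergence is guaranteed; because most accepted steps are Newton steps, the finish is fast.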
Advanced Topics
Complex Plane Applications
Newton's method extends naturally to functions of a complex variable $z$. The same formula $z_{n+1} = z_n - f(z_n)/f'(z_n)$ applies, and it can find complex roots of polynomials or solve systems involving complex-valued functions. The convergence theory carries over, though branch cuts and essential singularities introduce complications not present in the real case.
Fractals and Newton's Method
Applying Newton's method to a complex polynomial like $z^3 - 1$ and coloring each starting point by which root the iteration converges to produces a Newton fractal. The boundaries between basins of attraction are fractal curves with infinite detail. These fractals illustrate sensitive dependence on initial conditions: two starting points arbitrarily close together can converge to different roots. This connects Newton's method to chaos theory and dynamical systems.
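The basin computation behind such a picture is tiny; this sketch (function name is my own) classifies a single starting point for $z^3 - 1$:

```python
import cmath

# The three cube roots of unity, i.e. the roots of z^3 - 1
ROOTS = [cmath.exp(2j * cmath.pi * k / 3) for k in range(3)]

def basin(z0, iters=60):
    """Index of the root Newton's method reaches from z0 (None if stuck)."""
    z = z0
    for _ in range(iters):
        if z == 0:                # derivative 3z^2 vanishes here
            return None
        z = z - (z ** 3 - 1) / (3 * z ** 2)
    distances = [abs(z - r) for r in ROOTS]
    k = distances.index(min(distances))
    return k if distances[k] < 1e-6 else None
```

Coloring every pixel of a grid by `basin(x + 1j * y)` produces the fractal; points near basin boundaries are the ones that need many iterations or fail to settle.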
Parallel Implementations
For large-scale systems, parallelism can accelerate Newton's method at several levels:
- Jacobian assembly: Each row (or block) of the Jacobian can be computed independently.
- Linear solves: The Newton linear system can be solved using parallel direct or iterative solvers.
- Multiple starting points: Independent Newton iterations from different $x_0$ values can run simultaneously to find multiple roots.
The main challenges are load balancing (especially when different starting points converge at different rates) and communication overhead in distributed-memory settings.