📡Advanced Signal Processing Unit 4 Review

4.2 Least mean squares (LMS) algorithm

Written by the Fiveable Content Team • Last updated August 2025

The Least Mean Squares (LMS) algorithm is a foundational adaptive filtering technique that iteratively adjusts filter coefficients to minimize the mean square error between a desired signal and the filter's actual output. Because it estimates the gradient from instantaneous data rather than requiring full knowledge of signal statistics, LMS strikes a practical balance between optimality and computational cost, making it the workhorse of real-time adaptive filtering.

LMS can be understood as a stochastic approximation of the Wiener filter. Where the Wiener filter demands second-order statistics (autocorrelation and cross-correlation) that are rarely available in practice, LMS replaces the true gradient with an instantaneous estimate and iterates toward the same solution. This section covers the derivation, implementation, performance trade-offs, major variants, applications, and limitations of LMS.

Overview of LMS Algorithm

The LMS algorithm updates a finite impulse response (FIR) filter's weight vector at every time step so that the output tracks a desired signal as closely as possible. Three properties make it attractive:

  • Computational simplicity: Each iteration requires only $O(N)$ multiply-accumulate operations, where $N$ is the filter length.
  • No statistical pre-computation: Unlike the Wiener solution, LMS needs no prior estimate of the autocorrelation matrix $R_{xx}$ or the cross-correlation vector $p_{xd}$.
  • Online adaptation: Coefficients update sample-by-sample, so the filter can track slowly changing environments in real time.

Derivation of LMS Algorithm

Wiener Filter vs. LMS Algorithm

The Wiener filter minimizes the mean squared error (MSE) in closed form by solving $w_{opt} = R_{xx}^{-1} p_{xd}$, where $R_{xx}$ is the input autocorrelation matrix and $p_{xd}$ is the cross-correlation between input and desired signal. This requires stationarity and known second-order statistics.

LMS removes both requirements. It never forms or inverts $R_{xx}$. Instead, it takes small steps toward $w_{opt}$ using only the data available at each time instant. In a stationary environment with a properly chosen step size, LMS converges (in the mean) to the Wiener solution.

Steepest Descent Method

LMS is derived by first writing the exact steepest descent recursion on the MSE cost surface:

$$w(n+1) = w(n) - \mu \nabla_w E\left[e^2(n)\right]$$

where:

  • $w(n)$ is the $N \times 1$ weight vector at iteration $n$
  • $\mu$ is the step size (learning rate)
  • $\nabla_w E[e^2(n)]$ is the gradient of the MSE with respect to $w$

Expanding the gradient gives $\nabla_w E[e^2(n)] = -2p_{xd} + 2R_{xx}w(n)$. This still requires $R_{xx}$ and $p_{xd}$, so steepest descent alone is not practical for real-time use.

Gradient Estimation in LMS

The key LMS approximation replaces the ensemble-average gradient with its instantaneous estimate. At each sample $n$:

  1. Compute the filter output: $y(n) = w^T(n)\, x(n)$

  2. Compute the error: $e(n) = d(n) - y(n)$, where $d(n)$ is the desired signal.

  3. Approximate the gradient: $\hat{\nabla} = -2\, e(n)\, x(n)$

Substituting into the steepest descent equation (and absorbing the factor of 2 into $\mu$) yields the LMS update rule:

$$w(n+1) = w(n) + \mu\, e(n)\, x(n)$$

This single equation is the entire algorithm. The instantaneous gradient estimate is unbiased (its expected value equals the true gradient), but it is noisy, which is the source of the steady-state misadjustment discussed below.

LMS Algorithm Implementation

A complete LMS iteration proceeds as follows:

  1. Initialize the weight vector $w(0)$ (typically to the zero vector).

  2. Read the new input sample and form the input vector $x(n) = [x(n),\, x(n-1),\, \dots,\, x(n-N+1)]^T$.

  3. Compute the filter output: $y(n) = w^T(n)\, x(n)$.

  4. Compute the error: $e(n) = d(n) - y(n)$.

  5. Update the weights: $w(n+1) = w(n) + \mu\, e(n)\, x(n)$.

  6. Increment $n$ and return to step 2.
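
The steps above can be sketched as a minimal NumPy reference implementation. The filter length, step size, and the one-sample-delay demo below are illustrative choices, not taken from the text:

```python
import numpy as np

def lms(x, d, N, mu):
    """Minimal LMS sketch: returns final weights and the error signal."""
    w = np.zeros(N)                        # step 1: zero initialization
    e = np.zeros(len(x))
    for n in range(N - 1, len(x)):
        xn = x[n - N + 1:n + 1][::-1]      # step 2: x(n) = [x(n), ..., x(n-N+1)]^T
        y = w @ xn                         # step 3: filter output y(n)
        e[n] = d[n] - y                    # step 4: error e(n)
        w = w + mu * e[n] * xn             # step 5: weight update
    return w, e

# Toy demo: the desired signal is the input delayed by one sample,
# so the weights should converge toward [0, 1, 0].
rng = np.random.default_rng(0)
x = rng.standard_normal(5000)
d = np.concatenate(([0.0], x[:-1]))
w, e = lms(x, d, N=3, mu=0.05)
```

With $\sigma_x^2 = 1$ and $N = 3$, the step size 0.05 sits well inside the conservative bound $2/(N\sigma_x^2) \approx 0.67$ discussed below.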

Initialization of Weights

Zero initialization is the most common choice and works well when the error surface has a single global minimum (which is guaranteed for the quadratic MSE surface of a linear FIR filter). Small random initialization can sometimes speed up convergence when the filter operates in a multi-modal landscape (e.g., certain nonlinear extensions), but for standard linear LMS the error surface is unimodal, so the starting point affects convergence time but not the final solution.

Choice of Step Size

The step size $\mu$ controls the fundamental trade-off in LMS:

  • Too large: Fast initial convergence, but the algorithm oscillates around or diverges from the optimum.
  • Too small: Stable and low misadjustment, but convergence is slow and tracking of non-stationary signals suffers.

A common practical rule of thumb sets $\mu$ as a fraction of the inverse of the filter length times the input power:

$$\mu \approx \frac{1}{N \cdot \hat{\sigma}_x^2}$$

where $\hat{\sigma}_x^2$ is an estimate of the input signal power. This keeps the algorithm well within the stability bound discussed next.

Stability Conditions for Convergence

For convergence in the mean, the step size must satisfy:

$$0 < \mu < \frac{2}{\lambda_{\max}}$$

where $\lambda_{\max}$ is the largest eigenvalue of the input autocorrelation matrix $R_{xx}$. Since $\operatorname{tr}(R_{xx}) = \sum_{i=1}^{N} \lambda_i = N \sigma_x^2$ and $\lambda_{\max} \leq \operatorname{tr}(R_{xx})$, a sufficient (conservative) condition is:

$$0 < \mu < \frac{2}{N \sigma_x^2}$$

Exceeding the upper bound causes the weights to grow without limit.
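
These bounds can be checked numerically. The sketch below (with an illustrative first-order autoregressive input) estimates $R_{xx}$ from sample autocorrelations and compares the eigenvalue-based bound with the conservative trace-based one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative correlated input: AR(1) process x[t] = a*x[t-1] + noise.
T, N, a = 100_000, 8, 0.9
x = np.zeros(T)
for t in range(1, T):
    x[t] = a * x[t - 1] + rng.standard_normal()

# Build the N x N Toeplitz autocorrelation matrix from biased sample estimates.
r = np.array([np.dot(x[:T - k], x[k:]) / T for k in range(N)])
Rxx = np.array([[r[abs(i - j)] for j in range(N)] for i in range(N)])

lam = np.linalg.eigvalsh(Rxx)
mu_eig = 2.0 / lam.max()            # bound from the largest eigenvalue
mu_trace = 2.0 / np.trace(Rxx)      # conservative trace-based bound
```

Because $\lambda_{\max} \leq \operatorname{tr}(R_{xx})$, the trace-based bound is always the tighter (smaller) of the two.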

Convergence of LMS Algorithm

Convergence speed is governed by the eigenvalue spread (condition number) of $R_{xx}$. Each mode of the weight vector converges at a rate determined by $(1 - \mu \lambda_i)$, where $\lambda_i$ is the corresponding eigenvalue. If the eigenvalues are tightly clustered, all modes converge at roughly the same rate. A large eigenvalue spread forces a small $\mu$ (to keep the fastest mode stable), which makes the slowest mode converge very slowly.

Performance Analysis of LMS

Mean Squared Error

The MSE at iteration $n$ is $\text{MSE}(n) = E[e^2(n)]$. It starts at the initial MSE (determined by $w(0)$) and decays toward a steady-state value. That steady-state MSE is always higher than the minimum MSE achievable by the Wiener filter because of the noise in the gradient estimate.

Misadjustment in Steady State

Misadjustment $\mathcal{M}$ quantifies how much worse LMS performs compared to the Wiener optimum:

$$\mathcal{M} = \frac{\text{MSE}_{\text{excess}}}{J_{\min}}$$

where $J_{\min}$ is the minimum MSE (Wiener solution) and $\text{MSE}_{\text{excess}} = \text{MSE}_{\text{steady-state}} - J_{\min}$. For small step sizes, misadjustment is approximately:

$$\mathcal{M} \approx \mu\, \operatorname{tr}(R_{xx}) = \mu\, N\, \sigma_x^2$$

This shows that misadjustment grows linearly with both $\mu$ and the filter length $N$.
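
The scaling can be observed in simulation. This sketch (white input, illustrative noise floor and step size) measures the steady-state excess MSE of a system-identification run; the measured misadjustment should land within the same order of magnitude as $\mu N \sigma_x^2$, with the exact constant depending on the convention used for $\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, mu, T = 8, 0.0125, 100_000
sigma_x2, J_min = 1.0, 0.01                 # input power and noise floor (Wiener MSE)

x = rng.standard_normal(T)
h = rng.standard_normal(N) * 0.3            # hypothetical unknown system
d = np.convolve(x, h)[:T] + np.sqrt(J_min) * rng.standard_normal(T)

w = np.zeros(N)
e = np.zeros(T)
for n in range(N - 1, T):
    xn = x[n - N + 1:n + 1][::-1]
    e[n] = d[n] - w @ xn
    w += mu * e[n] * xn

M_pred = mu * N * sigma_x2                  # predicted misadjustment (0.1 here)
J_ss = np.mean(e[T // 2:] ** 2)             # measured steady-state MSE
M_meas = (J_ss - J_min) / J_min             # measured misadjustment
```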

Convergence Speed vs. Misadjustment

This is the central trade-off in LMS design:

| Step size $\mu$ | Convergence speed | Steady-state misadjustment | Tracking ability |
| --- | --- | --- | --- |
| Large | Fast | High | Good (but noisy) |
| Small | Slow | Low | Poor (lags changes) |

There is no single "best" $\mu$. You choose it based on whether your application prioritizes fast adaptation or low residual error.

Tracking Ability of LMS

In non-stationary environments the optimal weight vector drifts over time. LMS can follow this drift, but only if $\mu$ is large enough for the adaptation rate to keep up with the rate of change. A larger $\mu$ improves tracking at the cost of higher misadjustment. If the environment changes faster than LMS can track, the algorithm accumulates a lag error on top of the usual gradient noise.

Variants of LMS Algorithm

Normalized LMS (NLMS)

Standard LMS is sensitive to the input signal level: a loud input effectively amplifies the step size, risking instability. NLMS fixes this by normalizing the update by the instantaneous input power:

$$w(n+1) = w(n) + \frac{\mu}{\epsilon + \|x(n)\|^2}\, e(n)\, x(n)$$

  • $\epsilon$ is a small positive regularization constant that prevents division by zero.
  • The effective step size now adapts automatically: it shrinks when the input is large and grows when the input is small.
  • NLMS typically converges faster than LMS for inputs with time-varying power and is the default choice in many practical systems.
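
A minimal NLMS sketch, assuming a hypothetical two-tap system and an input whose level swings by a factor of 100 — conditions under which a fixed-step LMS tuned for the quiet portion would be driven toward instability:

```python
import numpy as np

def nlms(x, d, N, mu, eps=1e-8):
    """Minimal NLMS sketch: step size normalized by instantaneous input power."""
    w = np.zeros(N)
    for n in range(N - 1, len(x)):
        xn = x[n - N + 1:n + 1][::-1]
        e = d[n] - w @ xn
        w += (mu / (eps + xn @ xn)) * e * xn   # normalized update
    return w

# Input whose amplitude grows from 1x to 100x over the record.
rng = np.random.default_rng(0)
T = 5000
envelope = 1.0 + 99.0 * (np.arange(T) / T)
x = envelope * rng.standard_normal(T)
h = np.array([1.0, -0.5])                      # hypothetical unknown system
d = np.convolve(x, h)[:T]
w = nlms(x, d, N=2, mu=0.5)
```

For NLMS the stability condition on the normalized step size is simply $0 < \mu < 2$, independent of the input power.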

Variable Step Size LMS

These algorithms let $\mu$ change over time to get the best of both worlds: large steps during initial convergence and small steps near steady state.

  • Gradient Adaptive Step Size (GASS-LMS): Adjusts μ\mu based on the correlation between successive gradient estimates.
  • Error-Squared (ES-LMS): Increases μ\mu when the error is large and decreases it when the error is small.

Both aim to approach the convergence speed of a large $\mu$ while achieving the low misadjustment of a small $\mu$.

Leaky LMS

Leaky LMS adds a weight decay term that pulls coefficients toward zero at each step:

$$w(n+1) = (1 - \mu \alpha)\, w(n) + \mu\, e(n)\, x(n)$$

The leakage factor $\alpha > 0$ prevents coefficient drift, which can occur when the input autocorrelation matrix is singular or nearly singular (e.g., when the input has fewer degrees of freedom than the filter length). The cost is a small bias in the steady-state solution away from the true Wiener optimum.
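
The bias can be made concrete. Leaky LMS converges in the mean to $(R_{xx} + \alpha I)^{-1} p_{xd}$, so for white input with $\sigma_x^2 = 1$ the weights settle near the true values shrunk by $1/(1 + \alpha)$. The system and parameter values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
T, mu, alpha = 20_000, 0.01, 0.1
x = rng.standard_normal(T)                  # white input, sigma_x^2 = 1
h = np.array([0.8, -0.4])                   # hypothetical true system
d = np.convolve(x, h)[:T]

w = np.zeros(2)
for n in range(1, T):
    xn = np.array([x[n], x[n - 1]])
    e = d[n] - w @ xn
    w = (1 - mu * alpha) * w + mu * e * xn  # leaky update pulls w toward zero

# Expect w close to h / (1 + alpha), i.e. biased toward the origin.
```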

Sign-Error LMS

Sign-error LMS replaces the error $e(n)$ with its sign:

$$w(n+1) = w(n) + \mu\, \operatorname{sign}(e(n))\, x(n)$$

This eliminates one multiplication per coefficient update, which matters in hardware implementations with strict power or area budgets. Convergence is slower and misadjustment is higher than standard LMS, but the algorithm still converges to the neighborhood of the optimal solution. Related variants include sign-data LMS (uses $\operatorname{sign}(x(n))$) and sign-sign LMS (uses both signs).
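
A sketch of the sign-error update on an illustrative two-tap system; the weights settle into a neighborhood of the optimum whose size scales with $\mu$ rather than converging exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
T, mu = 20_000, 0.005
x = rng.standard_normal(T)
h = np.array([0.5, 0.25])                  # hypothetical unknown system
d = np.convolve(x, h)[:T]

w = np.zeros(2)
for n in range(1, T):
    xn = np.array([x[n], x[n - 1]])
    e = d[n] - w @ xn
    w += mu * np.sign(e) * xn              # only the sign of the error is used

# w ends up near h, within a ball whose radius is governed by mu.
```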

Applications of LMS Filtering

Adaptive Noise Cancellation

A reference microphone picks up noise correlated with the interference contaminating the primary signal. The LMS filter models the path from the reference to the primary sensor and subtracts the estimated noise. Typical uses include removing 50/60 Hz power-line interference from ECG recordings, suppressing background noise in hearing aids, and canceling engine noise in active noise control headphones.

System Identification Using LMS

To identify an unknown system, you feed a known input (often white noise) through both the unknown system and an adaptive FIR filter. LMS adjusts the filter weights until its output matches the unknown system's output, at which point the filter coefficients approximate the unknown system's impulse response. This technique is used in control system modeling, acoustic room characterization, and fault detection.
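
The setup can be sketched as follows, with a hypothetical 4-tap "unknown" plant, a white-noise probe, and a little sensor noise; after adaptation the filter weights approximate the plant's impulse response:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, mu = 50_000, 4, 0.01
h = np.array([0.6, -0.3, 0.15, 0.05])      # hypothetical unknown impulse response
x = rng.standard_normal(T)                  # white-noise probe signal
d = np.convolve(x, h)[:T] + 0.01 * rng.standard_normal(T)  # plant output + noise

w = np.zeros(N)
for n in range(N - 1, T):
    xn = x[n - N + 1:n + 1][::-1]
    e = d[n] - w @ xn
    w += mu * e * xn                        # standard LMS update

# w now approximates the plant's impulse response h.
```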

Echo Cancellation with LMS

In telecommunications, a far-end signal leaks back through the hybrid (2-wire to 4-wire converter) or through acoustic coupling in a speakerphone. The LMS filter models the echo path and generates a replica of the echo, which is subtracted from the return signal. Echo cancellers in VoIP systems and hands-free car kits commonly use NLMS due to its robustness to varying signal levels.

Channel Equalization Using LMS

Communication channels introduce intersymbol interference (ISI) through multipath propagation and band-limiting. An LMS equalizer placed at the receiver adapts its coefficients using known training symbols (during a training phase) or decision-directed feedback (during data transmission) to invert the channel response and recover the transmitted symbols. This is standard practice in DSL modems, wireless receivers, and optical communication systems.
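
A training-phase sketch: BPSK symbols pass through a hypothetical two-tap ISI channel, and an 8-tap LMS equalizer adapts against a delayed copy of the known training sequence (channel, delay, and lengths are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, D, mu = 5000, 8, 3, 0.01
s = rng.choice([-1.0, 1.0], size=T)        # BPSK training symbols
ch = np.array([1.0, 0.4])                  # hypothetical ISI channel
r = np.convolve(s, ch)[:T]                 # received signal with intersymbol interference

w = np.zeros(N)
y = np.zeros(T)
for n in range(N - 1, T):
    rn = r[n - N + 1:n + 1][::-1]
    y[n] = w @ rn                          # equalizer output
    e = s[n - D] - y[n]                    # desired = training symbol delayed by D
    w += mu * e * rn

# After training, hard decisions on the tail should match the sent symbols.
acc = np.mean(np.sign(y[-1000:]) == s[-1000 - D:-D])
```

The equalization delay $D$ gives the causal equalizer room to approximate the (generally non-causal) channel inverse.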

Limitations of LMS Algorithm

Sensitivity to Eigenvalue Spread

When the eigenvalue spread $\chi = \lambda_{\max} / \lambda_{\min}$ of $R_{xx}$ is large, the convergence time constant of the slowest mode becomes $\tau_{\max} \approx 1/(\mu \lambda_{\min})$, which can be orders of magnitude longer than the fastest mode. This is the single biggest practical limitation of LMS.

Slow Convergence for Correlated Inputs

Highly correlated inputs (narrowband signals, speech, signals filtered by long channels) produce autocorrelation matrices with large eigenvalue spreads. The result is the same slow convergence described above. Mitigation strategies include:

  • Transform-domain LMS: Apply a DFT or DCT to approximately diagonalize $R_{xx}$, then run independent scalar LMS updates per frequency bin.
  • Prewhitening: Filter the input to flatten its spectrum before feeding it to LMS.
  • Affine projection algorithms: Use multiple past input vectors to improve the gradient estimate at the cost of higher complexity.

Performance in Non-Stationary Environments

LMS assumes (at least approximate) stationarity. When the optimal solution changes rapidly, LMS with a fixed step size may lag behind. For fast-varying environments, consider:

  • Variable step size LMS variants (discussed above)
  • The Recursive Least Squares (RLS) algorithm, which converges in roughly $N$ iterations regardless of eigenvalue spread, at the cost of $O(N^2)$ complexity per sample
  • Exponential forgetting windows that down-weight old data

Advanced Topics in LMS

LMS in Frequency Domain

Frequency-domain LMS (FLMS) performs filtering via overlap-save or overlap-add convolution using the FFT, reducing the per-sample cost from $O(N)$ to $O(\log N)$ for very long filters. Two common forms exist:

  • Unconstrained FLMS (UFLMS): Updates frequency-domain weights directly. Simpler but can produce non-causal filter components.
  • Constrained FLMS (CFLMS): Projects the frequency-domain weights back to a time-domain constraint (e.g., finite length), preserving the linear convolution property.

FLMS is especially beneficial when $N$ is large (hundreds to thousands of taps), as in acoustic echo cancellation.

Partial Update LMS

When even $O(N)$ per sample is too expensive, partial update strategies reduce computation by updating only a subset of weights each iteration:

  • Sequential LMS: Cycles through blocks of coefficients in round-robin fashion.
  • Periodic LMS: Updates all coefficients, but only every $K$-th sample.
  • Max-NLMS: Updates only the coefficients corresponding to the largest input components.

These methods trade a modest increase in convergence time for significant savings in multiplications per iteration.
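
The periodic variant is the simplest to sketch: the filter output and error are still computed every sample, but the $O(N)$ weight update runs only every $K$-th sample (values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, K, mu = 40_000, 4, 4, 0.02
h = np.array([0.5, -0.25, 0.1, 0.05])      # hypothetical unknown system
x = rng.standard_normal(T)
d = np.convolve(x, h)[:T]

w = np.zeros(N)
for n in range(N - 1, T):
    xn = x[n - N + 1:n + 1][::-1]
    e = d[n] - w @ xn                      # output and error every sample
    if n % K == 0:
        w += mu * e * xn                   # weight update only every K-th sample

# w still converges to h, just roughly K times more slowly.
```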

Distributed LMS over Networks

In sensor networks, each node runs a local LMS update on its own data and then shares weight estimates with neighbors via diffusion or consensus protocols. Compared to a centralized approach, distributed LMS offers better scalability, robustness to single-node failures, and reduced communication bandwidth. Diffusion LMS and incremental LMS are two well-studied distributed strategies, corresponding to different patterns of cooperation over the network topology.

LMS for Nonlinear Filtering

Standard LMS is limited to linear FIR models. Nonlinear extensions map the input into a higher-dimensional feature space and then apply LMS in that space:

  • Volterra LMS: Uses polynomial expansions of the input (second-order, third-order, etc.) as basis functions. Complexity grows rapidly with nonlinearity order.
  • Kernel LMS (KLMS): Implicitly maps inputs to a reproducing kernel Hilbert space, enabling nonlinear modeling without explicitly computing the feature map. Requires managing a growing dictionary of support vectors.
  • Neural network-based LMS: Replaces the linear filter with a neural network trained via backpropagation, with the LMS error signal driving the loss function.

These extensions broaden the scope of adaptive filtering to problems where linear models are fundamentally insufficient, such as nonlinear acoustic echo cancellation and predistortion of power amplifiers.