📡Advanced Signal Processing Unit 4 Review

4.3 Recursive least squares (RLS) algorithm

Written by the Fiveable Content Team • Last updated August 2025

Overview of RLS algorithm

The Recursive Least Squares (RLS) algorithm solves a core problem in adaptive filtering: how to update filter coefficients efficiently as new data arrives, without re-solving the entire least squares problem from scratch each time. It does this by recursively minimizing a weighted least squares cost function, which gives it significantly faster convergence than gradient-based methods like LMS.

RLS achieves this speed by exploiting second-order statistics of the input signal (through a recursively updated covariance matrix), rather than relying on instantaneous gradient estimates. The trade-off is computational cost: O(M^2) per time step versus O(M) for LMS, where M is the filter length. Despite this, RLS remains a go-to algorithm when fast adaptation and accurate tracking of time-varying systems are priorities.

Key concepts in RLS

Adaptive filtering

Adaptive filtering adjusts filter coefficients automatically based on the input signal and a desired reference output. The filter "learns" by minimizing the error between its output and the desired signal according to some optimization criterion.

In the RLS context, that criterion is weighted least squares. This makes adaptive filters useful in situations where the signal statistics are unknown or changing over time:

  • System identification: estimating an unknown system's impulse response
  • Noise cancellation: subtracting an estimated noise component from a corrupted signal
  • Channel equalization: compensating for distortions in a communication channel

Least squares estimation

Least squares estimation finds the parameter vector that minimizes the sum of squared errors between observed data and model predictions. Unlike gradient-descent approaches (which take incremental steps toward the optimum), least squares provides a closed-form solution via the normal equations.

In RLS, the least squares criterion is weighted by the forgetting factor λ, so recent samples contribute more than older ones. The key advantage is that this closed-form solution can be updated recursively: each new sample refines the previous solution rather than requiring a full recomputation.

Recursive updates

The recursive update mechanism is what makes RLS practical for real-time use. Rather than inverting the autocorrelation matrix from scratch at every time step, RLS applies the matrix inversion lemma to update the inverse incrementally.

Each recursion involves four operations:

  1. Compute the gain vector k(n) from the current input and the previous inverse covariance matrix
  2. Compute the a priori error e(n) using the previous weights
  3. Update the filter weights w(n) using the gain vector and error
  4. Update the inverse covariance matrix P(n)

This structure keeps the per-sample cost at O(M^2) instead of the O(M^3) that direct matrix inversion would require.

Mathematical formulation

Cost function minimization

RLS minimizes a weighted least squares cost function that exponentially discounts older samples:

J(n) = \sum_{i=1}^{n} \lambda^{n-i} |d(i) - \mathbf{w}^T(n)\mathbf{x}(i)|^2

Here, d(i) is the desired signal, w(n) is the filter weight vector at time n, x(i) is the input vector, and λ is the forgetting factor (0 < λ ≤ 1). The term λ^(n−i) means a sample that is k steps in the past gets weighted by λ^k, so older data fades exponentially.

Normal equations

Setting the gradient of J(n) with respect to w(n) to zero yields the normal equations:

\mathbf{R}(n)\mathbf{w}(n) = \mathbf{r}(n)

where:

  • \mathbf{R}(n) = \sum_{i=1}^{n} \lambda^{n-i} \mathbf{x}(i)\mathbf{x}^T(i) is the weighted autocorrelation matrix
  • \mathbf{r}(n) = \sum_{i=1}^{n} \lambda^{n-i} \mathbf{x}(i)d(i) is the weighted cross-correlation vector

Solving w(n) = R^{-1}(n) r(n) directly at each step would cost O(M^3). The matrix inversion lemma avoids this.
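To make the normal equations concrete, here is a minimal sketch (assuming NumPy; the function name is illustrative, not a library routine) that builds the weighted R(n) and r(n) from a batch of data and solves them directly — exactly the O(M^3) computation that the RLS recursion avoids repeating:

```python
import numpy as np

def batch_weighted_ls(X, d, lam):
    """Solve the weighted normal equations R w = r directly.

    X   : (n, M) matrix whose rows are the input vectors x(i)
    d   : (n,) desired signal
    lam : forgetting factor, 0 < lam <= 1
    """
    n = len(d)
    # Exponential weights lambda^(n-i): the most recent sample gets weight 1
    weights = lam ** np.arange(n - 1, -1, -1)
    Xw = X * weights[:, None]
    R = Xw.T @ X        # weighted autocorrelation matrix R(n)
    r = Xw.T @ d        # weighted cross-correlation vector r(n)
    return np.linalg.solve(R, r)
```

With noiseless data the solution matches the true FIR coefficients for any valid λ, which makes this a convenient reference when debugging a recursive implementation.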

Matrix inversion lemma

The matrix inversion lemma (Woodbury identity) provides a way to update R^{-1}(n) from R^{-1}(n-1) when R(n) changes by a rank-one update. Since:

\mathbf{R}(n) = \lambda \mathbf{R}(n-1) + \mathbf{x}(n)\mathbf{x}^T(n)

the inverse P(n) = R^{-1}(n) can be computed recursively without ever performing a full matrix inversion. This is the mathematical backbone that makes RLS computationally feasible.
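As a quick numerical sanity check (a sketch assuming NumPy; `p_update` is an illustrative name, not a library function), the rank-one recursion for P(n) can be compared against directly inverting λR(n−1) + x(n)xᵀ(n):

```python
import numpy as np

def p_update(P_prev, x, lam):
    """Matrix-inversion-lemma update of P = R^{-1} after R <- lam*R + x x^T."""
    Px = P_prev @ x
    k = Px / (lam + x @ Px)          # RLS gain vector
    return (P_prev - np.outer(k, x @ P_prev)) / lam
```

The recursion costs O(M^2) per step, while the direct inversion it replaces costs O(M^3).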

RLS algorithm derivation

Here is the complete RLS algorithm, step by step:

Step 0: Initialization

  • Set filter weights: w(0) = 0
  • Set inverse covariance matrix: P(0) = δ^{-1} I, where δ is a small positive constant (e.g., δ = 0.01). A smaller δ means a larger initial P, reflecting high initial uncertainty.
  • Choose forgetting factor λ, typically in the range [0.95, 1.0]. A common default is λ = 0.99.

Step 1: Gain vector calculation

For each new time step nn, compute the gain vector:

\mathbf{k}(n) = \frac{\mathbf{P}(n-1)\mathbf{x}(n)}{\lambda + \mathbf{x}^T(n)\mathbf{P}(n-1)\mathbf{x}(n)}

The gain vector controls how much the new sample influences the weight update. The denominator is a scalar, so this computation costs O(M^2) (one matrix-vector product plus an inner product).

Step 2: Error signal computation

Compute the a priori estimation error using the previous weights:

y(n) = \mathbf{w}^T(n-1)\mathbf{x}(n)

e(n) = d(n) - y(n)

This is the error before the weights are updated, which is what drives the correction.

Step 3: Filter weights update

Update the weight vector:

\mathbf{w}(n) = \mathbf{w}(n-1) + \mathbf{k}(n)e(n)

The correction k(n)e(n) pushes the weights in the direction that reduces the current error, scaled by the gain vector.

Step 4: Covariance matrix update

Update the inverse covariance matrix:

\mathbf{P}(n) = \lambda^{-1}\left(\mathbf{P}(n-1) - \mathbf{k}(n)\mathbf{x}^T(n)\mathbf{P}(n-1)\right)

This is the matrix inversion lemma applied recursively. The λ^{-1} factor accounts for the exponential weighting. Then return to Step 1 for the next sample.
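Steps 0 through 4 can be collected into a compact implementation. This is a minimal sketch assuming NumPy; the class name and interface are illustrative, and a production version would add the stability safeguards discussed later in this guide:

```python
import numpy as np

class RLS:
    """Exponentially weighted RLS, following Steps 0-4 above (illustrative sketch)."""

    def __init__(self, M, lam=0.99, delta=0.01):
        self.lam = lam
        self.w = np.zeros(M)          # Step 0: w(0) = 0
        self.P = np.eye(M) / delta    # Step 0: P(0) = delta^{-1} I

    def update(self, x, d):
        # Step 1: gain vector k(n)
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)
        # Step 2: a priori error e(n), using the previous weights
        e = d - self.w @ x
        # Step 3: weight update w(n) = w(n-1) + k(n) e(n)
        self.w = self.w + k * e
        # Step 4: inverse covariance update via the matrix inversion lemma
        self.P = (self.P - np.outer(k, x @ self.P)) / self.lam
        return e
```

With λ = 1 and noiseless data this converges to the batch least squares solution; with λ < 1 it instead tracks slow parameter drift.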

Properties of RLS

Fast convergence

RLS converges much faster than LMS because it incorporates second-order statistics of the input through the inverse covariance matrix P(n). Where LMS convergence speed depends on the eigenvalue spread of the input autocorrelation matrix (and slows dramatically when the spread is large), RLS effectively whitens the input, making its convergence rate nearly independent of the eigenvalue spread.

In practice, RLS can converge to near-optimal weights in roughly 2M to 3M iterations (where M is the filter order), regardless of the input signal's spectral characteristics.

Tracking ability

The forgetting factor λ gives RLS the ability to track time-varying systems. By exponentially down-weighting past data, the algorithm continuously adapts to changes in the underlying system. A smaller λ makes the algorithm more responsive to recent changes, while a larger λ provides smoother, more stable estimates.

This tracking capability is what makes RLS valuable in non-stationary environments like mobile communication channels, where the system parameters change continuously.

Computational complexity

The per-sample complexity of RLS is O(M^2) due to the matrix-vector operations in the gain vector and covariance matrix updates. Compare this to:

  • LMS: O(M) per sample
  • Direct least squares (re-solving from scratch): O(M^3) per sample

For large filter orders, the O(M^2) cost can become significant. Fast RLS variants (lattice-based, fast transversal filters) can reduce this to O(M), but at the cost of increased algorithmic complexity and potential numerical sensitivity.

Forgetting factor

Role in RLS

The forgetting factor λ (0 < λ ≤ 1) controls the effective memory of the algorithm. It determines how quickly the influence of past data decays.

  • λ = 1: infinite memory. All past samples are weighted equally. This is appropriate for stationary environments and is equivalent to growing-window least squares.
  • λ < 1: finite effective memory. The effective window length is approximately 1/(1 − λ). For example, λ = 0.99 gives an effective window of about 100 samples; λ = 0.95 gives about 20 samples.

Exponential weighting

The weight assigned to a sample k steps in the past is λ^k. This creates an exponential envelope over the data history. For λ = 0.99, a sample 100 steps old has weight 0.99^100 ≈ 0.37, while a sample 500 steps old has weight 0.99^500 ≈ 0.0067. The decay is smooth and continuous, with no hard cutoff.
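These figures are easy to reproduce; a small sketch (the helper names are illustrative):

```python
def effective_window(lam):
    """Approximate effective window length 1 / (1 - lambda)."""
    return 1.0 / (1.0 - lam)

def sample_weight(lam, k):
    """Exponential weight lambda^k given to a sample k steps in the past."""
    return lam ** k
```

Varying λ in these two functions shows the trade-off directly: shrinking λ shortens the window and concentrates weight on recent samples.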

Trade-off: tracking vs. stability

Choosing λ involves a fundamental trade-off:

  • Smaller λ (e.g., 0.95): faster tracking of system changes, but higher misadjustment noise. The algorithm reacts quickly but is more sensitive to measurement noise, and the matrix P(n) can grow, potentially causing instability.
  • Larger λ (e.g., 0.999): smoother, more stable estimates, but slower response to system changes. The algorithm may lag behind rapid variations.

In practice, the optimal λ depends on the rate of system variation relative to the noise level. For slowly varying systems in moderate noise, λ = 0.99 to 0.999 is typical. For rapidly changing environments, values closer to 0.95 may be needed.

Variants and extensions

Sliding window RLS

Instead of exponential weighting over the entire data history, sliding window RLS uses only the most recent L samples. Data older than L steps is discarded entirely. This provides a hard cutoff rather than the soft exponential decay of standard RLS.

Sliding window RLS can be more appropriate when the system undergoes abrupt changes (where you want to completely forget old data) and avoids the gradual covariance growth that can occur with λ < 1. The trade-off is additional bookkeeping to manage the window boundaries.

Stabilized fast RLS

Standard RLS can suffer from numerical instability, particularly when λ is close to 1 or the input signal is poorly conditioned. The matrix P(n) may lose symmetry or positive definiteness due to accumulated rounding errors.

Stabilized variants address this through matrix factorization techniques:

  • Square-root RLS: propagates a square-root factor of P(n) instead of P(n) itself, which guarantees positive definiteness
  • UD factorization: decomposes P(n) into U D U^T (unit upper triangular times diagonal times its transpose), providing better numerical conditioning

These methods add some implementation complexity but are strongly recommended for long-running or safety-critical applications.

QR decomposition-based RLS

QR-RLS reformulates the problem by maintaining a QR decomposition of the (weighted) data matrix rather than explicitly tracking P(n). At each time step, the QR factorization is updated using Givens rotations or Householder reflections.

QR-RLS offers superior numerical stability compared to standard RLS and is particularly well-suited for high filter orders or ill-conditioned inputs. The per-sample cost remains O(M^2), but the constant factor is somewhat larger than standard RLS.

Applications of RLS

System identification

RLS is widely used to estimate the parameters of an unknown system from its input-output data. The adaptive filter models the unknown system, and RLS adjusts the filter weights until the model output matches the actual system output.

This is valuable in control systems (estimating plant dynamics online), acoustic modeling (estimating room impulse responses), and biomedical engineering (modeling physiological systems). RLS is preferred over LMS here when fast, accurate identification is needed or when the system parameters drift over time.

Channel equalization

In digital communications, the transmitted signal is distorted by the channel (multipath propagation, frequency-selective fading). An RLS-based equalizer estimates the inverse of the channel response and applies it to the received signal to recover the original data.

RLS equalizers adapt quickly to changing channel conditions, which is critical in mobile wireless systems where the channel can vary on a millisecond timescale. The fast convergence of RLS means fewer training symbols are needed compared to LMS-based equalizers, improving spectral efficiency.

Noise cancellation

In adaptive noise cancellation, a reference signal correlated with the noise (but not with the desired signal) is filtered by an adaptive filter. The filter output estimates the noise component, which is then subtracted from the noisy observation.

RLS provides faster noise suppression than LMS because it converges more quickly to the optimal noise estimate. Applications include removing 50/60 Hz power line interference from ECG signals, canceling engine noise in hands-free communication systems, and suppressing acoustic echo in teleconferencing.

Comparison with other algorithms

RLS vs. LMS

Property               | RLS                                            | LMS
-----------------------|------------------------------------------------|-------------------------------------
Update basis           | Second-order statistics (covariance matrix)    | First-order (instantaneous gradient)
Convergence speed      | Fast, nearly independent of eigenvalue spread  | Slow when eigenvalue spread is large
Per-sample complexity  | O(M^2)                                         | O(M)
Memory requirements    | O(M^2) (stores the P matrix)                   | O(M)
Tracking ability       | Superior                                       | Adequate for slowly varying systems
Numerical stability    | Requires care (factorization methods)          | Inherently stable
Tuning parameter       | Forgetting factor λ                            | Step size μ

The choice between them depends on your constraints. If computational resources are limited and convergence speed is not critical, LMS is simpler and more robust. If you need fast convergence or are dealing with colored (non-white) input signals, RLS is the better choice.

RLS vs. Kalman filter

RLS can be viewed as a special case of the Kalman filter. Specifically, if you model the filter weights as the state vector with a random-walk state transition model and set the process noise covariance appropriately, the Kalman filter equations reduce to the RLS equations.

The Kalman filter is more general:

  • It handles state-space models with arbitrary dynamics (not just static or random-walk weight models)
  • It can incorporate known process noise statistics
  • It provides uncertainty estimates (error covariance) alongside state estimates

RLS is preferred when you specifically need an adaptive FIR filter and don't have a detailed state-space model. The Kalman filter is preferred when you have a physical model of how the system evolves and want to exploit that structure.

Numerical examples

Simulated data

Simulated examples are the standard way to verify an RLS implementation and study its behavior. A typical setup:

  1. Generate a known FIR system with coefficients w_true
  2. Drive it with white Gaussian noise to produce the desired signal d(n)
  3. Add observation noise at a specified SNR
  4. Run RLS and monitor the weight error norm ||w(n) − w_true|| over time

You can then vary λ, the filter order, or the SNR to observe how convergence speed, steady-state misadjustment, and tracking behavior change. Comparing the RLS learning curve against LMS under the same conditions clearly demonstrates the convergence advantage.
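The recipe above can be sketched as follows (assuming NumPy; all names are illustrative). This version omits the observation-noise step so the convergence contrast is deterministic and easy to see, and runs an LMS update alongside for comparison:

```python
import numpy as np

def run_experiment(M=8, n_samples=200, lam=0.99, delta=0.01, mu=0.01, seed=0):
    """Identify a known FIR system with RLS; return final weight errors."""
    rng = np.random.default_rng(seed)
    w_true = rng.standard_normal(M)          # step 1: known FIR coefficients
    w_rls, P = np.zeros(M), np.eye(M) / delta
    w_lms = np.zeros(M)
    for _ in range(n_samples):
        x = rng.standard_normal(M)           # step 2: white Gaussian input
        d = x @ w_true                       # noiseless desired signal
        # RLS update (gain, a priori error, weights, inverse covariance)
        Px = P @ x
        k = Px / (lam + x @ Px)
        w_rls = w_rls + k * (d - w_rls @ x)
        P = (P - np.outer(k, x @ P)) / lam
        # LMS update for comparison
        w_lms = w_lms + mu * (d - w_lms @ x) * x
    return np.linalg.norm(w_rls - w_true), np.linalg.norm(w_lms - w_true)
```

Re-running with observation noise added to d, or with a time-varying w_true, extends the experiment to the misadjustment and tracking studies described above.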

Real-world signals

Real-world examples demonstrate RLS under practical conditions where the signal statistics are unknown and non-stationary. Common demonstrations include:

  • Acoustic echo cancellation: using a recorded speech signal and a measured room impulse response
  • ECG denoising: removing power line interference from clinical ECG recordings
  • Wireless channel equalization: using measured or standards-based channel models (e.g., ITU pedestrian or vehicular channel profiles)

These examples highlight challenges that don't appear in simulations, such as model mismatch (the true system may not be exactly FIR), non-Gaussian noise, and abrupt system changes.

Implementation considerations

Numerical stability

Numerical stability is the most important practical concern when implementing RLS. The matrix P(n) must remain symmetric and positive definite throughout operation. In finite-precision arithmetic, accumulated rounding errors can violate these properties, leading to divergence.

Practical measures to ensure stability:

  • Use symmetrization: periodically replace P(n) with (P(n) + P^T(n))/2
  • Use square-root or UD factorization variants for long-running applications
  • Initialize P(0) carefully; too large a value can cause transient instability, too small can slow initial convergence
  • Monitor the trace or eigenvalues of P(n) as a diagnostic
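The symmetrization step and a basic eigenvalue diagnostic might look like this (a sketch assuming NumPy; `is_healthy` and its condition-number threshold are illustrative choices, not a standard API):

```python
import numpy as np

def symmetrize(P):
    """Restore exact symmetry: P <- (P + P^T) / 2."""
    return 0.5 * (P + P.T)

def is_healthy(P, cond_max=1e12):
    """Check that P is symmetric positive definite and well conditioned."""
    if not np.allclose(P, P.T):
        return False
    eigvals = np.linalg.eigvalsh(P)
    return eigvals.min() > 0 and eigvals.max() / eigvals.min() < cond_max
```

In a real-time loop, the eigenvalue check is too expensive to run every sample; calling it every few hundred updates (or monitoring the cheaper trace instead) is a reasonable compromise.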

Computational efficiency

For real-time systems, the O(M^2) per-sample cost sets a limit on the maximum filter order at a given sample rate. For example, at a 48 kHz audio sample rate, you have roughly 20 microseconds per sample. Whether M = 100 or M = 1000 is feasible depends on your processor.

Strategies to improve efficiency:

  • Exploit symmetry of P(n) to halve the storage and reduce multiplications
  • Use fast RLS variants (lattice RLS, fast transversal filter) for O(M) complexity when applicable
  • Consider block RLS, which processes multiple samples at once and can leverage vectorized hardware
  • Use fixed-point arithmetic on embedded platforms, though this requires careful scaling analysis

Hardware requirements

RLS requires storing the full M × M matrix P(n), so memory scales as O(M^2). For M = 512, that's roughly 2 MB in double precision (512² × 8 bytes), or about half that if symmetry is exploited. This is manageable on modern processors but can be a constraint on microcontrollers or low-power DSPs.
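The storage requirement is easy to estimate; a trivial helper (illustrative names, assuming 8-byte doubles unless stated otherwise):

```python
def p_matrix_bytes(M, bytes_per_element=8):
    """Memory for the full M x M matrix P."""
    return M * M * bytes_per_element

def p_matrix_bytes_symmetric(M, bytes_per_element=8):
    """Memory when only the upper triangle (including the diagonal) is stored."""
    return M * (M + 1) // 2 * bytes_per_element
```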

For demanding applications (large M, high sample rates), FPGA implementations can exploit the regular structure of the matrix operations for pipelined, parallel execution. GPU acceleration is another option for offline or batch processing scenarios where latency is less critical.