Fiveable

📡Advanced Signal Processing Unit 3 Review

3.5 Spectral subtraction and noise reduction

Written by the Fiveable Content Team • Last updated August 2025

Spectral subtraction fundamentals

Spectral subtraction is a frequency-domain technique for reducing additive noise from corrupted signals, most commonly speech. The core idea: if you can estimate what the noise spectrum looks like on its own, you can subtract it from the noisy signal's spectrum to recover an approximation of the clean signal. This technique assumes the noise is uncorrelated with the clean signal and that the noise spectrum can be estimated during periods when only noise is present (e.g., speech pauses).

Additive noise model

The foundation of spectral subtraction is the additive noise model. In the time domain, the noisy signal y(n) is modeled as the sum of the clean signal x(n) and the noise d(n):

y(n) = x(n) + d(n)

Taking this into the frequency domain via the DFT, the relationship carries over directly:

Y(k) = X(k) + D(k)

The goal is to estimate the clean signal spectrum \hat{X}(k) by subtracting a noise spectrum estimate \hat{D}(k) from the observed noisy spectrum:

\hat{X}(k) = Y(k) - \hat{D}(k)

This looks straightforward on paper, but the entire challenge of spectral subtraction lies in getting \hat{D}(k) right and handling the consequences when it's imperfect.

Noisy signal spectrum

The noisy signal spectrum Y(k) is obtained by applying the Short-Time Fourier Transform (STFT) to y(n). The STFT segments the signal into overlapping frames, windows each frame, and computes the DFT of each windowed frame. This produces a time-frequency representation.

For spectral subtraction, you work with two components of Y(k):

  • The magnitude spectrum |Y(k)|, which carries the amplitude information at each frequency bin
  • The phase spectrum \angle Y(k), which is typically left unmodified and reused during reconstruction (the human ear is relatively insensitive to phase distortions in many contexts)
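The analysis stage above can be sketched with NumPy; the frame length, hop size, and Hann window below are illustrative choices, not prescribed by the method:

```python
import numpy as np

def stft(y, frame_len=256, hop=128):
    """Segment y into overlapping Hann-windowed frames and DFT each one."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # one complex spectrum per frame

rng = np.random.default_rng(0)
y = rng.standard_normal(2048)
Y = stft(y)
mag, phase = np.abs(Y), np.angle(Y)  # the two components spectral subtraction works with
```

The magnitude is what gets modified; the phase is held aside and reattached at synthesis time.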

Noise spectrum estimation

Spectral subtraction requires an estimate of the noise spectrum \hat{D}(k). The noise spectrum is typically estimated during periods of speech absence or noise-only segments, under the assumption that the noise is stationary or slowly varying relative to the speech signal.

Common noise estimation techniques include:

  • Voice activity detection (VAD): Classifies frames as speech or non-speech, then averages the spectrum during non-speech frames
  • Minimum statistics noise estimation: Tracks the minimum power spectral density over sliding time windows
  • Recursive averaging: Smoothly updates the noise estimate using a weighted combination of the current and previous estimates
  • Histogram-based methods: Analyze the PSD distribution per frequency bin to separate noise from speech statistically

Each of these is covered in more detail later in this guide.

Signal spectrum estimation

Once \hat{D}(k) is obtained, the clean signal spectrum estimate is computed by subtraction. Depending on the implementation, this subtraction happens in either the magnitude domain or the power domain (discussed in the next section).

The estimated clean magnitude spectrum |\hat{X}(k)| is then combined with the original noisy phase \angle Y(k) to form the complex spectrum. The enhanced time-domain signal is reconstructed using the inverse STFT with overlap-add synthesis.
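Putting the pieces together, a minimal end-to-end sketch in NumPy. It assumes the first few frames of the recording are noise-only (a common but idealized assumption) and omits synthesis-window normalization for brevity:

```python
import numpy as np

def enhance(y, noise_frames=5, frame_len=256, hop=128):
    """Basic magnitude spectral subtraction with noisy-phase reconstruction."""
    win = np.hanning(frame_len)
    n = 1 + (len(y) - frame_len) // hop
    frames = np.stack([y[i * hop : i * hop + frame_len] * win for i in range(n)])
    Y = np.fft.rfft(frames, axis=1)
    D_hat = np.abs(Y[:noise_frames]).mean(axis=0)  # noise magnitude from leading noise-only frames
    X_mag = np.maximum(np.abs(Y) - D_hat, 0.0)     # subtract and half-wave rectify
    X = X_mag * np.exp(1j * np.angle(Y))           # reuse the noisy phase
    synth = np.fft.irfft(X, n=frame_len, axis=1)
    out = np.zeros(len(y))
    for i in range(n):                             # overlap-add synthesis
        out[i * hop : i * hop + frame_len] += synth[i] * win
    return out
```

A production implementation would add the over-subtraction, flooring, and noise-tracking refinements discussed below.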

Spectral subtraction process

The basic subtraction concept is simple, but several parameters and design choices significantly affect both noise reduction quality and artifact behavior. The key decisions involve choosing magnitude vs. power subtraction, setting the over-subtraction factor, applying spectral flooring, and selecting a rectification strategy.

Magnitude vs. power subtraction

Spectral subtraction can operate on magnitudes or on power (squared magnitudes):

  • Magnitude subtraction: |\hat{X}(k)| = |Y(k)| - |\hat{D}(k)|
    • Computationally simpler
    • Can produce negative values, which require rectification (see below)
  • Power subtraction: |\hat{X}(k)|^2 = |Y(k)|^2 - |\hat{D}(k)|^2
    • Naturally produces non-negative power estimates (as long as the noisy power exceeds the noise estimate)
    • Requires a square root to recover the magnitude spectrum for reconstruction

More generally, these can be unified as |\hat{X}(k)|^p = |Y(k)|^p - |\hat{D}(k)|^p, where p = 1 gives magnitude subtraction and p = 2 gives power subtraction. The choice of p affects how aggressively low-SNR bins are suppressed.
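The unified formulation is easy to express directly; `subtract_p` below is an illustrative helper covering both cases (negatives are clipped before the root):

```python
import numpy as np

def subtract_p(Y_mag, D_mag, p=2.0):
    """Generalized spectral subtraction: |X|^p = max(|Y|^p - |D|^p, 0)."""
    return np.maximum(Y_mag**p - D_mag**p, 0.0) ** (1.0 / p)

Y_mag = np.array([4.0, 2.0, 1.0])  # toy noisy magnitudes
D_mag = np.array([1.0, 1.0, 1.0])  # toy noise estimate
```

With p = 1 the low-magnitude bin is driven straight to zero; with p = 2 more energy survives in intermediate bins, which is one reason power subtraction tends to sound less aggressive.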

Over-subtraction factor

The over-subtraction factor \alpha scales the noise estimate before subtraction to compensate for underestimation of the noise spectrum:

|\hat{X}(k)| = |Y(k)| - \alpha |\hat{D}(k)|

Typical values range from 1 to 5. Setting \alpha = 1 performs standard subtraction. Higher values remove more noise but increase the risk of speech distortion and musical noise artifacts. The right value depends on the SNR and the accuracy of your noise estimate.

Spectral flooring

After subtraction, some frequency bins may have very small or negative values. Spectral flooring prevents the estimated spectrum from dropping below a minimum threshold tied to the noisy signal level:

|\hat{X}(k)| = \max(|\hat{X}(k)|, \beta |Y(k)|)

The parameter \beta is typically small (e.g., 0.01 to 0.1). Spectral flooring is one of the most effective tools for reducing musical noise, because it ensures a smooth residual noise floor rather than isolated spectral peaks popping in and out across frames.

Half-wave vs. full-wave rectification

When magnitude subtraction produces negative values, rectification is needed:

  • Half-wave rectification: |\hat{X}(k)| = \max(|\hat{X}(k)|, 0)
    • Clips negative values to zero
    • Simple, but zeroing out bins can create spectral holes that contribute to musical noise
  • Full-wave rectification: |\hat{X}(k)| = \left| |Y(k)| - |\hat{D}(k)| \right|
    • Takes the absolute value of the subtracted result
    • Preserves more spectral energy, but the "folded" negative values effectively add noise back in

In practice, half-wave rectification combined with spectral flooring is the more common choice, since full-wave rectification can reintroduce noise components.
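A single-frame sketch combining over-subtraction with flooring; since the floor \beta |Y(k)| is non-negative, it also takes care of half-wave rectification in one step:

```python
import numpy as np

def subtract_frame(Y_mag, D_mag, alpha=2.0, beta=0.02):
    """Over-subtract the noise estimate, then floor at beta * |Y(k)|."""
    raw = Y_mag - alpha * D_mag           # may go negative in noise-dominated bins
    return np.maximum(raw, beta * Y_mag)  # non-negative floor doubles as rectification
```

The default alpha and beta here are illustrative mid-range values; in practice they are tuned per application (and often per band, as discussed later).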

Noise reduction performance

Evaluating spectral subtraction involves both objective metrics and subjective listening tests. The key trade-off is always between removing noise and preserving signal quality. Aggressive subtraction removes more noise but introduces more artifacts.

Signal-to-noise ratio improvement

Signal-to-noise ratio (SNR) measures the relative power of the desired signal versus the noise. The SNR improvement from spectral subtraction is:

\text{SNR}_{\text{imp}} = \text{SNR}_{\text{enhanced}} - \text{SNR}_{\text{noisy}}

Higher SNR improvement indicates more noise removed. However, SNR alone is an incomplete measure. A system could achieve high SNR improvement by aggressively zeroing out frequency bins, but the resulting signal might sound terrible due to artifacts. That's why perceptual metrics (PESQ, POLQA) and subjective listening tests are also used in practice.
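A sketch of the improvement measure; note that computing it this way requires access to the clean reference signal, which is why it is mostly used in simulation rather than on live recordings:

```python
import numpy as np

def snr_db(signal, noise):
    """SNR in dB from signal and noise sample arrays."""
    return 10 * np.log10(np.sum(signal**2) / np.sum(noise**2))

def snr_improvement(clean, noisy, enhanced):
    """SNR_imp = SNR_enhanced - SNR_noisy, both measured against the clean reference."""
    return snr_db(clean, enhanced - clean) - snr_db(clean, noisy - clean)
```

For example, halving the residual noise amplitude yields roughly a 6 dB improvement.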

Spectral subtraction artifacts

Spectral subtraction is a non-linear process, and estimation errors inevitably produce artifacts. The three main categories are:

  • Musical noise: Isolated tonal artifacts caused by random spectral peaks surviving subtraction
  • Residual noise: Remaining noise when the estimate undershoots the true noise level
  • Speech distortion: Loss or warping of speech components when the estimate overshoots

These artifacts tend to worsen at low input SNR, where the noise estimate is less reliable relative to the signal.

Musical noise

Musical noise is the most distinctive and perceptually annoying artifact of spectral subtraction. It manifests as randomly appearing and disappearing tonal components, often described as "tinkles" or "birdies."

The mechanism: frame-to-frame random fluctuations in the noise spectrum mean that some bins are over-subtracted (producing zeros or near-zeros) while neighboring bins are under-subtracted (leaving spectral peaks). These isolated peaks sound tonal.

Strategies to reduce musical noise:

  • Spectral flooring to maintain a smooth noise floor
  • Time-frequency smoothing across adjacent bins and frames
  • Perceptual weighting to focus subtraction where artifacts would be audible

Residual noise

Residual noise is the noise that remains after subtraction, typically because the noise spectrum was underestimated. It's perceived as background hiss or a muffled quality.

Increasing \alpha (the over-subtraction factor) reduces residual noise, but pushes you toward more musical noise and speech distortion. More accurate noise estimation techniques provide a better path to reducing residual noise without this trade-off.

Spectral subtraction variations

Several modifications to the basic algorithm address its limitations in different noise environments.

Multi-band spectral subtraction

Rather than applying a single set of parameters across the entire spectrum, multi-band spectral subtraction divides the frequency range into sub-bands and processes each independently. This is motivated by the fact that noise characteristics and local SNR often vary significantly across frequency.

Each sub-band can have its own over-subtraction factor \alpha_i and spectral floor \beta_i. For example, you might subtract more aggressively in a low-frequency band dominated by noise while being gentler in a mid-frequency band where speech formants carry important information. This frequency-dependent processing generally produces fewer artifacts than full-band processing with a single parameter set.
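A sketch of per-band processing; the band edges and per-band factors below are illustrative values, not recommendations:

```python
import numpy as np

def multiband_subtract(Y_mag, D_mag, band_edges, alphas, beta=0.02):
    """Apply a different over-subtraction factor in each frequency band."""
    X = np.empty_like(Y_mag)
    for (lo, hi), a in zip(band_edges, alphas):
        X[lo:hi] = Y_mag[lo:hi] - a * D_mag[lo:hi]  # per-band over-subtraction
    return np.maximum(X, beta * Y_mag)              # common spectral floor

# e.g. 3 bands over 129 bins: aggressive low band, gentle formant band, moderate high band
bands = [(0, 40), (40, 90), (90, 129)]
alphas = [4.0, 1.5, 2.5]
```

In a fuller implementation the \alpha_i would themselves adapt to each band's estimated SNR.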

Non-linear spectral subtraction

Non-linear spectral subtraction adapts the subtraction strength based on the instantaneous SNR at each frequency bin. The idea is to subtract aggressively where the SNR is low (noise-dominated) and gently where the SNR is high (signal-dominated).

This can be implemented through non-linear mapping functions applied to the subtraction gain, such as:

  • Power-law functions
  • Logarithmic mappings
  • Sigmoid-based curves

The result is a better trade-off between noise removal and signal preservation, since high-SNR regions (where the speech is strong) are left largely intact.
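As one illustration, a sigmoid-based gain mapping driven by the per-bin SNR; the steepness `k` and midpoint are hypothetical tuning parameters:

```python
import numpy as np

def sigmoid_gain(Y_mag, D_mag, k=1.0, snr_mid_db=5.0):
    """SNR-dependent suppression gain: ~1 in signal-dominated bins, ~0 in noise-dominated ones."""
    snr_db = 20 * np.log10(np.maximum(Y_mag, 1e-12) / np.maximum(D_mag, 1e-12))
    return 1.0 / (1.0 + np.exp(-k * (snr_db - snr_mid_db)))
```

Multiplying |Y(k)| by this gain suppresses low-SNR bins smoothly instead of subtracting a fixed amount everywhere.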

Iterative spectral subtraction

Iterative spectral subtraction applies the subtraction process multiple times in sequence. Each iteration uses the enhanced output from the previous pass as its input, and the noise estimate can be refined at each stage.

This progressive approach can improve results, but it requires careful control. Each iteration risks compounding artifacts, and the computational cost scales linearly with the number of passes. Convergence criteria or a fixed small number of iterations (2-3) are typically used.

Spectral subtraction with oversubtraction

This formalizes the use of the over-subtraction factor introduced earlier. The oversubtracted noise estimate is:

|\hat{D}_{\text{over}}(k)| = \alpha |\hat{D}(k)|, \quad \alpha > 1

This scaled estimate is then used in the standard subtraction:

|\hat{X}(k)| = |Y(k)| - |\hat{D}_{\text{over}}(k)|

The distinction from the basic over-subtraction factor described earlier is mainly conceptual: here, oversubtraction is treated as a deliberate design strategy rather than just a tuning parameter. The same trade-off applies: higher \alpha reduces residual noise but increases musical noise risk.

Noise estimation techniques

The quality of the noise estimate is arguably the single most important factor in spectral subtraction performance. A poor noise estimate leads to either residual noise (underestimation) or speech distortion and musical noise (overestimation).

Voice activity detection

Voice activity detection (VAD) classifies each frame as either speech-active or speech-absent. During speech-absent frames, the noise spectrum is estimated directly by averaging the observed spectrum.

VAD algorithms typically use features like:

  • Short-time energy
  • Spectral flatness or entropy
  • Zero-crossing rate
  • Statistical likelihood ratios

The noise estimate is updated only during detected silence, so the accuracy of the VAD directly determines the quality of the noise estimate. Challenges include misclassifying low-energy speech as noise (causing the noise estimate to be contaminated with speech) and failing to track non-stationary noise that changes during speech-active periods.
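A minimal energy-threshold VAD feeding the noise average; the threshold is a hypothetical tuning parameter, and real VADs combine several of the features listed above:

```python
import numpy as np

def vad_noise_estimate(frames_psd, energy_thresh):
    """Average the PSD over frames whose short-time energy falls below a threshold."""
    energies = frames_psd.sum(axis=1)                      # crude energy-based VAD decision
    noise_frames = frames_psd[energies < energy_thresh]    # keep only presumed non-speech frames
    return noise_frames.mean(axis=0)
```

If a low-energy speech frame slips under the threshold, its energy contaminates the estimate, which is exactly the failure mode described above.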

Minimum statistics noise estimation

Minimum statistics noise estimation (MSNE), introduced by Martin, tracks the minimum power spectral density of the noisy signal over a sliding time window. The key insight: even during speech, some frequency bins at some time instants will contain mostly noise. Over a sufficiently long window, the minimum PSD in each bin approximates the noise PSD.

A bias correction factor is applied because the minimum of a set of noisy observations systematically underestimates the true mean. A smoothing parameter controls how quickly the estimate adapts.

MSNE is widely used because it does not require explicit VAD and can track slowly varying noise. Its main limitation is that it can underestimate the noise level when noise changes rapidly or when continuous interference is present.
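A windowed-minimum sketch of the idea; the window length and bias factor below are illustrative, whereas Martin's method derives the bias correction analytically from the smoothing statistics:

```python
import numpy as np

def min_stats_noise(psd_frames, win=8, bias=1.5):
    """Per-bin minimum PSD over a sliding window, scaled by a bias correction."""
    T, K = psd_frames.shape
    est = np.empty((T, K))
    for t in range(T):
        lo = max(0, t - win + 1)
        est[t] = bias * psd_frames[lo:t + 1].min(axis=0)  # raw minimum under-shoots the mean
    return est
```

The window must be long enough to span speech activity in every bin, but short enough to track slow noise changes.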

Recursive averaging

Recursive averaging updates the noise estimate frame by frame using an exponential moving average:

\hat{D}(k, t) = \lambda \hat{D}(k, t-1) + (1 - \lambda) |Y(k, t)|^2

The smoothing factor \lambda (typically 0.9 to 0.99) controls the trade-off between tracking speed and estimate stability. Values close to 1 produce a very smooth, slowly adapting estimate.

This technique is simple and computationally cheap, but it has a fundamental problem: during speech-active frames, the noisy spectrum includes speech energy, which inflates the noise estimate. This leads to over-subtraction and speech distortion. To mitigate this, recursive averaging is often combined with VAD so that updates occur only (or primarily) during non-speech frames.
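The update rule is a one-liner per frame; the sketch below adds an optional VAD mask to freeze the estimate during speech, as described above:

```python
import numpy as np

def recursive_noise(psd_frames, lam=0.95, speech_mask=None):
    """Exponential moving average of the noisy PSD, optionally gated by a VAD mask."""
    D = psd_frames[0].astype(float).copy()
    for t in range(1, len(psd_frames)):
        if speech_mask is None or not speech_mask[t]:  # update only during non-speech
            D = lam * D + (1.0 - lam) * psd_frames[t]
    return D
```

Without the mask, any speech energy in a frame is folded into the noise estimate and slowly decays out of it.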

Histogram-based noise estimation

Histogram-based methods build a histogram of the PSD values observed at each frequency bin over time. The noise level is then estimated from the histogram's statistical properties, such as the mode or a low percentile.

The reasoning: in a given frequency bin, the PSD distribution is a mixture of noise-only observations and speech-plus-noise observations. The noise-only component tends to cluster at lower PSD values, so the mode of the histogram often corresponds to the noise level.

These methods can adapt to non-stationary noise and don't require explicit VAD. However, they need enough data to build reliable histograms, and their computational cost is higher than recursive averaging or minimum statistics approaches.
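A per-bin sketch using the histogram mode (bin count is an illustrative choice):

```python
import numpy as np

def histogram_noise(psd_track, n_bins=20):
    """Estimate the noise level in one frequency bin as the mode of its PSD histogram."""
    counts, edges = np.histogram(psd_track, bins=n_bins)
    m = np.argmax(counts)                 # most populated histogram bin
    return 0.5 * (edges[m] + edges[m + 1])  # its center approximates the noise level
```

Because noise-only observations dominate the low end of the distribution, the mode sits near the noise floor even though the mean is pulled upward by speech-plus-noise frames.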

Spectral subtraction enhancements

Beyond the basic algorithm and its variations, several enhancements incorporate psychoacoustic principles or additional filtering stages to improve perceptual quality.

Perceptual weighting

Perceptual weighting adjusts the subtraction strength across frequency based on the sensitivity of human hearing. Frequency regions where the ear is more sensitive receive more careful processing, while less sensitive regions can tolerate more aggressive subtraction and more residual noise.

The weights can be derived from psychoacoustic models such as:

  • The absolute threshold of hearing (frequencies where hearing is less sensitive can tolerate more residual noise)
  • Equal-loudness contours (ISO 226)

By shaping the noise reduction to match auditory sensitivity, perceptual weighting reduces the subjective impact of both residual noise and musical noise artifacts.

Psychoacoustic masking

Psychoacoustic masking exploits the fact that a louder sound can render a quieter sound at nearby frequencies inaudible. In spectral subtraction, this means that noise components falling below the masking threshold of nearby speech components don't need to be removed, because the listener can't hear them anyway.

The process works as follows:

  1. Compute the masking threshold for each frequency bin based on the estimated signal spectrum and a psychoacoustic masking model
  2. Compare the noise level at each bin to the masking threshold
  3. Apply subtraction only where the noise exceeds the masking threshold
  4. Leave noise components below the masking threshold untouched

This preserves a more natural-sounding noise floor and significantly reduces musical noise, since the isolated spectral peaks that cause musical noise are often below the masking threshold and can simply be left alone.

Adaptive spectral subtraction

Adaptive spectral subtraction dynamically adjusts parameters like \alpha and \beta based on local signal characteristics. For instance:

  • In low-SNR frequency regions, increase \alpha for more aggressive noise removal
  • In high-SNR regions, decrease \alpha to preserve speech fidelity
  • Adjust the spectral floor \beta based on the estimated speech presence probability

The adaptation can also incorporate perceptual criteria, such as only increasing subtraction strength where the resulting artifacts would fall below the masking threshold. This frequency- and time-dependent parameter adjustment produces better results than any fixed parameter setting, at the cost of increased algorithmic complexity.
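As one illustrative adaptation rule, a linear ramp of \alpha against local SNR; the breakpoints (-5 dB and +20 dB) are hypothetical tuning choices:

```python
import numpy as np

def adaptive_alpha(snr_db, alpha_max=5.0, alpha_min=1.0):
    """Ramp alpha from alpha_max at -5 dB SNR down to alpha_min at +20 dB."""
    a = alpha_max - (alpha_max - alpha_min) * (snr_db + 5.0) / 25.0
    return np.clip(a, alpha_min, alpha_max)  # hold the endpoints outside the ramp
```

Evaluating this per band (or per bin) and per frame gives the frequency- and time-dependent behavior described above.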

Wiener filtering post-processing

Wiener filtering can be applied as a post-processing step after spectral subtraction to further refine the enhanced signal. The Wiener filter estimates the clean signal by minimizing the mean-square error between the estimated and true clean signals.

The Wiener gain at each frequency bin is:

G_W(k) = \frac{|\hat{X}(k)|^2}{|\hat{X}(k)|^2 + |\hat{D}(k)|^2}

This gain is applied to the spectral-subtraction output. In bins where the estimated signal power is much larger than the noise power, the gain approaches 1 (pass through). In bins where noise dominates, the gain approaches 0 (suppress).

Wiener post-processing smooths out the spectral irregularities left by subtraction, reducing musical noise while providing additional noise suppression. The combination of spectral subtraction followed by Wiener filtering is a common and effective pipeline in practical speech enhancement systems.
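The post-filter gain is straightforward to apply to the subtraction output (a small constant guards against division by zero in empty bins):

```python
import numpy as np

def wiener_postfilter(X_mag, D_mag, eps=1e-12):
    """Apply the Wiener gain G = |X|^2 / (|X|^2 + |D|^2) to the subtraction output."""
    G = X_mag**2 / (X_mag**2 + D_mag**2 + eps)
    return G * X_mag
```

Bins where the estimated signal dominates pass almost unchanged (G near 1), while bins the subtraction already emptied are pushed further toward zero, smoothing out isolated residual peaks.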