Fiveable

📡Advanced Signal Processing Unit 6 Review

6.1 Short-time Fourier transform (STFT)

Written by the Fiveable Content Team • Last updated August 2025

Definition of STFT

The Short-Time Fourier Transform (STFT) lets you analyze how the frequency content of a signal changes over time. The standard Fourier transform gives you a global frequency picture but tells you nothing about when those frequencies occur. For non-stationary signals (where frequency content shifts over time), that's a serious problem. STFT solves it by windowing the signal into short segments and computing the Fourier transform of each one, producing a two-dimensional time-frequency representation.

Fourier Transform for Non-Stationary Signals

The classical Fourier transform assumes stationarity: the frequency content doesn't change over the duration of the signal. Real-world signals like speech, music, and EEG rarely satisfy this assumption. Their spectral characteristics evolve continuously.

STFT addresses this by localizing the analysis in time. Rather than transforming the entire signal at once, you isolate short segments where the signal is approximately stationary, then transform each segment independently.

Time-Frequency Representation

STFT maps a one-dimensional time-domain signal onto a two-dimensional time-frequency plane. Each point $(t, f)$ in this plane indicates the magnitude (and phase) of frequency component $f$ at time $t$.

This representation lets you visualize spectral evolution directly. A pure tone that sweeps upward in frequency, for instance, appears as a diagonal ridge in the time-frequency plane rather than a smeared blob across all frequencies.

Sliding Window Approach

The mechanics of STFT follow a straightforward procedure:

  1. Choose a window function $w[n]$ of length $L$.

  2. Position the window at time index $n$.

  3. Multiply the signal $x[m]$ by the shifted window $w[n - m]$ to extract a local segment.

  4. Compute the Fourier transform of that windowed segment.

  5. Slide the window forward by a hop size $H$ (where $H < L$ if you want overlap) and repeat.

The result is a series of time-localized spectra, one per window position, that together form the full time-frequency representation.
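The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production routine: the function name `stft`, the Hann window, and the test tone are all assumptions chosen for the example, and frames that would run past the end of the signal are simply dropped.

```python
import numpy as np

def stft(x, L, H):
    """Return a (num_frames, L) complex array: one FFT per window position."""
    w = np.hanning(L)                        # step 1: window of length L (Hann is an assumption)
    starts = range(0, len(x) - L + 1, H)     # steps 2 and 5: window positions spaced by hop H
    frames = np.array([x[s:s + L] * w for s in starts])   # step 3: windowed local segments
    return np.fft.fft(frames, axis=1)        # step 4: Fourier transform of each segment

fs = 8000
t = np.arange(fs) / fs                       # one second of samples
x = np.sin(2 * np.pi * 1000 * t)             # 1 kHz test tone
X = stft(x, L=256, H=128)

peak_bin = int(np.argmax(np.abs(X[0, :128])))
print(X.shape, peak_bin * fs / 256)          # frame count x bins, and the peak frequency in Hz
```

Each row of `X` is one time-localized spectrum. Because the window here is symmetric, using $w[m - n]$ instead of $w[n - m]$ gives the same segments.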

Properties of STFT

The usefulness of your STFT output depends heavily on how you configure the analysis. The core properties below govern what you can and can't resolve.

Time and Frequency Resolution Trade-off

This is the single most important concept in STFT analysis. Time resolution is your ability to pinpoint when something happens; frequency resolution is your ability to distinguish which frequencies are present.

  • A longer window captures more oscillation cycles, giving you finer frequency resolution but smearing events across a wider time interval.
  • A shorter window localizes events precisely in time but blurs the frequency spectrum because fewer cycles are observed.

You cannot improve both simultaneously. Every STFT configuration is a compromise.

Window Size vs. Frequency Resolution

The frequency resolution of the STFT is approximately $\Delta f \approx \frac{f_s}{L}$, where $f_s$ is the sampling rate and $L$ is the window length in samples. Doubling the window length roughly halves $\Delta f$, giving you twice the frequency detail.

The trade-off cost: the time resolution is approximately $\Delta t \approx \frac{L}{f_s}$. So that same doubling of $L$ also doubles the time interval over which spectral information is averaged. If the signal's frequency content changes rapidly within that interval, you'll miss it.

Choosing the optimal window size requires knowing something about your signal. For speech, window lengths of 20–40 ms are typical because speech can be considered quasi-stationary over that duration.
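To put numbers on the trade-off, here is a small sketch that applies the two approximations above to the 20 ms and 40 ms speech windows. The 16 kHz sampling rate and the helper name `stft_resolution` are assumptions made for the example.

```python
def stft_resolution(fs, window_ms):
    """Rule-of-thumb resolutions for a window of the given duration."""
    L = round(fs * window_ms / 1000)   # window length in samples
    delta_f = fs / L                   # approximate frequency resolution in Hz
    delta_t = L / fs                   # approximate time resolution in seconds
    return L, delta_f, delta_t

for ms in (20, 40):
    L, df, dt = stft_resolution(16000, ms)
    print(f"{ms} ms window: L = {L} samples, delta_f = {df:.1f} Hz, delta_t = {dt * 1000:.0f} ms")
```

Going from 20 ms to 40 ms halves $\Delta f$ (50 Hz to 25 Hz at this sampling rate) and doubles $\Delta t$.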

Overlap Between Windows

Adjacent windows typically share a fraction of their samples. This overlap serves two purposes:

  • Temporal continuity: Without overlap, you get coarse time sampling of the spectral evolution. Overlap increases the number of spectral snapshots per unit time.
  • Artifact reduction: Window functions taper the signal at the edges, attenuating samples near window boundaries. Overlap ensures those attenuated regions are captured at full weight in neighboring windows.

Common overlap values are 50% to 75% of the window length. Higher overlap gives smoother time-frequency representations at the cost of increased computation (more frames to process).
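The computational cost of overlap is easy to quantify by counting frames. The sketch below assumes a one-second signal at 48 kHz and a 1024-sample window; `frame_count` is a hypothetical helper, and it counts only frames that fit entirely inside the signal.

```python
def frame_count(num_samples, L, overlap):
    """Hop size and number of complete frames for a given fractional overlap."""
    H = round(L * (1 - overlap))             # hop size in samples
    return H, 1 + (num_samples - L) // H     # frames that fit entirely in the signal

for overlap in (0.0, 0.5, 0.75):
    H, frames = frame_count(48000, 1024, overlap)
    print(f"overlap {overlap:.0%}: hop = {H}, frames = {frames}")
```

Moving from 50% to 75% overlap halves the hop, so the number of FFTs roughly doubles.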

Computation of STFT

Discrete-Time STFT

For digital implementation, the STFT operates on sampled signals. The discrete-time STFT is defined as:

$$X[n, k] = \sum_{m=-\infty}^{\infty} x[m] \, w[n - m] \, e^{-j\frac{2\pi}{N}mk}$$

where:

  • $x[m]$ is the input signal
  • $w[n - m]$ is the window function centered at time index $n$
  • $N$ is the DFT length (number of frequency bins)
  • $k$ is the frequency bin index

In practice, the summation limits are finite because $w[n]$ has finite support of length $L$. The DFT length $N$ is often chosen as a power of 2 and may be larger than $L$ (zero-padding) to interpolate the frequency axis for smoother spectral estimates.

Efficient Implementation Using FFT

Computing the DFT directly for each frame costs $O(N^2)$ operations. The FFT reduces this to $O(N \log N)$ per frame, which matters enormously when you're processing thousands of frames.

The practical computation steps are:

  1. Extract the $n$-th frame: multiply $x[m]$ by $w[n - m]$.

  2. Zero-pad the windowed segment to length $N$ if $N > L$.

  3. Apply the FFT to obtain $X[n, k]$ for $k = 0, 1, \ldots, N-1$.

  4. Advance $n$ by the hop size $H$ and repeat.
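As a sanity check on these steps, the sketch below computes one frame two ways: by evaluating the DFT sum directly (the $O(N^2)$ route) and with a zero-padded FFT. The values of `L`, `N`, `H`, the Hamming window, and the random test signal are assumptions for the example. Both versions index $m$ from the start of the frame, so they differ from the definition above only by a linear phase factor that depends on the frame position.

```python
import numpy as np

def frame_dft_direct(segment, N):
    """O(N^2) evaluation of the DFT sum for one zero-padded frame."""
    padded = np.concatenate([segment, np.zeros(N - len(segment))])
    m = np.arange(N)                  # sample index within the padded frame
    k = m.reshape(-1, 1)              # frequency bin index, one row per bin
    return (padded * np.exp(-2j * np.pi * k * m / N)).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
L, N, H = 200, 256, 100               # window length, DFT length, hop size
w = np.hamming(L)

start = 3 * H                         # position of the fourth frame
frame = x[start:start + L] * w        # step 1: windowed segment
X_fft = np.fft.fft(frame, n=N)        # steps 2 and 3: fft zero-pads to N, then transforms
X_direct = frame_dft_direct(frame, N)

print(np.allclose(X_fft, X_direct))
```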

Spectrogram Representation

The spectrogram is the squared magnitude of the STFT:

$$S[n, k] = |X[n, k]|^2$$

It's displayed as a 2D image with time on the horizontal axis, frequency on the vertical axis, and color/brightness encoding energy. The spectrogram discards phase information, which is acceptable for visualization and many analysis tasks but not for signal reconstruction (where phase is essential).

Spectrograms are the primary tool for visual inspection of time-frequency content. Sustained tones appear as horizontal ridges, broadband transients appear as vertical streaks, and frequency sweeps (chirps) appear as diagonal traces.
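A short sketch of the chirp case, assuming SciPy is available. It builds a linear sweep with `scipy.signal.chirp`, computes the STFT with `scipy.signal.stft`, and follows the brightest bin in each frame. The sweep range, window length, and overlap are arbitrary choices for illustration.

```python
import numpy as np
from scipy import signal

fs = 8000
t = np.arange(2 * fs) / fs
x = signal.chirp(t, f0=500, t1=2.0, f1=3000)       # linear sweep, 500 Hz to 3 kHz in 2 s

f, frame_times, Zxx = signal.stft(x, fs=fs, window="hann", nperseg=256, noverlap=192)
S = np.abs(Zxx) ** 2                               # spectrogram: squared magnitude
S_db = 10 * np.log10(S + 1e-12)                    # dB scale for display; offset avoids log(0)

ridge = f[np.argmax(S, axis=0)]                    # strongest frequency in each frame
interior = (frame_times > 0.2) & (frame_times < 1.8)
expected = 500 + 1250 * frame_times                # instantaneous frequency of the sweep
print(np.max(np.abs(ridge[interior] - expected[interior])))
```

Plotting `S_db` against `frame_times` and `f` (for example with Matplotlib's `pcolormesh`) would show the diagonal trace; the `ridge` array is the same information reduced to one frequency per frame.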

Window Functions for STFT

The window function shapes both the spectral resolution and the leakage characteristics of your STFT. Every window involves a trade-off between mainlobe width (which determines frequency resolution) and sidelobe level (which determines how much energy from one frequency leaks into neighboring bins).

Rectangular Window

The rectangular window assigns uniform weight across all LL samples and zero outside:

$$w[n] = 1, \quad 0 \leq n \leq L-1$$

It has the narrowest mainlobe of any window of the same length, giving the best nominal frequency resolution. However, its sidelobes are only about 13 dB below the mainlobe peak, causing severe spectral leakage. Weak frequency components near strong ones can be completely masked. For this reason, the rectangular window is rarely used in practice.

Hann and Hamming Windows

Both are raised-cosine windows that taper smoothly to reduce spectral leakage.

  • Hann window: $w[n] = 0.5\left(1 - \cos\!\left(\frac{2\pi n}{L-1}\right)\right)$. It tapers to zero at both endpoints, which makes it well-suited for overlap-add reconstruction. First sidelobe is about $-32$ dB.
  • Hamming window: $w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{L-1}\right)$. It doesn't quite reach zero at the endpoints, which gives a slightly narrower mainlobe than Hann and a lower peak sidelobe of around $-43$ dB.

Both are standard choices for general-purpose STFT analysis. The Hann window is often preferred when perfect reconstruction via overlap-add is needed (at 50% overlap).
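The overlap-add claim can be checked numerically: perfect reconstruction requires the shifted windows to sum to a constant. The sketch below is an illustration with an assumed window length of 512. It compares the periodic form of the Hann window (denominator $L$) with the symmetric form (denominator $L-1$) at 50% overlap; `overlap_add_sum` is a hypothetical helper.

```python
import numpy as np

def overlap_add_sum(w, H, num_frames=20):
    """Sum of shifted copies of w, keeping only the fully overlapped interior."""
    L = len(w)
    total = np.zeros(H * (num_frames - 1) + L)
    for i in range(num_frames):
        total[i * H:i * H + L] += w
    return total[L:-L]                # drop the edges where fewer windows overlap

L = 512
n = np.arange(L)
hann_periodic = 0.5 * (1 - np.cos(2 * np.pi * n / L))         # denominator L
hann_symmetric = 0.5 * (1 - np.cos(2 * np.pi * n / (L - 1)))  # denominator L - 1

for name, w in (("periodic", hann_periodic), ("symmetric", hann_symmetric)):
    s = overlap_add_sum(w, H=L // 2)
    print(f"{name}: ripple = {s.max() - s.min():.2e}")
```

The periodic form sums to a constant up to rounding error, while the symmetric form leaves a small ripple. That is why STFT code intended for reconstruction usually uses the periodic variant.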

Gaussian Window

The Gaussian window follows:

$$w[n] = e^{-\frac{1}{2}\left(\frac{n - (L-1)/2}{\sigma}\right)^2}$$

where $\sigma$ controls the width. The Gaussian window is unique in that it achieves the minimum time-bandwidth product, meaning it provides the tightest joint time-frequency localization allowed by the uncertainty principle. The parameter $\sigma$ directly controls the time-frequency resolution balance.

Gaussian windows are particularly relevant in Gabor analysis and are commonly used when optimal time-frequency concentration is the priority.

Trade-offs Between Window Types

| Window | Mainlobe Width | Sidelobe Level | Best Use Case |
|---|---|---|---|
| Rectangular | Narrowest | Highest (~−13 dB) | Rarely used; only when max frequency resolution is critical and leakage is tolerable |
| Hann | Moderate | Low (~−32 dB) | General-purpose; overlap-add reconstruction |
| Hamming | Moderate | Lower (~−43 dB) | General-purpose; slightly better sidelobe suppression than Hann |
| Gaussian | Adjustable via $\sigma$ | Very low (smooth decay) | Optimal time-frequency localization; Gabor analysis |

The right choice depends on whether you care more about resolving close frequencies (favor narrower mainlobe) or suppressing leakage from strong components (favor lower sidelobes).
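The sidelobe figures quoted for the rectangular, Hann, and Hamming windows can be reproduced approximately from a heavily zero-padded FFT of each window. This is a rough numerical sketch: the window length of 128, the oversampling factor, and the helper `peak_sidelobe_db` are assumptions, and the measured values shift slightly with window length.

```python
import numpy as np

def peak_sidelobe_db(w, oversample=64):
    """Highest sidelobe level in dB relative to the mainlobe peak."""
    nfft = oversample * len(w)
    mag = np.abs(np.fft.rfft(w, nfft))
    mag_db = 20 * np.log10(mag / mag[0] + 1e-300)   # tiny offset avoids log(0) at exact nulls
    i = 0
    while mag_db[i + 1] < mag_db[i]:                # walk down the mainlobe to its first minimum
        i += 1
    return mag_db[i:].max()                         # everything past that point is sidelobe

L = 128
windows = {
    "rectangular": np.ones(L),
    "hann": np.hanning(L),
    "hamming": np.hamming(L),
}
levels = {name: peak_sidelobe_db(w) for name, w in windows.items()}
for name, level in levels.items():
    print(f"{name}: {level:.1f} dB")
```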

Interpretation of STFT

Time-Frequency Localization

Each STFT coefficient $X[n, k]$ represents the signal's content near time $n$ and frequency bin $k$, weighted by the window function. The "near" part is critical: you're not getting the exact instantaneous frequency at an exact time instant. You're getting a local average over a time-frequency region whose size is set by the window.

This localization lets you track how individual frequency components appear, evolve, and disappear over the signal's duration.

Identification of Signal Components

Different signal features produce characteristic patterns in the spectrogram:

  • Harmonics: Appear as parallel horizontal lines at the fundamental frequency and its integer multiples.
  • Formants (in speech): Show up as bands of concentrated energy at resonant frequencies of the vocal tract. They shift over time as the speaker articulates different sounds.
  • Transients (clicks, onsets, percussive hits): Appear as vertical lines or short vertical bursts, since they contain broadband energy concentrated in a brief time interval.
  • Chirps (frequency sweeps): Appear as diagonal ridges whose slope indicates the rate of frequency change.

Visualization of the Spectrogram

Spectrograms are typically displayed on a log-frequency or linear-frequency vertical axis, with magnitude shown in decibels (dB scale) to compress the dynamic range. Color maps range from grayscale to perceptually uniform palettes like viridis.

When reading a spectrogram, bright regions indicate high energy at that time-frequency location, and dark regions indicate low energy or silence. The spectrogram is one of the most widely used tools for exploratory analysis in speech, music, bioacoustics, and fault detection.

Applications of STFT

Speech Processing and Analysis

STFT is the backbone of most speech processing pipelines. Spectral features extracted from STFT frames (such as mel-frequency cepstral coefficients, spectral envelope, and pitch contours) feed into speech recognition, speaker verification, and emotion detection systems. The spectrogram itself is often used directly as input to neural network-based speech models.

Typical STFT parameters for speech: 25 ms window, 10 ms hop size (60% overlap), Hamming or Hann window.

Audio Signal Processing

Audio coding standards (like MP3 and AAC) use a closely related windowed, overlapped transform, the modified discrete cosine transform (MDCT), for perceptual coding and compression. Source separation algorithms (isolating vocals from a mix, for example) operate in the STFT domain, modifying magnitude and phase before resynthesizing via inverse STFT. Time-stretching and pitch-shifting algorithms also rely on STFT phase vocoder techniques.

Biomedical Signal Analysis

EEG analysis uses STFT to track frequency band power (delta, theta, alpha, beta, gamma) over time, which is essential for sleep staging, seizure detection, and brain-computer interfaces. ECG and EMG analyses similarly benefit from time-frequency views to detect arrhythmias or characterize muscle activation patterns.
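Band-power tracking can be sketched as summing spectrogram bins inside each band, frame by frame. Everything specific here is an assumption for illustration: the synthetic "EEG" (a 10 Hz tone that switches to 20 Hz halfway through, plus noise), the 256 Hz sampling rate, the 2-second window, and the band edges, which vary between sources.

```python
import numpy as np
from scipy import signal

fs = 256
t = np.arange(20 * fs) / fs
rng = np.random.default_rng(0)
# Synthetic stand-in for EEG: alpha-range tone for 10 s, then a beta-range tone.
x = np.where(t < 10, np.sin(2 * np.pi * 10 * t), np.sin(2 * np.pi * 20 * t))
x = x + 0.1 * rng.standard_normal(len(t))

f, frame_times, Zxx = signal.stft(x, fs=fs, nperseg=2 * fs, noverlap=fs)
power = np.abs(Zxx) ** 2

# Band edges in Hz; these boundaries are one common convention, not a standard.
bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
band_power = {
    name: power[(f >= lo) & (f < hi)].sum(axis=0)   # one value per frame
    for name, (lo, hi) in bands.items()
}

early = (frame_times > 2) & (frame_times < 8)
late = (frame_times > 12) & (frame_times < 18)
print(band_power["alpha"][early].mean() > band_power["beta"][early].mean())
print(band_power["beta"][late].mean() > band_power["alpha"][late].mean())
```

With real recordings you would typically also average or smooth these band-power traces before using them for staging or detection.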

Time-Varying Frequency Analysis

Radar and sonar systems use STFT to detect Doppler shifts from moving targets, where the frequency shift changes over time as the target moves. Vibration analysis in mechanical systems uses spectrograms to identify bearing faults, gear mesh frequencies, and other rotating machinery signatures that evolve with operating conditions.

Limitations of STFT

Fixed Time-Frequency Resolution

Once you select a window length, the time-frequency resolution is locked for the entire analysis. A signal might contain both slowly varying tonal components (needing good frequency resolution) and sharp transients (needing good time resolution) simultaneously. STFT forces you to pick one compromise that applies everywhere, which can be suboptimal for signals with diverse time-frequency characteristics.

Uncertainty Principle

The Heisenberg-Gabor uncertainty principle sets a hard lower bound on the joint time-frequency resolution:

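$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$

Here $\Delta t$ and $\Delta f$ are the RMS (standard-deviation) duration and bandwidth of the window, with $\Delta f$ in hertz. This is a different measure from the rule-of-thumb resolutions $L/f_s$ and $f_s/L$ used earlier. No window function, no matter how cleverly designed, can beat this bound. The Gaussian window achieves equality (the minimum product), which is why it's considered optimal in this sense. But even at the theoretical minimum, you cannot have arbitrarily fine resolution in both domains at once.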
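The bound can be checked numerically by computing the RMS duration and RMS bandwidth of a window from $|w[n]|^2$ and $|W(f)|^2$. The sketch below works in samples and cycles per sample, so the product is dimensionless. The window length, the value of $\sigma$, and the helper name are assumptions for the example.

```python
import numpy as np

def time_bandwidth_product(w, nfft=1 << 14):
    """RMS duration (samples) times RMS bandwidth (cycles/sample) of a real window."""
    n = np.arange(len(w))
    p_t = np.abs(w) ** 2 / np.sum(np.abs(w) ** 2)     # energy density in time
    t_mean = np.sum(n * p_t)
    sigma_t = np.sqrt(np.sum((n - t_mean) ** 2 * p_t))

    W = np.fft.fft(w, nfft)
    f = np.fft.fftfreq(nfft)                          # frequencies in cycles per sample
    p_f = np.abs(W) ** 2 / np.sum(np.abs(W) ** 2)     # energy density in frequency
    sigma_f = np.sqrt(np.sum(f ** 2 * p_f))           # mean frequency is 0 for a real window
    return sigma_t * sigma_f

L = 2049
n = np.arange(L)
gaussian = np.exp(-0.5 * ((n - (L - 1) / 2) / 100.0) ** 2)   # sigma = 100 samples
hann = np.hanning(L)

bound = 1 / (4 * np.pi)
print(f"bound    = {bound:.4f}")
print(f"gaussian = {time_bandwidth_product(gaussian):.4f}")
print(f"hann     = {time_bandwidth_product(hann):.4f}")
```

The Gaussian should land on the bound to within numerical error, while the Hann window sits a few percent above it.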

Alternatives to STFT

The fixed-resolution limitation of STFT has motivated several alternative approaches:

  • Wavelet transform: Uses short windows at high frequencies and long windows at low frequencies, providing multi-resolution analysis that adapts across the frequency axis. This is often more natural for signals with both high-frequency transients and low-frequency oscillations.
  • Wigner-Ville distribution: Provides very high time-frequency resolution but suffers from cross-term interference between signal components, which can make interpretation difficult for multi-component signals.
  • Cohen's class distributions: A generalization that includes the Wigner-Ville distribution and allows kernel-based suppression of cross-terms, at the cost of some resolution.

Each alternative addresses specific shortcomings of STFT but introduces its own trade-offs.

Advanced Topics in STFT

Multi-Taper STFT

Standard STFT uses a single window, which means the spectral estimate at each time frame has high variance (it's based on one realization). Multi-taper STFT addresses this by computing multiple STFTs using a set of orthogonal tapers, typically Slepian sequences (discrete prolate spheroidal sequences, DPSS).

The individual spectral estimates are then averaged, which reduces variance and improves robustness to noise. The trade-off is a slight reduction in frequency resolution (the effective bandwidth widens) and increased computation (one FFT per taper per frame). Multi-taper methods are especially valuable in low-SNR scenarios where single-taper estimates are unreliable.
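For a single frame, the multi-taper estimate can be sketched with `scipy.signal.windows.dpss`. The time-bandwidth parameter `NW = 2.5`, the choice of four tapers, and the white-noise test frame are assumptions for illustration. A multi-taper spectrogram would apply the same averaging to every frame.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(frame, NW=2.5, K=4):
    """Average of K tapered periodograms, plus the first-taper estimate for comparison."""
    tapers = dpss(len(frame), NW, Kmax=K)                      # shape (K, len(frame))
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return spectra.mean(axis=0), spectra[0]

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)          # white noise: the true spectrum is flat

multi, single = multitaper_psd(frame)

def variation(p):
    """Relative spread of a spectral estimate across frequency bins."""
    return p.std() / p.mean()

print(f"single taper: {variation(single):.2f}, multi-taper: {variation(multi):.2f}")
```

Since the true spectrum is flat, the spread across bins reflects estimator variance. Averaging four roughly independent tapers should cut it to about half of the single-taper value.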

Synchrosqueezing Transform

Synchrosqueezing is a post-processing step applied to the STFT that sharpens the time-frequency representation. The idea is:

  1. Compute the standard STFT.
  2. Estimate the instantaneous frequency at each time-frequency point from the phase derivative: $\hat{\omega}[n, k] = \frac{\partial}{\partial t} \arg(X[n, k])$.
  3. Reassign the energy at $(n, k)$ to the frequency bin closest to $\hat{\omega}[n, k]$.

The result is a much more concentrated representation where each component's energy collapses onto a narrow ridge at its instantaneous frequency, rather than being spread across the mainlobe width of the window. This is particularly useful for separating closely spaced or crossing components. Synchrosqueezing also preserves invertibility, meaning you can reconstruct individual components from the sharpened representation.
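The three steps can be sketched for a single stationary tone. This simplified illustration rests on several assumptions and is not a full synchrosqueezing implementation: the hop is one sample, so the phase derivative is approximated by the phase difference between consecutive frames. The phase is referenced to the window center so that coefficients add coherently. Weak coefficients are discarded with an arbitrary relative threshold. The function names are hypothetical.

```python
import numpy as np

def stft_hop1(x, L):
    """STFT with a hop of one sample and phase referenced to the window center."""
    w = 0.5 * (1 - np.cos(2 * np.pi * np.arange(L) / L))        # periodic Hann
    frames = np.lib.stride_tricks.sliding_window_view(x, L) * w
    # ifftshift moves the window center to index 0 before the FFT
    return np.fft.rfft(np.fft.ifftshift(frames, axes=1), axis=1)

def synchrosqueeze(X, rel_threshold=1e-2):
    """Move each coefficient to the bin nearest its estimated instantaneous frequency."""
    n_bins = X.shape[1]
    nfft = 2 * (n_bins - 1)
    omega = np.angle(X[1:] * np.conj(X[:-1]))        # step 2: phase advance in rad/sample
    X = X[:-1]
    target = np.rint(omega * nfft / (2 * np.pi)).astype(int)
    keep = (np.abs(X) > rel_threshold * np.abs(X).max()) & (target >= 0) & (target < n_bins)
    T = np.zeros_like(X)
    rows = np.nonzero(keep)[0]
    np.add.at(T, (rows, target[keep]), X[keep])      # step 3: reassign along frequency
    return T, X

def concentration(A):
    """Mean fraction of each frame's energy that falls in its strongest bin."""
    P = np.abs(A) ** 2
    return (P.max(axis=1) / P.sum(axis=1)).mean()

fs = 1000
t = np.arange(2000) / fs
x = np.cos(2 * np.pi * 50 * t)                       # 50 Hz tone, between bins for L = 128

T, X = synchrosqueeze(stft_hop1(x, L=128))           # step 1 happens inside stft_hop1
print(f"STFT: {concentration(X):.2f}, squeezed: {concentration(T):.2f}")
```

For this tone the STFT spreads energy over several bins under the Hann mainlobe, while the squeezed version puts essentially all of it in the single bin nearest 50 Hz.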

Reassignment Method

The reassignment method generalizes synchrosqueezing by correcting localization in both time and frequency. For each STFT coefficient, it computes:

  • The instantaneous frequency (from the time derivative of phase) to correct frequency localization.
  • The group delay (from the frequency derivative of phase) to correct time localization.

Each coefficient is then moved from its nominal grid position $(n, k)$ to its reassigned position $(\hat{n}, \hat{k})$. The result is a sharper spectrogram with better localization of ridges and transients. Unlike synchrosqueezing, the reassignment method is generally not invertible, so it's primarily a visualization and analysis tool rather than a basis for signal modification.

Adaptive STFT

Adaptive STFT breaks free from the fixed-window constraint by allowing the window length (and potentially shape) to vary over time. The window parameters are adjusted based on local signal characteristics:

  • In regions where the signal is quasi-stationary, a longer window is used for better frequency resolution.
  • In regions with rapid changes or transients, a shorter window is used for better time resolution.

The adaptation can be driven by signal-dependent criteria such as local stationarity measures, entropy minimization, or time-frequency concentration metrics. Adaptive STFT bridges the gap between standard fixed-window STFT and fully multi-resolution methods like the wavelet transform, offering a data-driven compromise.