The short-time Fourier transform (STFT) addresses a core limitation of the standard Fourier transform, which cannot tell you when different frequencies occur in a signal. By windowing the signal into short segments and transforming each one, the STFT produces a time-frequency representation that tracks how spectral content evolves. This makes it essential for analyzing non-stationary signals in speech processing, music analysis, and biomedical applications.

Definition of STFT

The STFT is a time-frequency analysis technique that computes a localized frequency spectrum at successive time positions along a signal. Rather than transforming the entire signal at once (as the standard Fourier transform does), the STFT multiplies the signal by a short window function centered at a given time, then takes the Fourier transform of that windowed segment. Sliding the window forward and repeating the process builds up a full picture of how frequency content changes over time.

Mathematical representation

The continuous STFT of a signal $x(t)$ is defined as:

$X(t, f) = \int_{-\infty}^{\infty} x(\tau) \, w(\tau - t) \, e^{-j2\pi f \tau} \, d\tau$

$w(\tau - t)$ is the window function centered at time $t$, selecting a local segment of the signal
$f$ is the frequency variable
The integral computes the Fourier transform of the product $x(\tau) \, w(\tau - t)$

Note the convention: the window argument is $(\tau - t)$, meaning the window is centered at $\tau = t$. Some references write $w(t - \tau)$, which is equivalent when the window is symmetric.

The discrete STFT replaces the integral with a summation:

$X[n, k] = \sum_{m=0}^{N-1} x[m + nH] \, w[m] \, e^{-j2\pi k m / N}$

$n$ is the frame (time) index, $k$ is the frequency bin index
$N$ is the DFT length (the window length, or longer if each frame is zero-padded, in which case $w[m] = 0$ beyond the window)
$H$ is the hop size (the number of samples the window advances between frames)
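As a sanity check on the summation, the sketch below evaluates the discrete definition term by term with NumPy. The function name `stft_direct` and its arguments are illustrative assumptions rather than part of any library, and frames that would run past the end of the signal are simply dropped.

```python
import numpy as np

def stft_direct(x, w, hop, n_fft=None):
    """Evaluate X[n, k] = sum_m x[m + n*hop] * w[m] * exp(-2j*pi*k*m/N) term by term."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    win_len = len(w)
    N = win_len if n_fft is None else n_fft        # N > win_len corresponds to zero-padding
    n_frames = 1 + (len(x) - win_len) // hop       # keep only frames that fit inside x
    m = np.arange(win_len)
    k = np.arange(N)
    kernel = np.exp(-2j * np.pi * np.outer(k, m) / N)   # shape (N, win_len)
    X = np.empty((n_frames, N), dtype=complex)
    for n in range(n_frames):
        frame = x[n * hop : n * hop + win_len] * w
        X[n] = kernel @ frame                      # one DFT per frame
    return X

rng = np.random.default_rng(0)
x = rng.standard_normal(32)
w = np.hanning(8)
X = stft_direct(x, w, hop=4)
```

Each row of `X` should agree with `np.fft.fft` applied to the corresponding windowed frame; the explicit kernel is only there to mirror the formula, and an FFT would normally be used instead.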

Continuous vs discrete STFT

The continuous STFT is a theoretical construct defined for continuous-time signals, yielding a continuous function of both time and frequency. In practice, you work with sampled signals and compute the discrete STFT using the FFT algorithm. The discrete version evaluates the transform at discrete time frames (spaced by the hop size) and discrete frequency bins (spaced by $f_s / N$, where $f_s$ is the sampling rate).
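The frame and bin spacings translate into physical axes as in the short sketch below. The helper name `stft_axes` is hypothetical, and it assumes a real-valued signal, so only the bins from 0 Hz up to the Nyquist frequency are listed.

```python
import numpy as np

def stft_axes(n_frames, n_fft, hop, fs):
    """Return (frame start times in seconds, bin frequencies in Hz up to Nyquist)."""
    times = np.arange(n_frames) * hop / fs          # frames are spaced by the hop size
    freqs = np.arange(n_fft // 2 + 1) * fs / n_fft  # bins are spaced by fs / N
    return times, freqs

times, freqs = stft_axes(n_frames=5, n_fft=512, hop=256, fs=16000)
```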

STFT vs Fourier transform

Both the STFT and the standard Fourier transform decompose a signal into sinusoidal components using complex exponential basis functions, and both produce magnitude and phase spectra.

Similarities in frequency analysis

Both represent the signal in terms of frequency components
Both use the same complex exponential basis $e^{-j2\pi f t}$
The output of both includes magnitude (how strong each frequency is) and phase (the timing alignment of each component)

Differences in time-frequency resolution

The standard Fourier transform integrates over the entire signal, producing a single global spectrum with no time information. You know what frequencies are present, but not when they occur.

The STFT introduces time localization by analyzing short windowed segments. You get a spectrum for each time position, so you can track frequency changes. The cost is reduced frequency resolution: a shorter analysis window means fewer cycles of a sinusoid fit inside it, which broadens the spectral peaks.
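A small experiment can make the contrast concrete. In the sketch below (sampling rate, tone frequencies, and function names are all assumptions chosen for illustration), two signals contain the same two tones in opposite order. Their global magnitude spectra coincide, while a frame-by-frame analysis recovers the order.

```python
import numpy as np

fs = 1000                                   # sampling rate in Hz (assumed)
t_half = np.arange(fs // 2) / fs            # 0.5 s of samples
low = np.sin(2 * np.pi * 50 * t_half)       # 50 Hz tone
high = np.sin(2 * np.pi * 200 * t_half)     # 200 Hz tone
x_low_first = np.concatenate([low, high])
x_high_first = np.concatenate([high, low])

# Global magnitude spectra match: one signal is a circular shift of the other.
same_spectrum = np.allclose(np.abs(np.fft.fft(x_low_first)),
                            np.abs(np.fft.fft(x_high_first)))

def dominant_freq_per_frame(x, fs, frame_len):
    """Strongest frequency (Hz) in each non-overlapping Hann-windowed frame."""
    n_frames = len(x) // frame_len
    frames = x[: n_frames * frame_len].reshape(n_frames, frame_len)
    mags = np.abs(np.fft.rfft(frames * np.hanning(frame_len), axis=1))
    return mags.argmax(axis=1) * fs / frame_len

track_a = dominant_freq_per_frame(x_low_first, fs, 100)
track_b = dominant_freq_per_frame(x_high_first, fs, 100)
```

Here `track_a` steps from 50 Hz to 200 Hz halfway through and `track_b` does the reverse, which is exactly the timing information the global spectrum discards.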

Time-frequency resolution tradeoff

The STFT faces a fundamental tradeoff between time and frequency resolution, rooted in the Heisenberg-Gabor uncertainty principle.

Uncertainty principle

The uncertainty principle places a lower bound on the product of time resolution $\Delta t$ and frequency resolution $\Delta f$, measured as the RMS (standard-deviation) widths of the window in time and in frequency:

$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$

You cannot make both arbitrarily small simultaneously. Narrowing the window (better time resolution) necessarily widens the frequency uncertainty, and vice versa. This is not a limitation of the algorithm; it's a fundamental property of Fourier analysis.
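The bound can be checked numerically. The sketch below (function name, sampling rate, and window widths are assumptions) measures the RMS duration and RMS bandwidth of a sampled window from its energy distribution in time and frequency. A Gaussian should land essentially on $1/(4\pi)$, and a Hann window slightly above it.

```python
import numpy as np

def uncertainty_product(w, fs):
    """RMS duration times RMS bandwidth of a real window sampled at fs."""
    n = len(w)
    t = np.arange(n) / fs
    f = np.fft.fftfreq(n, d=1.0 / fs)
    p_t = np.abs(w) ** 2                         # energy density in time
    p_f = np.abs(np.fft.fft(w)) ** 2             # energy density in frequency
    t_mean = np.sum(t * p_t) / np.sum(p_t)
    dt = np.sqrt(np.sum((t - t_mean) ** 2 * p_t) / np.sum(p_t))
    df = np.sqrt(np.sum(f ** 2 * p_f) / np.sum(p_f))   # real windows are centred on f = 0
    return dt * df

fs, n = 1000, 2048
t = (np.arange(n) - n // 2) / fs
gauss = np.exp(-t ** 2 / (2 * 0.05 ** 2))        # Gaussian with sigma = 50 ms
hann = np.zeros(n)
hann[n // 2 - 100 : n // 2 + 101] = np.hanning(201)   # 0.2 s Hann, zero elsewhere
bound = 1 / (4 * np.pi)
gauss_product = uncertainty_product(gauss, fs)
hann_product = uncertainty_product(hann, fs)
```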

Window size impact

Short window: captures rapid temporal changes (good time resolution), but each segment contains few oscillation cycles, so closely spaced frequencies blur together (poor frequency resolution)
Long window: contains many oscillation cycles, allowing fine frequency discrimination (good frequency resolution), but temporal events get smeared across the window duration (poor time resolution)

Choosing the window length is always application-dependent. For a signal with fast transients (like percussive sounds), you'd favor a shorter window. For a signal with closely spaced tones (like two nearby musical notes), a longer window helps separate them.
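The two-tone case can be illustrated with a single frame. In the sketch below, the tone frequencies, sampling rate, and the 6 dB peak threshold are assumptions chosen for the demonstration; the spectrum is heavily zero-padded so that peak counting is not limited by bin spacing.

```python
import numpy as np

def count_peaks(win_len, fs=8000, f1=1000.0, f2=1050.0, n_fft=8192):
    """Count spectral peaks within 6 dB of the maximum for one Hann-windowed frame."""
    t = (np.arange(win_len) - win_len / 2) / fs           # time axis centred on the frame
    frame = np.cos(2 * np.pi * f1 * t) + np.cos(2 * np.pi * f2 * t)
    mag = np.abs(np.fft.rfft(frame * np.hanning(win_len), n_fft))
    interior = mag[1:-1]
    is_peak = (interior > mag[:-2]) & (interior > mag[2:]) & (interior > 0.5 * mag.max())
    return int(is_peak.sum())

short_window_peaks = count_peaks(64)      # 8 ms window
long_window_peaks = count_peaks(1024)     # 128 ms window
```

With these settings the 64-sample window merges the two tones into one broad peak, while the 1024-sample window separates them, at the price of a frame sixteen times longer in time.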

Overlap between windows

Overlapping consecutive windows smooths the time-frequency representation and prevents events from being split awkwardly between frames. Typical overlap values are 50% to 75% of the window length. Higher overlap gives denser time sampling (more frames per second) at the cost of increased computation, but it does not improve the underlying time-frequency resolution set by the window size.
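To see how overlap affects the amount of computation, the sketch below converts an overlap fraction into a hop size and counts the frames that fit in a signal. The helper name and the one-second, 16 kHz example are assumptions.

```python
def hop_and_frames(n_samples, win_len, overlap):
    """Hop size and number of complete frames for a given overlap fraction in [0, 1)."""
    hop = round(win_len * (1.0 - overlap))
    n_frames = 1 + (n_samples - win_len) // hop
    return hop, n_frames

# One second of audio at 16 kHz with a 512-sample window
no_overlap = hop_and_frames(16000, 512, 0.0)
half_overlap = hop_and_frames(16000, 512, 0.5)
three_quarter_overlap = hop_and_frames(16000, 512, 0.75)
```

Going from no overlap to 75% overlap roughly quadruples the number of FFTs, while each frame still spans the same 512 samples.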

STFT parameters

Three interrelated parameters control the STFT's behavior: the window function, the window length, and the hop size.

Window function types

The window function shapes each segment before the FFT. Different windows trade off between mainlobe width (frequency resolution) and sidelobe level (spectral leakage suppression):

Window	Mainlobe Width	Sidelobe Level	Use Case
Rectangular	Narrowest	Highest (-13 dB)	Maximum frequency resolution, but severe leakage
Hann	Moderate	Low (-31 dB)	General-purpose; good leakage suppression
Hamming	Moderate	Lower (-43 dB)	Similar to Hann with better near-sidelobe suppression
Gaussian	Depends on $\sigma$	Very low	Achieves the theoretical minimum uncertainty product
Blackman-Harris	Wide	Very low (-92 dB)	When sidelobe suppression is critical

The Hann window is the most common default choice. The Gaussian window is notable because it achieves the minimum $\Delta t \cdot \Delta f$ product, making it optimal in the uncertainty-principle sense.
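The sidelobe figures in the table can be approximated numerically. The sketch below (helper name, window length, and FFT size are assumptions) zero-pads each window, walks down the mainlobe to the first null, and reports the highest level beyond it. NumPy does not ship a Blackman-Harris window, so only the rectangular, Hann, and Hamming rows are checked.

```python
import numpy as np

def peak_sidelobe_db(w, n_fft=8192):
    """Highest sidelobe level in dB relative to the mainlobe peak."""
    mag = np.abs(np.fft.rfft(w, n_fft))
    mag = mag / mag[0]                               # mainlobe peak sits at DC
    first_null = np.nonzero(np.diff(mag) > 0)[0][0]  # first point where the spectrum rises again
    return 20 * np.log10(mag[first_null:].max())

rect_db = peak_sidelobe_db(np.ones(64))
hann_db = peak_sidelobe_db(np.hanning(64))
hamming_db = peak_sidelobe_db(np.hamming(64))
```

The exact values shift slightly with window length and with the symmetric versus periodic window definition, so small deviations from the tabulated numbers are expected.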


Window length selection

The window length $N$ directly sets the time-frequency resolution balance:

Time resolution $\approx N / f_s$ seconds
Frequency resolution $\approx f_s / N$ Hz

For example, at $f_s = 16000$ Hz with $N = 512$: the time resolution is about 32 ms and the frequency resolution is about 31.25 Hz. Doubling $N$ to 1024 halves the bin spacing to ~15.6 Hz (finer frequency resolution) but doubles the window duration to ~64 ms (coarser time resolution).
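The same arithmetic as a small helper; the function name is an assumption, and the two returned values are the approximations given above, not exact resolution limits.

```python
def stft_resolution(fs, n):
    """Return (window duration in seconds, bin spacing in Hz) for an n-sample window."""
    return n / fs, fs / n

for n in (512, 1024):
    duration, spacing = stft_resolution(16000, n)
    print(f"N={n}: {duration * 1e3:.1f} ms, {spacing:.3f} Hz")
```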

Hop size determination

The hop size $H$ is the number of samples between successive window positions. It controls how densely you sample the time axis:

Hop size = window length (no overlap): fastest computation, but events near frame edges are attenuated by a tapered window and may be missed
Hop size = window length / 2 (50% overlap): standard choice balancing density and cost
Hop size = window length / 4 (75% overlap): smoother representation, needed for some reconstruction algorithms

For perfect reconstruction (inverting the STFT back to a time-domain signal), the hop size and window must satisfy the constant overlap-add (COLA) constraint: copies of the window shifted by multiples of the hop size must sum to a constant.
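One way to test a window and hop size against the COLA constraint is to fold the window onto a single hop period and check that the result is flat. The sketch below does this; the function name `is_cola` is an assumption, and it covers only plain overlap-add without a separate synthesis window.

```python
import numpy as np

def is_cola(w, hop):
    """True if copies of w shifted by multiples of hop sum to a constant."""
    win_len = len(w)
    n_shifts = -(-win_len // hop)                     # ceil(win_len / hop)
    padded = np.concatenate([w, np.zeros(n_shifts * hop - win_len)])
    overlap_sum = padded.reshape(n_shifts, hop).sum(axis=0)   # sum over k of w[m + k*hop]
    return bool(np.allclose(overlap_sum, overlap_sum[0]))

periodic_hann = np.hanning(513)[:-1]   # periodic Hann: drop the last sample of a symmetric one
symmetric_hann = np.hanning(512)
```

The periodic Hann window passes at 50% and 75% overlap. The symmetric variant returned by `np.hanning(512)` misses the constraint by a small ripple, which is one reason reconstruction code usually uses the periodic form.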

STFT computation

Computing the STFT follows a straightforward sliding-window procedure.

Sliding window approach

1. Place the window function at the beginning of the signal (frame $n = 0$)
2. Multiply the signal samples within the window by the window function values, producing a windowed segment
3. Advance the window by the hop size $H$ samples
4. Repeat steps 2-3 until the window has traversed the entire signal
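The sliding-window procedure above can be sketched with NumPy index arithmetic instead of an explicit loop. The function name is an assumption, and trailing samples that do not fill a complete window are dropped.

```python
import numpy as np

def frame_signal(x, w, hop):
    """Return an array of shape (n_frames, len(w)) holding the windowed segments."""
    x = np.asarray(x, dtype=float)
    win_len = len(w)
    n_frames = 1 + (len(x) - win_len) // hop              # frames that fit entirely inside x
    starts = hop * np.arange(n_frames)                    # step 3: advance by the hop size
    idx = starts[:, None] + np.arange(win_len)[None, :]   # sample indices covered by each frame
    return x[idx] * w                                     # step 2: multiply by the window

frames = frame_signal(np.arange(10), np.ones(4), hop=2)
```

With a ramp input and a rectangular window, the rows of `frames` show directly which samples each frame covers.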

FFT of windowed signal

For each windowed segment:

(Optional) Zero-pad the segment to a length that is a power of 2, which speeds up the FFT and interpolates the frequency axis for smoother spectra (it does not improve the underlying frequency resolution)
Compute the FFT of the windowed (and possibly zero-padded) segment
Store the resulting complex-valued spectrum as one column of the STFT matrix

The output is a 2D complex matrix $X[n, k]$ with time frames along one axis and frequency bins along the other.
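Putting the per-segment steps together gives a minimal STFT. The sketch below assumes a Hann window and a real-valued input, so `np.fft.rfft` keeps only the non-negative frequency bins; the function names and the test tone are illustrative.

```python
import numpy as np

def next_pow2(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()

def stft(x, win_len, hop):
    """Complex STFT matrix with one column per frame and one row per frequency bin."""
    w = np.hanning(win_len)
    n_fft = next_pow2(win_len)                        # optional zero-padding target
    n_frames = 1 + (len(x) - win_len) // hop
    X = np.empty((n_fft // 2 + 1, n_frames), dtype=complex)
    for n in range(n_frames):
        segment = x[n * hop : n * hop + win_len] * w
        X[:, n] = np.fft.rfft(segment, n_fft)         # rfft zero-pads the segment to n_fft
    return X

fs = 8000
t = np.arange(1600) / fs
X = stft(np.cos(2 * np.pi * 1000 * t), win_len=400, hop=200)
```

For this 1000 Hz tone, the 400-sample window is padded to 512 points, so the peak of every column falls in bin 64 (64 × 8000 / 512 = 1000 Hz).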

Spectrogram representation

The spectrogram is the squared magnitude of the STFT:

$S[n, k] = |X[n, k]|^2$

It's displayed as a 2D image with time on the horizontal axis, frequency on the vertical axis, and color (or intensity) encoding the power at each time-frequency point. Spectrograms are often plotted on a logarithmic (dB) scale to make quiet components visible alongside loud ones.
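Converting an STFT matrix to a dB-scaled spectrogram is a short step. In the sketch below, the function name, the normalization to the maximum, and the -120 dB floor are assumptions; the floor avoids taking the logarithm of zero, and the input is assumed to contain at least one nonzero coefficient.

```python
import numpy as np

def spectrogram_db(X, floor_db=-120.0):
    """Power spectrogram in dB relative to its maximum, clipped at floor_db."""
    S = np.abs(X) ** 2                               # S[n, k] = |X[n, k]|^2
    S_rel = S / S.max()
    return 10.0 * np.log10(np.maximum(S_rel, 10.0 ** (floor_db / 10.0)))

S_db = spectrogram_db(np.array([[1.0 + 0.0j, 0.1j, 0.0]]))
```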

Interpreting STFT results

Time-frequency plane

Each point $(n, k)$ in the STFT matrix corresponds to a specific time frame and frequency bin. The magnitude at that point tells you how much energy the signal has at that frequency during that time window. Reading across a row (fixed frequency) shows how that frequency's energy evolves over time. Reading down a column (fixed time) gives the short-time spectrum for that frame.

Magnitude vs phase information

The STFT coefficients are complex-valued: $X[n,k] = |X[n,k]| \, e^{j\phi[n,k]}$.

Magnitude $|X[n,k]|$ tells you the strength of each frequency component at each time. This is what the spectrogram displays, and it's what most analysis tasks rely on.
Phase $\phi[n,k]$ encodes the temporal alignment of each component. Phase is critical for signal reconstruction (inverse STFT) and for techniques like the phase vocoder used in time-stretching and pitch-shifting audio.
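The role of phase in reconstruction can be seen on a single frame. The sketch below uses random data as a stand-in for one windowed segment; it rebuilds the segment once from magnitude and phase, and once from magnitude alone.

```python
import numpy as np

rng = np.random.default_rng(1)
frame = rng.standard_normal(64)                 # stand-in for one windowed segment
X = np.fft.rfft(frame)
magnitude, phase = np.abs(X), np.angle(X)

with_phase = np.fft.irfft(magnitude * np.exp(1j * phase), n=64)
without_phase = np.fft.irfft(magnitude, n=64)   # same magnitudes, phase discarded
```

`with_phase` matches the original segment, while `without_phase` has an identical magnitude spectrum but a different waveform.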

Identifying signal components

Different signal features produce characteristic patterns in the spectrogram:

Stationary tones appear as horizontal bands at constant frequency
Frequency sweeps (chirps) appear as diagonal or curved traces
Transients (clicks, onsets) appear as vertical lines spanning many frequencies at a single time instant
Harmonics of a periodic signal appear as a stack of evenly spaced horizontal bands
Broadband noise fills a wide frequency range with relatively uniform energy
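As an example of reading one of these patterns programmatically, the sketch below follows the strongest bin in each frame of a linear chirp. The chirp parameters, window length, and function name are assumptions. The resulting track rises steadily, which is the numeric counterpart of the diagonal trace.

```python
import numpy as np

def ridge_frequencies(x, fs, win_len):
    """Frequency (Hz) of the strongest bin in each non-overlapping Hann-windowed frame."""
    n_frames = len(x) // win_len
    frames = x[: n_frames * win_len].reshape(n_frames, win_len)
    mags = np.abs(np.fft.rfft(frames * np.hanning(win_len), axis=1))
    return mags.argmax(axis=1) * fs / win_len

fs = 8000
t = np.arange(fs) / fs                                 # one second
chirp = np.cos(2 * np.pi * (500 * t + 0.5 * 1000 * t ** 2))   # sweeps 500 Hz -> 1500 Hz
track = ridge_frequencies(chirp, fs, win_len=256)
```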

Applications of STFT


Speech processing

Speech recognition: STFT-derived features (like mel-frequency spectrograms) serve as input to recognition models. The time-frequency representation captures formant transitions and phoneme boundaries.
Speech enhancement: Noise components can be identified and suppressed in the time-frequency domain (e.g., spectral subtraction), then the cleaned signal is reconstructed via inverse STFT.
Voice activity detection: Speech segments show structured harmonic patterns in the spectrogram, while silence or noise does not, making it straightforward to detect when someone is speaking.

Music analysis

Pitch estimation: The fundamental frequency and its harmonics appear as distinct peaks in each STFT frame, enabling pitch tracking over time.
Onset detection: Musical note onsets produce transient broadband energy bursts that are visible as vertical features in the spectrogram.
Music transcription: Combining pitch estimation and onset detection across time frames allows conversion of audio into symbolic notation (e.g., MIDI).

Biomedical signal processing

EEG analysis: Brain rhythms (alpha, beta, theta, delta bands) occupy specific frequency ranges. The STFT reveals how these rhythms change during different cognitive states or in response to stimuli.
ECG analysis: Time-frequency representations help detect arrhythmias and other cardiac events that involve transient changes in the heart's electrical activity.
EMG analysis: Muscle fatigue manifests as a shift in the median frequency of the EMG power spectrum over time, which the STFT can track.
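For the EMG case, the median frequency of a power spectrum can be computed as the frequency at which the cumulative power reaches half of the total. The sketch below uses synthetic Gaussian-shaped spectra as stand-ins for real data. Applying the hypothetical `median_frequency` helper to each frame of a spectrogram would give the track over time.

```python
import numpy as np

def median_frequency(freqs, power):
    """Frequency at which the cumulative power first reaches half of the total."""
    cumulative = np.cumsum(power)
    idx = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return freqs[idx]

freqs = np.arange(201.0)                              # 0..200 Hz in 1 Hz steps
fresh = np.exp(-(freqs - 80.0) ** 2 / 200.0)          # synthetic spectrum centred at 80 Hz
fatigued = np.exp(-(freqs - 60.0) ** 2 / 200.0)       # same shape shifted down to 60 Hz
```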

Limitations of STFT

Fixed time-frequency resolution

The most significant limitation: once you choose a window length, the time-frequency resolution is locked for the entire analysis. A signal that contains both fast transients and closely spaced steady tones cannot be optimally analyzed with a single fixed window. This is the primary motivation for multi-resolution methods like the wavelet transform.

Spectral leakage

When a signal's frequency components don't fall exactly on the DFT's discrete frequency bins, their energy "leaks" into neighboring bins. This creates artificial sidelobes in the spectrum. Window functions reduce leakage by tapering the segment edges to zero, but they also widen the mainlobe, reducing frequency resolution. There's no window that eliminates leakage entirely without sacrificing resolution.
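Leakage is easy to reproduce with a single frame. The sketch below (function name, frame length, and the 3-bin guard band are assumptions) measures the fraction of spectral energy that ends up far from the strongest bin, for a tone exactly on a bin and for one halfway between two bins.

```python
import numpy as np

def far_leakage(bin_pos, window, guard=3):
    """Fraction of spectral energy more than `guard` bins away from the strongest bin."""
    N = len(window)
    n = np.arange(N)
    frame = np.cos(2 * np.pi * bin_pos * n / N) * window   # tone at bin_pos cycles per frame
    energy = np.abs(np.fft.rfft(frame)) ** 2
    k = np.arange(len(energy))
    far = np.abs(k - energy.argmax()) > guard
    return energy[far].sum() / energy.sum()

N = 64
on_bin_rect = far_leakage(8.0, np.ones(N))       # tone exactly on bin 8, no taper
off_bin_rect = far_leakage(8.5, np.ones(N))      # tone halfway between bins 8 and 9
off_bin_hann = far_leakage(8.5, np.hanning(N))   # same tone with a Hann taper
```

The on-bin tone shows essentially no far leakage. The off-bin tone spreads a few percent of its energy across distant bins with a rectangular window, and the Hann taper cuts that by orders of magnitude while widening the peak itself.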

Boundary effects

At the beginning and end of the signal, the window extends beyond the available data. The standard assumption is that the signal is zero outside its duration (implicit zero-padding), but this introduces discontinuities that distort the spectrum in those edge frames. Common mitigations include reflecting the signal at its boundaries or simply discarding the edge frames from the analysis.
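The difference between padding strategies is easiest to see on a toy signal. The sketch below pads by half a window on each side with `np.pad` so that the first frame is centred on the first sample. The function name is an assumption, and the window multiplication is omitted to keep the padded values visible.

```python
import numpy as np

def centered_frames(x, win_len, hop, mode="reflect"):
    """Frames centred on samples 0, hop, 2*hop, ... after padding both ends."""
    pad = win_len // 2
    xp = np.pad(np.asarray(x, dtype=float), pad, mode=mode)
    n_frames = 1 + (len(xp) - win_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(win_len)[None, :]
    return xp[idx]

x = np.arange(8.0)
reflected = centered_frames(x, win_len=4, hop=2, mode="reflect")
zero_padded = centered_frames(x, win_len=4, hop=2, mode="constant")
```

With reflection, the first frame continues the ramp smoothly as a mirror image. With zeros, it jumps from 0 to the signal values, which is the discontinuity described above.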

Advanced STFT techniques

Multi-resolution analysis

Multi-resolution approaches use different window lengths for different frequency bands. Low frequencies (which need fine frequency resolution) get long windows, while high frequencies (which need fine time resolution) get short windows. The wavelet transform implements this naturally through its dyadic scaling structure. Multi-resolution STFT variants achieve similar effects by computing multiple STFTs with different window sizes and combining the results.
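A very simple reading of "computing multiple STFTs with different window sizes" is sketched below. The window lengths, the 75% overlap, and both function names are assumptions, and how the resulting matrices are combined is application-specific and not shown.

```python
import numpy as np

def stft_mag(x, win_len, hop):
    """Magnitude STFT, shape (bins, frames), using a Hann window."""
    w = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    idx = hop * np.arange(n_frames)[:, None] + np.arange(win_len)[None, :]
    return np.abs(np.fft.rfft(x[idx] * w, axis=1)).T

def multi_resolution_stft(x, win_lens=(64, 256, 1024)):
    """One magnitude STFT per window length, each with 75% overlap."""
    return {win_len: stft_mag(x, win_len, win_len // 4) for win_len in win_lens}

rng = np.random.default_rng(2)
layers = multi_resolution_stft(rng.standard_normal(4096))
```

The short-window layer has many frames and few bins, and the long-window layer has the opposite shape, so any combination step has to resample one grid onto the other.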

Adaptive window sizes

Rather than fixing the window length globally or by frequency band, adaptive methods choose the window size locally based on the signal's characteristics at each time instant. For example, if the signal is locally stationary, a longer window is selected for better frequency resolution; if a transient is detected, the window shortens to capture it precisely. These methods add computational complexity but can significantly improve the representation for signals with diverse time-frequency behavior.

Synchrosqueezing transform

Synchrosqueezing is a post-processing step applied to the STFT that sharpens the time-frequency representation. It works by:

Computing the standard STFT
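Estimating the instantaneous frequency at each time-frequency point from the phase derivative: $\omega(t, f) = \frac{\partial}{\partial t} \arg X(t, f)$ (in rad/s, for an STFT whose phase is referenced to the window position, as in the discrete definition; with the continuous definition given earlier, add $2\pi f$)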
Reassigning each STFT coefficient from its original frequency bin to the bin corresponding to its estimated instantaneous frequency

This concentrates energy more tightly around the true frequency trajectories, producing a crisper time-frequency picture. Unlike the closely related reassigned spectrogram, the synchrosqueezed representation remains invertible, meaning you can reconstruct the signal from it. It's particularly effective for signals composed of well-separated amplitude- and frequency-modulated components.
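The frequency-estimation step can be sketched by replacing the time derivative with a finite difference over one sample. This is only that step, not a full synchrosqueezing implementation. The function name, test tone, and parameters are assumptions, and each frame's phase is referenced to its own start, as in the discrete definition.

```python
import numpy as np

def instantaneous_frequency(x, fs, win_len, start):
    """Per-bin frequency estimate (Hz) from the phase advance over one sample."""
    w = np.hanning(win_len)
    X0 = np.fft.rfft(x[start : start + win_len] * w)
    X1 = np.fft.rfft(x[start + 1 : start + 1 + win_len] * w)   # window moved by one sample
    dphi = np.angle(X1 * np.conj(X0))                # phase advance, wrapped to (-pi, pi]
    return dphi * fs / (2 * np.pi), np.abs(X0)

fs = 8000
t = np.arange(2048) / fs
x = np.cos(2 * np.pi * 1010.0 * t)                   # tone between bin centres
f_inst, mag = instantaneous_frequency(x, fs, win_len=256, start=512)
peak = int(mag.argmax())
```

The strongest bin is centred at 1000 Hz, yet the phase-based estimate in that bin and its neighbours is close to 1010 Hz. Reassigning those coefficients to the estimated frequency is what sharpens the representation.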