Definition of STFT
The Short-Time Fourier Transform (STFT) lets you analyze how the frequency content of a signal changes over time. The standard Fourier transform gives you a global frequency picture but tells you nothing about when those frequencies occur. For non-stationary signals (where frequency content shifts over time), that's a serious problem. STFT solves it by windowing the signal into short segments and computing the Fourier transform of each one, producing a two-dimensional time-frequency representation.
Fourier Transform for Non-Stationary Signals
The classical Fourier transform assumes stationarity: the frequency content doesn't change over the duration of the signal. Real-world signals like speech, music, and EEG rarely satisfy this assumption. Their spectral characteristics evolve continuously.
STFT addresses this by localizing the analysis in time. Rather than transforming the entire signal at once, you isolate short segments where the signal is approximately stationary, then transform each segment independently.
Time-Frequency Representation
STFT maps a one-dimensional time-domain signal onto a two-dimensional time-frequency plane. Each point $(t, f)$ in this plane indicates the magnitude (and phase) of the frequency component $f$ at time $t$.
This representation lets you visualize spectral evolution directly. A pure tone that sweeps upward in frequency, for instance, appears as a diagonal ridge in the time-frequency plane rather than a smeared blob across all frequencies.
Sliding Window Approach
The mechanics of STFT follow a straightforward procedure:
1. Choose a window function $w[n]$ of length $L$.
2. Position the window at time index $m$.
3. Multiply the signal $x[n]$ by the shifted window $w[n - m]$ to extract a local segment.
4. Compute the Fourier transform of that windowed segment.
5. Slide the window forward by a hop size $H$ (where $H < L$ if you want overlap) and repeat.
The result is a series of time-localized spectra, one per window position, that together form the full time-frequency representation.
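The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function name `stft`, the 8 kHz sampling rate, the 1 kHz test tone, and the window and hop sizes are all assumptions chosen for the example, and frames that would run past the end of the signal are simply dropped.

```python
import numpy as np

def stft(x, window, hop):
    """Return a (num_frames, L // 2 + 1) array of one-sided spectra."""
    L = len(window)
    num_frames = 1 + (len(x) - L) // hop        # keep only frames that fit entirely
    frames = np.empty((num_frames, L // 2 + 1), dtype=complex)
    for i in range(num_frames):
        start = i * hop                          # position the window
        segment = x[start:start + L] * window    # extract the local segment
        frames[i] = np.fft.rfft(segment)         # Fourier transform of the segment
    return frames

fs = 8000                                        # assumed sampling rate in Hz
t = np.arange(fs) / fs                           # one second of samples
x = np.sin(2 * np.pi * 1000 * t)                 # 1 kHz test tone (assumption)

X = stft(x, np.hanning(256), hop=128)            # 50% overlap
peak_bins = np.abs(X).argmax(axis=1)             # strongest bin in each frame
print(X.shape, peak_bins[0] * fs / 256)          # frame count and peak frequency in Hz
```

Each row of `X` is one time-localized spectrum; stacking the rows gives the time-frequency representation.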
Properties of STFT
The usefulness of your STFT output depends heavily on how you configure the analysis. The core properties below govern what you can and can't resolve.
Time and Frequency Resolution Trade-off
This is the single most important concept in STFT analysis. Time resolution is your ability to pinpoint when something happens; frequency resolution is your ability to distinguish which frequencies are present.
- A longer window captures more oscillation cycles, giving you finer frequency resolution but smearing events across a wider time interval.
- A shorter window localizes events precisely in time but blurs the frequency spectrum because fewer cycles are observed.
You cannot improve both simultaneously. Every STFT configuration is a compromise.
Window Size vs. Frequency Resolution
The frequency resolution of the STFT is approximately $\Delta f \approx f_s / L$, where $f_s$ is the sampling rate and $L$ is the window length in samples. Doubling the window length roughly halves $\Delta f$, giving you twice the frequency detail.
The trade-off cost: the time resolution is approximately $\Delta t \approx L / f_s$. So that same doubling of $L$ also doubles the time interval over which spectral information is averaged. If the signal's frequency content changes rapidly within that interval, you'll miss it.
Choosing the optimal window size requires knowing something about your signal. For speech, window lengths of 20–40 ms are typical because speech can be considered quasi-stationary over that duration.
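To make the trade-off concrete, the short sketch below converts a window duration into a sample count and the two approximate resolutions, using the common rules of thumb that frequency resolution is about the sampling rate divided by the window length and time resolution is about the window duration. The helper name `stft_resolution` and the 16 kHz sampling rate are assumptions for illustration.

```python
def stft_resolution(fs, window_ms):
    """Approximate STFT resolutions for a window of the given duration."""
    L = round(fs * window_ms / 1000)   # window length in samples
    delta_f = fs / L                   # approximate frequency resolution in Hz
    delta_t = L / fs                   # approximate time resolution in seconds
    return L, delta_f, delta_t

fs = 16000                             # assumed sampling rate for speech, in Hz
for window_ms in (20, 40):
    L, delta_f, delta_t = stft_resolution(fs, window_ms)
    print(f"{window_ms} ms window: L={L}, df={delta_f:.1f} Hz, dt={delta_t * 1000:.0f} ms")
```

Under these assumptions, moving from a 20 ms to a 40 ms window halves the bin spacing while doubling the span of time each spectrum summarizes.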
Overlap Between Windows
Adjacent windows typically share a fraction of their samples. This overlap serves two purposes:
- Temporal continuity: Without overlap, you get coarse time sampling of the spectral evolution. Overlap increases the number of spectral snapshots per unit time.
- Artifact reduction: Window functions taper the signal at the edges, attenuating samples near window boundaries. Overlap ensures those attenuated regions are captured at full weight in neighboring windows.
Common overlap values are 50% to 75% of the window length. Higher overlap gives smoother time-frequency representations at the cost of increased computation (more frames to process).
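The computational cost of overlap is easy to quantify, since the number of frames scales inversely with the hop size. The sketch below counts frames for a one-second signal; the helper `frame_count`, the 16 kHz rate, and the 400-sample window are assumptions, and partial frames at the end are ignored.

```python
def frame_count(num_samples, L, overlap):
    """Return (hop, number of full frames) for a fractional overlap in [0, 1)."""
    hop = int(round(L * (1 - overlap)))          # samples between frame starts
    return hop, 1 + (num_samples - L) // hop     # frames that fit entirely

num_samples, L = 16000, 400                      # 1 s at 16 kHz, 25 ms window (assumed)
for overlap in (0.0, 0.5, 0.75):
    hop, frames = frame_count(num_samples, L, overlap)
    print(f"overlap={overlap:.0%}: hop={hop}, frames={frames}")
```

Going from 50% to 75% overlap roughly doubles the number of FFTs for the same signal.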
Computation of STFT
Discrete-Time STFT
For digital implementation, the STFT operates on sampled signals. The discrete-time STFT is defined as:
$$X[m, k] = \sum_{n=-\infty}^{\infty} x[n]\, w[n - m]\, e^{-j 2\pi k n / N}$$
where:
- $x[n]$ is the input signal
- $w[n - m]$ is the window function centered at time index $m$
- $N$ is the DFT length (number of frequency bins)
- $k$ is the frequency bin index, $k = 0, 1, \ldots, N-1$
In practice, the summation limits are finite because $w[n]$ has finite support of length $L$. The DFT length $N$ is often chosen as a power of 2 and may be larger than $L$ (zero-padding) to interpolate the frequency axis for smoother spectral estimates.
Efficient Implementation Using FFT
Computing the DFT directly for each frame costs $O(N^2)$ operations. The FFT reduces this to $O(N \log N)$ per frame, which matters enormously when you're processing thousands of frames.
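As a sanity check on the claim that the FFT is only a faster route to the same numbers, the sketch below evaluates the DFT sum directly as a matrix-vector product (quadratic cost) and compares it with `np.fft.fft`. The function name `dft_direct` and the random 256-sample frame are assumptions for illustration.

```python
import numpy as np

def dft_direct(x):
    """Evaluate the DFT sum directly: N output bins, N terms each."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    W = np.exp(-2j * np.pi * k * n / N)   # N x N matrix of complex exponentials
    return W @ x

rng = np.random.default_rng(0)
frame = rng.standard_normal(256)          # stand-in for one windowed frame

same = np.allclose(dft_direct(frame), np.fft.fft(frame))
print(same)
```

The two results agree to floating-point precision; only the operation count differs.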
The practical computation steps are:
1. Extract the $m$-th frame: multiply $x[n]$ by $w[n - m]$.
2. Zero-pad the windowed segment to length $N$ if $N > L$.
3. Apply the FFT to obtain $X[m, k]$ for $k = 0, 1, \ldots, N-1$.
4. Advance $m$ by the hop size $H$ and repeat.
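The effect of the zero-padding step can be seen on a single frame. In the sketch below, a tone that falls between bins is analyzed once at the window length and once with the DFT length increased eightfold; the denser frequency grid places the spectral peak closer to the true frequency. The 1010 Hz tone, 8 kHz rate, and helper name `peak_frequency` are assumptions. Note that zero-padding interpolates the spectrum and does not narrow the mainlobe.

```python
import numpy as np

fs, L = 8000, 256                               # assumed sampling rate and window length
n = np.arange(L)
segment = np.sin(2 * np.pi * 1010 * n / fs) * np.hanning(L)   # windowed frame

def peak_frequency(segment, N, fs):
    """Frequency (Hz) of the largest bin in an N-point one-sided FFT."""
    spectrum = np.abs(np.fft.rfft(segment, n=N))   # n=N zero-pads when N > len(segment)
    return spectrum.argmax() * fs / N

f_plain = peak_frequency(segment, L, fs)        # bin spacing fs / L
f_padded = peak_frequency(segment, 8 * L, fs)   # bin spacing fs / (8 L)
print(f_plain, f_padded)
```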
Spectrogram Representation
The spectrogram is the squared magnitude of the STFT:
$$S[m, k] = |X[m, k]|^2$$
It's displayed as a 2D image with time on the horizontal axis, frequency on the vertical axis, and color/brightness encoding energy. The spectrogram discards phase information, which is acceptable for visualization and many analysis tasks but not for signal reconstruction (where phase is essential).
Spectrograms are the primary tool for visual inspection of time-frequency content. Sustained tones appear as horizontal ridges, broadband transients appear as vertical streaks, and frequency sweeps (chirps) appear as diagonal traces.
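The diagonal trace of a chirp can be reproduced numerically. The sketch below builds a linear chirp, computes the squared-magnitude STFT, converts it to decibels, and tracks the strongest bin in each frame. The chirp range (500 Hz to 3000 Hz over two seconds), the 8 kHz rate, and the window and hop sizes are assumptions; plotting is left out so the example stays dependency-free.

```python
import numpy as np

fs = 8000                                        # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                       # two seconds
f0, f1 = 500.0, 3000.0                           # assumed chirp start and end frequencies
rate = (f1 - f0) / t[-1]                         # sweep rate in Hz per second
x = np.sin(2 * np.pi * (f0 * t + 0.5 * rate * t ** 2))

L, hop = 256, 128
window = np.hanning(L)
starts = np.arange(0, len(x) - L + 1, hop)       # frame start indices
frames = np.stack([x[s:s + L] * window for s in starts])

S = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # spectrogram: squared magnitude
S_db = 10 * np.log10(S + 1e-12)                  # dB scale; small offset avoids log(0)
ridge_hz = S.argmax(axis=1) * fs / L             # strongest frequency per frame

print(S_db.shape, ridge_hz[0], ridge_hz[-1])
```

The ridge frequency climbs steadily from frame to frame, which is the diagonal trace described above. `S_db` is the array you would hand to an image-plotting routine, with frames along one axis and frequency bins along the other.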

Window Functions for STFT
The window function shapes both the spectral resolution and the leakage characteristics of your STFT. Every window involves a trade-off between mainlobe width (which determines frequency resolution) and sidelobe level (which determines how much energy from one frequency leaks into neighboring bins).
Rectangular Window
The rectangular window assigns uniform weight across all samples and zero outside:
$$w[n] = \begin{cases} 1, & 0 \le n \le L-1 \\ 0, & \text{otherwise} \end{cases}$$
It has the narrowest mainlobe of any window of the same length, giving the best nominal frequency resolution. However, its sidelobes are only about 13 dB below the mainlobe peak, causing severe spectral leakage. Weak frequency components near strong ones can be completely masked. For this reason, the rectangular window is rarely used in practice.
Hann and Hamming Windows
Both are raised-cosine windows that taper smoothly to reduce spectral leakage.
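- Hann window: $w[n] = 0.5\left(1 - \cos\left(\frac{2\pi n}{L-1}\right)\right)$ for $0 \le n \le L-1$. It tapers to zero at both endpoints, which makes it well-suited for overlap-add reconstruction. First sidelobe is about $-32$ dB.
- Hamming window: $w[n] = 0.54 - 0.46\cos\left(\frac{2\pi n}{L-1}\right)$ for $0 \le n \le L-1$. It doesn't quite reach zero at the endpoints. Its mainlobe is slightly narrower than Hann's and its peak sidelobe is around $-43$ dB, though its sidelobes fall off more slowly with frequency than Hann's.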
Both are standard choices for general-purpose STFT analysis. The Hann window is often preferred when perfect reconstruction via overlap-add is needed (at 50% overlap).
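The overlap-add property at 50% overlap can be checked directly: shifted copies of the window should sum to a constant. The sketch below assumes the periodic form of the Hann window (cosine argument divided by the window length rather than the length minus one), which is the variant that sums exactly to one; the window length and frame count are arbitrary choices.

```python
import numpy as np

L = 256
hop = L // 2                                     # 50% overlap
n = np.arange(L)
w = 0.5 - 0.5 * np.cos(2 * np.pi * n / L)        # periodic Hann (assumed variant)

num_frames = 8
total = np.zeros(hop * (num_frames - 1) + L)
for i in range(num_frames):
    total[i * hop:i * hop + L] += w              # overlap-add the shifted windows

interior = total[hop:-hop]                       # skip edges covered by only one window
print(interior.min(), interior.max())
```

Away from the signal edges the summed windows are flat, so overlap-adding unmodified frames returns the original samples without amplitude ripple.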
Gaussian Window
The Gaussian window follows:
$$w[n] = \exp\left(-\frac{1}{2}\left(\frac{n - (L-1)/2}{\sigma}\right)^2\right)$$
where $\sigma$ controls the width. The Gaussian window is unique in that it achieves the minimum time-bandwidth product, meaning it provides the tightest joint time-frequency localization allowed by the uncertainty principle. The parameter $\sigma$ directly controls the time-frequency resolution balance.
Gaussian windows are particularly relevant in Gabor analysis and are commonly used when optimal time-frequency concentration is the priority.
Trade-offs Between Window Types
| Window | Mainlobe Width | Sidelobe Level | Best Use Case |
|---|---|---|---|
| Rectangular | Narrowest | Highest (~−13 dB) | Rarely used; only when max frequency resolution is critical and leakage is tolerable |
| Hann | Moderate | Low (~−32 dB) | General-purpose; overlap-add reconstruction |
| Hamming | Moderate | Lower (~−43 dB) | General-purpose; slightly better sidelobe suppression than Hann |
| Gaussian | Adjustable via $\sigma$ | Very low (smooth decay) | Optimal time-frequency localization; Gabor analysis |

The right choice depends on whether you care more about resolving close frequencies (favor narrower mainlobe) or suppressing leakage from strong components (favor lower sidelobes).
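The sidelobe figures quoted for these windows can be estimated numerically. The sketch below takes a heavily zero-padded FFT of each window, walks down the mainlobe to the first null, and reports the highest level beyond it relative to the mainlobe peak. The helper name `peak_sidelobe_db`, the 64-sample length, and the padding factor are assumptions; exact values shift slightly with window length.

```python
import numpy as np

def peak_sidelobe_db(window, pad_factor=64):
    """Highest sidelobe level in dB relative to the mainlobe peak."""
    nfft = pad_factor * len(window)                  # dense frequency grid
    mag = np.abs(np.fft.rfft(window, n=nfft))
    mag_db = 20 * np.log10(mag / mag[0] + 1e-12)     # normalize to the DC peak
    i = 1
    while i + 1 < len(mag_db) and mag_db[i + 1] < mag_db[i]:
        i += 1                                       # descend the mainlobe to the first null
    return mag_db[i:].max()

L = 64
windows = {
    "rectangular": np.ones(L),
    "hann": np.hanning(L),
    "hamming": np.hamming(L),
}
levels = {name: peak_sidelobe_db(w) for name, w in windows.items()}
for name, level in levels.items():
    print(f"{name}: {level:.1f} dB")
```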
Interpretation of STFT
Time-Frequency Localization
Each STFT coefficient $X[m, k]$ represents the signal's content near time $m$ and frequency bin $k$, weighted by the window function. The "near" part is critical: you're not getting the exact instantaneous frequency at an exact time instant. You're getting a local average over a time-frequency region whose size is set by the window.
This localization lets you track how individual frequency components appear, evolve, and disappear over the signal's duration.
Identification of Signal Components
Different signal features produce characteristic patterns in the spectrogram:
- Harmonics: Appear as parallel horizontal lines at the fundamental frequency and its integer multiples.
- Formants (in speech): Show up as bands of concentrated energy at resonant frequencies of the vocal tract. They shift over time as the speaker articulates different sounds.
- Transients (clicks, onsets, percussive hits): Appear as vertical lines or short vertical bursts, since they contain broadband energy concentrated in a brief time interval.
- Chirps (frequency sweeps): Appear as diagonal ridges whose slope indicates the rate of frequency change.
Visualization of the Spectrogram
Spectrograms are typically displayed on a log-frequency or linear-frequency vertical axis, with magnitude shown in decibels (dB scale) to compress the dynamic range. Color maps range from grayscale to perceptually uniform palettes like viridis.
When reading a spectrogram, bright regions indicate high energy at that time-frequency location, and dark regions indicate low energy or silence. The spectrogram is one of the most widely used tools for exploratory analysis in speech, music, bioacoustics, and fault detection.
Applications of STFT
Speech Processing and Analysis
STFT is the backbone of most speech processing pipelines. Spectral features extracted from STFT frames (such as mel-frequency cepstral coefficients, spectral envelope, and pitch contours) feed into speech recognition, speaker verification, and emotion detection systems. The spectrogram itself is often used directly as input to neural network-based speech models.
Typical STFT parameters for speech: 25 ms window, 10 ms hop size (60% overlap), Hamming or Hann window.

Audio Signal Processing
Perceptual audio codecs such as MP3 and AAC rely on a closely related windowed, overlapping transform (the MDCT) rather than the STFT itself, but the same frame-by-frame time-frequency analysis underlies their perceptual coding and compression. Source separation algorithms (isolating vocals from a mix, for example) operate in the STFT domain, modifying magnitude and phase before resynthesizing via inverse STFT. Time-stretching and pitch-shifting algorithms also rely on STFT phase vocoder techniques.
Biomedical Signal Analysis
EEG analysis uses STFT to track frequency band power (delta, theta, alpha, beta, gamma) over time, which is essential for sleep staging, seizure detection, and brain-computer interfaces. ECG and EMG analyses similarly benefit from time-frequency views to detect arrhythmias or characterize muscle activation patterns.
Time-Varying Frequency Analysis
Radar and sonar systems use STFT to detect Doppler shifts from moving targets, where the frequency shift changes over time as the target moves. Vibration analysis in mechanical systems uses spectrograms to identify bearing faults, gear mesh frequencies, and other rotating machinery signatures that evolve with operating conditions.
Limitations of STFT
Fixed Time-Frequency Resolution
Once you select a window length, the time-frequency resolution is locked for the entire analysis. A signal might contain both slowly varying tonal components (needing good frequency resolution) and sharp transients (needing good time resolution) simultaneously. STFT forces you to pick one compromise that applies everywhere, which can be suboptimal for signals with diverse time-frequency characteristics.
Uncertainty Principle
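The Heisenberg-Gabor uncertainty principle sets a hard lower bound on the joint time-frequency resolution:
$$\Delta t \cdot \Delta f \geq \frac{1}{4\pi}$$
where $\Delta t$ and $\Delta f$ are the RMS duration and RMS bandwidth of the window.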
No window function, no matter how cleverly designed, can beat this bound. The Gaussian window achieves equality (the minimum product), which is why it's considered optimal in this sense. But even at the theoretical minimum, you cannot have arbitrarily fine resolution in both domains at once.
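The bound can be checked numerically by measuring the RMS duration and RMS bandwidth of sampled windows, using the squared window and its squared spectrum as weighting densities. In the sketch below the helper name `time_bandwidth_product`, the window lengths, and the Gaussian width of 20 samples are assumptions; time is measured in samples and frequency in cycles per sample, so the product is dimensionless.

```python
import numpy as np

def time_bandwidth_product(w, nfft=4096):
    """RMS duration times RMS bandwidth of a real window."""
    w = np.asarray(w, dtype=float)
    n = np.arange(len(w))
    p = w ** 2 / np.sum(w ** 2)                  # energy density in time
    center = np.sum(n * p)
    delta_t = np.sqrt(np.sum((n - center) ** 2 * p))

    W = np.fft.fft(w, nfft)
    f = np.fft.fftfreq(nfft)                     # cycles per sample, centered on 0
    q = np.abs(W) ** 2 / np.sum(np.abs(W) ** 2)  # energy density in frequency
    delta_f = np.sqrt(np.sum(f ** 2 * q))        # a real window's spectrum has zero mean frequency
    return delta_t * delta_f

n = np.arange(1024)
gaussian = np.exp(-0.5 * ((n - 511.5) / 20.0) ** 2)   # sigma = 20 samples (assumed)
tb_gauss = time_bandwidth_product(gaussian)
tb_hann = time_bandwidth_product(np.hanning(256))
bound = 1 / (4 * np.pi)

print(tb_gauss, tb_hann, bound)
```

Under these assumptions the Gaussian lands on the bound to within numerical error, while the Hann window sits slightly above it.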
Alternatives to STFT
The fixed-resolution limitation of STFT has motivated several alternative approaches:
- Wavelet transform: Uses short windows at high frequencies and long windows at low frequencies, providing multi-resolution analysis that adapts across the frequency axis. This is often more natural for signals with both high-frequency transients and low-frequency oscillations.
- Wigner-Ville distribution: Provides very high time-frequency resolution but suffers from cross-term interference between signal components, which can make interpretation difficult for multi-component signals.
- Cohen's class distributions: A generalization that includes the Wigner-Ville distribution and allows kernel-based suppression of cross-terms, at the cost of some resolution.
Each alternative addresses specific shortcomings of STFT but introduces its own trade-offs.
Advanced Topics in STFT
Multi-Taper STFT
Standard STFT uses a single window, which means the spectral estimate at each time frame has high variance (it's based on one realization). Multi-taper STFT addresses this by computing multiple STFTs using a set of orthogonal tapers, typically Slepian sequences (discrete prolate spheroidal sequences, DPSS).
The individual spectral estimates are then averaged, which reduces variance and improves robustness to noise. The trade-off is a slight reduction in frequency resolution (the effective bandwidth widens) and increased computation (one FFT per taper per frame). Multi-taper methods are especially valuable in low-SNR scenarios where single-taper estimates are unreliable.
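The variance reduction from averaging over tapers can be demonstrated on a single frame of white noise. To keep the example NumPy-only, this sketch swaps in sine tapers, a different orthonormal taper family, in place of the Slepian sequences described above; the averaging step is the same. The helper names, the frame length, and the choice of five tapers are assumptions.

```python
import numpy as np

def sine_tapers(L, K):
    """K orthonormal sine tapers of length L, as rows of a (K, L) array."""
    n = np.arange(1, L + 1)
    k = np.arange(1, K + 1).reshape(-1, 1)
    return np.sqrt(2 / (L + 1)) * np.sin(np.pi * k * n / (L + 1))

def multitaper_psd(frame, tapers):
    """Average the squared-magnitude spectra obtained with each taper."""
    spectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2   # one FFT per taper
    return spectra.mean(axis=0)

def relative_spread(psd):
    """Standard deviation across bins divided by the mean (flat spectrum assumed)."""
    return psd.std() / psd.mean()

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)                # white noise: true spectrum is flat
tapers = sine_tapers(len(frame), K=5)

single = multitaper_psd(frame, tapers[:1])       # single-taper estimate
multi = multitaper_psd(frame, tapers)            # five-taper average

print(relative_spread(single), relative_spread(multi))
```

Because the true spectrum of white noise is flat, the spread across bins is a proxy for estimator variance, and it drops noticeably when the tapers are averaged.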
Synchrosqueezing Transform
Synchrosqueezing is a post-processing step applied to the STFT that sharpens the time-frequency representation. The idea is:
- Compute the standard STFT.
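- Estimate the instantaneous frequency at each time-frequency point $(t, \omega)$ from the phase derivative: $\hat{\omega}(t, \omega) = \frac{\partial}{\partial t} \arg X(t, \omega)$, with the STFT phase referenced to the window position (if the phase is referenced to absolute time instead, add $\omega$ to this derivative).
- Reassign the energy at $(t, \omega)$ to the frequency bin closest to $\hat{\omega}(t, \omega)$.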
The result is a much more concentrated representation where each component's energy collapses onto a narrow ridge at its instantaneous frequency, rather than being spread across the mainlobe width of the window. This is particularly useful for separating closely spaced or crossing components. Synchrosqueezing also preserves invertibility, meaning you can reconstruct individual components from the sharpened representation.
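The frequency-estimation step can be illustrated with a finite-difference version of the phase derivative: compare the phase of the same bin in two consecutive frames and convert the phase advance into a frequency. This is only that one step, not a complete synchrosqueezing implementation, and the 1010 Hz tone, 8 kHz rate, window length, and hop size are assumptions. The estimate is unambiguous only while the tone stays within half a cycle of phase per hop of the bin center.

```python
import numpy as np

fs, L, hop = 8000, 256, 64                       # assumed analysis parameters
f_true = 1010.0                                  # tone between two bin centers
x = np.sin(2 * np.pi * f_true * np.arange(fs) / fs)
window = np.hanning(L)

# Two consecutive frames; the FFT references phase to each frame's first sample.
X0 = np.fft.rfft(x[0:L] * window)
X1 = np.fft.rfft(x[hop:hop + L] * window)

k = np.abs(X0).argmax()                          # strongest bin
omega_k = 2 * np.pi * k / L                      # bin center in rad/sample

dphi = np.angle(X1[k]) - np.angle(X0[k])         # measured phase advance over one hop
# Subtract the advance expected at the bin center and wrap to (-pi, pi].
deviation = np.angle(np.exp(1j * (dphi - omega_k * hop)))
omega_hat = omega_k + deviation / hop            # instantaneous frequency estimate
f_hat = omega_hat * fs / (2 * np.pi)

print(k * fs / L, f_hat)                         # bin center vs. refined estimate in Hz
```

The bin center alone is off by the distance to the nearest grid frequency; the phase-based estimate recovers the tone's frequency far more precisely, which is what lets the energy be reassigned onto a narrow ridge.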
Reassignment Method
The reassignment method generalizes synchrosqueezing by correcting localization in both time and frequency. For each STFT coefficient, it computes:
- The instantaneous frequency (from the time derivative of phase) to correct frequency localization.
- The group delay (from the frequency derivative of phase) to correct time localization.
Each coefficient is then moved from its nominal grid position $(t, \omega)$ to its reassigned position $(\hat{t}, \hat{\omega})$. The result is a sharper spectrogram with better localization of ridges and transients. Unlike synchrosqueezing, the reassignment method is generally not invertible, so it's primarily a visualization and analysis tool rather than a basis for signal modification.
Adaptive STFT
Adaptive STFT breaks free from the fixed-window constraint by allowing the window length (and potentially shape) to vary over time. The window parameters are adjusted based on local signal characteristics:
- In regions where the signal is quasi-stationary, a longer window is used for better frequency resolution.
- In regions with rapid changes or transients, a shorter window is used for better time resolution.
The adaptation can be driven by signal-dependent criteria such as local stationarity measures, entropy minimization, or time-frequency concentration metrics. Adaptive STFT bridges the gap between standard fixed-window STFT and fully multi-resolution methods like the wavelet transform, offering a data-driven compromise.