Wavelet Transform Basics
The wavelet transform decomposes a signal into a set of basis functions called wavelets that are localized in both time and frequency. This dual localization is what makes it so effective for non-stationary signals, where frequency content changes over time. Unlike the Fourier transform, which gives you global frequency information, the wavelet transform tells you what frequencies are present and when they occur.
Wavelet Definition and Properties
Wavelets are oscillatory, finite-duration functions with zero mean and varying frequency content. Three properties govern how well a wavelet performs for a given task:
- Admissibility condition: The wavelet must satisfy $C_\psi = \int_{-\infty}^{\infty} \frac{|\hat{\psi}(\omega)|^2}{|\omega|}\,d\omega < \infty$, which guarantees that the transform is invertible and the original signal can be perfectly reconstructed.
- Regularity: Smoother wavelets produce better frequency localization and sparser signal representations. Regularity is tied to the number of continuous derivatives the wavelet possesses.
- Vanishing moments: A wavelet with $N$ vanishing moments satisfies $\int_{-\infty}^{\infty} t^k\,\psi(t)\,dt = 0$ for $k = 0, 1, \dots, N-1$. More vanishing moments mean the wavelet is "blind" to polynomial trends of degree up to $N-1$, which improves compression and denoising performance.
Common wavelet choices include Haar, Daubechies, Symlets, and Coiflets, each offering different trade-offs among these properties.
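To make vanishing moments concrete, here is a small numpy check using the db2 (D4) filter pair, whose two vanishing moments imply that the high-pass branch annihilates linear trends (the filter values below are the standard Daubechies coefficients):

```python
import numpy as np

# Daubechies db2 (D4) low-pass filter; the high-pass filter is its
# quadrature mirror: g[n] = (-1)^n * h[3 - n].
s3 = np.sqrt(3.0)
h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * np.sqrt(2.0))
g = np.array([h[3], -h[2], h[1], -h[0]])

# Two vanishing moments: sum g[n] = 0 and sum n * g[n] = 0.
n = np.arange(4)
print(np.sum(g), np.sum(n * g))          # both ~0 (floating-point noise)

# Consequence: the high-pass branch annihilates degree-1 polynomials.
ramp = np.arange(64, dtype=float)        # x[n] = n, a linear trend
detail = np.convolve(ramp, g)[3:-3]      # interior (full-overlap) samples
print(np.max(np.abs(detail)))            # ~0: the trend is invisible to g
```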
Continuous vs. Discrete Wavelets
- Continuous wavelets are defined over a continuous range of scales $a$ and translations $b$, producing a highly redundant representation. This redundancy is useful for detailed signal analysis and feature extraction, but it comes at a computational cost.
- Discrete wavelets sample the scale and translation parameters on a dyadic grid ($a = 2^j$, $b = k\,2^j$), forming an orthonormal basis for $L^2(\mathbb{R})$. This eliminates redundancy, enables efficient computation via filter banks, and guarantees perfect reconstruction. Discrete wavelets are the standard choice for compression and denoising.
Wavelet Families and Types
Wavelet families group wavelets that share structural properties such as support size, symmetry, and vanishing moments.
- Daubechies (dbN): Orthogonal wavelets with compact support and the maximum number of vanishing moments for a given support length. They are asymmetric (nonlinear phase), which can matter in applications sensitive to phase distortion.
- Biorthogonal: Use separate analysis and synthesis wavelet/scaling function pairs that form biorthogonal bases. This allows symmetric wavelets (useful for avoiding phase distortion) while still achieving perfect reconstruction.
- Gaussian wavelets (e.g., Mexican hat): Derived from derivatives of the Gaussian function, which minimizes the Heisenberg time-frequency uncertainty; the derivatives inherit excellent localization but lack compact support, so they are used primarily in the CWT rather than the DWT.
Wavelets can be further classified by orthogonality (orthogonal vs. biorthogonal), symmetry (symmetric vs. asymmetric), and regularity.
Continuous Wavelet Transform (CWT)
The CWT provides a continuous-time, multi-resolution representation of a signal by correlating it with scaled and shifted copies of a mother wavelet. It's the natural tool for exploratory time-frequency analysis of non-stationary signals.
CWT Definition and Formula
The CWT of a signal $x(t)$ with respect to a mother wavelet $\psi(t)$ is:

$$W(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\,\psi^{*}\!\left(\frac{t - b}{a}\right) dt$$
- $a > 0$ is the scale parameter. It stretches or compresses the wavelet. Smaller $a$ compresses the wavelet (capturing higher frequencies); larger $a$ stretches it (capturing lower frequencies). Scale is inversely related to frequency: $f = f_c / a$, where $f_c$ is the wavelet's center frequency.
- $b$ is the translation parameter, sliding the wavelet along the time axis.
- $\psi^{*}$ is the complex conjugate of the mother wavelet.
- The factor $1/\sqrt{a}$ normalizes the wavelet's energy across scales.
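As an illustrative sketch (not a production implementation), the CWT integral can be discretized directly with numpy; the Mexican hat wavelet, the truncation width, and the scale range below are arbitrary choices:

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat (Ricker) wavelet: negated 2nd derivative of a Gaussian."""
    return (1.0 - t**2) * np.exp(-t**2 / 2.0)

def cwt(x, scales):
    """Naive discretized CWT: correlate x with scaled, shifted wavelets."""
    out = np.empty((len(scales), len(x)))
    for i, a in enumerate(scales):
        m = int(np.ceil(4 * a))                  # truncate the wavelet support
        tg = np.arange(-m, m + 1)
        psi = mexican_hat(tg / a) / np.sqrt(a)   # 1/sqrt(a) normalization
        out[i] = np.convolve(x, psi[::-1], mode="same")  # correlation
    return out

# Two tones: 0.01 cycles/sample in the first half, 0.08 in the second.
n = np.arange(512)
x = np.where(n < 256, np.sin(2 * np.pi * 0.01 * n), np.sin(2 * np.pi * 0.08 * n))
W = cwt(x, np.arange(1, 33))
print(W.shape)                                   # (32, 512)
# Fine scales light up where the high-frequency tone lives (second half).
```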
CWT Scalogram and Interpretation
The scalogram displays $|W(a, b)|^2$ (the squared magnitude of the CWT coefficients) as a 2D map with translation $b$ on the horizontal axis and scale $a$ (or equivalent frequency) on the vertical axis.
Reading a scalogram:
- Bright regions indicate high energy at that particular scale and time location.
- Vertical structures point to transient events (e.g., impulses, edges) that are well-localized in time.
- Horizontal structures indicate sustained oscillatory components at a particular frequency.
- At fine scales (high frequency), you get good time resolution but poor frequency resolution. At coarse scales (low frequency), the reverse holds. This is the Heisenberg trade-off inherent to the CWT.
CWT vs. Fourier Transform
| Property | Fourier Transform | CWT |
|---|---|---|
| Representation | Global frequency only | Local time-frequency |
| Stationarity assumption | Assumes stationarity | No stationarity assumption |
| Time resolution | None (infinite window) | Varies with scale |
| Frequency resolution | Uniform | Varies with scale |
| Best suited for | Stationary signals, spectral analysis | Non-stationary signals, transient detection |
The CWT's variable resolution is its defining advantage: narrow windows at high frequencies give precise time localization of fast events, while wide windows at low frequencies give precise frequency localization of slow oscillations.
Discrete Wavelet Transform (DWT)
The DWT provides a non-redundant, computationally efficient decomposition of a signal into approximation and detail coefficients at discrete dyadic scales. It's the workhorse behind practical applications like compression, denoising, and feature extraction.

DWT Definition and Formula
The DWT is implemented through an iterated filter bank. At each decomposition level, the signal passes through a low-pass filter $h[n]$ and a high-pass filter $g[n]$, followed by downsampling by 2.
The decomposition formulas for level $j$ are:

$$a_j[k] = \sum_n h[n - 2k]\,a_{j-1}[n], \qquad d_j[k] = \sum_n g[n - 2k]\,a_{j-1}[n]$$

where $a_0[n] = x[n]$ is the original signal, $a_j[k]$ are the approximation coefficients (low-frequency content), and $d_j[k]$ are the detail coefficients (high-frequency content) at level $j$.
Reconstruction recovers the signal by upsampling and filtering with synthesis filters $\tilde{h}[n]$ and $\tilde{g}[n]$:

$$a_{j-1}[n] = \sum_k \tilde{h}[n - 2k]\,a_j[k] + \sum_k \tilde{g}[n - 2k]\,d_j[k]$$

iterated from $j = J$ down to $j = 1$, where $J$ is the total number of decomposition levels. Perfect reconstruction requires that the analysis and synthesis filters satisfy specific algebraic conditions (orthogonality or biorthogonality).
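The filter-bank equations can be sketched for the Haar pair, where filtering plus downsampling reduces to normalized sums and differences of adjacent samples (a minimal hand-rolled example; in practice a library such as PyWavelets provides general routines):

```python
import numpy as np

# One level of the DWT filter bank with the Haar filters.
def dwt_level(x):
    """Filter with h (low-pass) and g (high-pass), downsample by 2.

    Assumes an even-length input.
    """
    x = np.asarray(x, dtype=float)
    a = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation coefficients
    d = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return a, d

def idwt_level(a, d):
    """Upsample, filter with the synthesis pair, and sum."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
a1, d1 = dwt_level(x)
print(np.allclose(idwt_level(a1, d1), x))    # True: perfect reconstruction
```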
Multiresolution Analysis with DWT
Multiresolution analysis (MRA) is the mathematical framework underpinning the DWT. It constructs a nested sequence of approximation subspaces $\{V_j\}_{j \in \mathbb{Z}}$ such that:

$$\cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset \cdots, \qquad \overline{\bigcup_j V_j} = L^2(\mathbb{R}), \qquad \bigcap_j V_j = \{0\}$$
Two functions define the MRA:
- Scaling function $\phi(t)$: Its dilated, translated copies span the approximation subspace $V_j$ at scale $j$. It acts as a low-pass function capturing coarse structure.
- Wavelet function $\psi(t)$: Its dilated, translated copies span the detail (or "wavelet") subspace $W_{j+1}$, which is the orthogonal complement of $V_{j+1}$ in $V_j$. It captures the fine-scale information lost when moving from one resolution to the next.
The key MRA property is that $V_j = V_{j+1} \oplus W_{j+1}$. The DWT implements this decomposition iteratively: at each level, it splits the current approximation space into a coarser approximation plus detail, producing a tree-like structure of coefficients.
DWT Decomposition and Reconstruction
Decomposition (analysis) proceeds level by level:
- Start with the original signal $x[n]$ as the level-0 approximation $a_0$.
- Filter $a_0$ with $h[n]$ (low-pass) and downsample by 2 to get approximation coefficients $a_1$.
- Filter $a_0$ with $g[n]$ (high-pass) and downsample by 2 to get detail coefficients $d_1$.
- Repeat steps 2-3 on $a_1$ to get $a_2$ and $d_2$, and so on up to level $J$.
The result is one set of approximation coefficients $a_J$ and $J$ sets of detail coefficients $d_1, \dots, d_J$.
Reconstruction (synthesis) reverses the process:
- Start from the coarsest level $J$.
- Upsample $a_J$ and $d_J$ by 2, filter with synthesis filters $\tilde{h}[n]$ and $\tilde{g}[n]$, and sum to recover $a_{J-1}$.
- Repeat, combining each recovered approximation with the next finer detail level, until you reach the original signal.
Perfect reconstruction holds when the filter bank satisfies the alias cancellation and no-distortion conditions.
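The level-by-level procedure above can be sketched with the Haar filters (an illustrative hand-rolled version, assuming the signal length is divisible by $2^J$):

```python
import numpy as np

def haar_step(x):
    """One analysis level: low-pass/high-pass filter + downsample by 2."""
    return (x[0::2] + x[1::2]) / np.sqrt(2.0), (x[0::2] - x[1::2]) / np.sqrt(2.0)

def haar_inv_step(a, d):
    """One synthesis level: upsample, filter with the synthesis pair, sum."""
    x = np.empty(2 * len(a))
    x[0::2] = (a + d) / np.sqrt(2.0)
    x[1::2] = (a - d) / np.sqrt(2.0)
    return x

def wavedec(x, levels):
    """Iterate the analysis step on the approximation branch only."""
    a, details = np.asarray(x, dtype=float), []
    for _ in range(levels):
        a, d = haar_step(a)
        details.append(d)
    return a, details                 # a_J and [d_1, ..., d_J]

def waverec(a, details):
    """Reverse the process, starting from the coarsest level."""
    for d in reversed(details):
        a = haar_inv_step(a, d)
    return a

x = np.random.default_rng(0).standard_normal(64)
aJ, ds = wavedec(x, levels=3)
print(len(aJ), [len(d) for d in ds])      # 8 [32, 16, 8]
print(np.allclose(waverec(aJ, ds), x))    # True: perfect reconstruction
```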
Wavelet Filter Banks and Coefficients
The filter bank is the computational engine of the DWT. Its properties directly determine the quality of the decomposition:
- Orthogonality or biorthogonality: Ensures perfect reconstruction and energy preservation (Parseval's relation holds for orthogonal filter banks).
- Finite impulse response (FIR): All practical wavelet filters are FIR, guaranteeing stability and (for symmetric filters) linear phase.
- Vanishing moments: The number of vanishing moments of the wavelet equals the number of zeros the high-pass filter has at $z = 1$ (zero frequency). More zeros mean the filter better suppresses polynomial trends, yielding sparser detail coefficients.
The filter length determines the trade-off between time and frequency localization. Shorter filters (e.g., Haar with length 2) give sharp time localization but poor frequency selectivity. Longer filters (e.g., db10 with length 20) give smoother frequency partitioning at the cost of time smearing.
Wavelet Packet Transform (WPT)
The WPT generalizes the DWT by decomposing both approximation and detail coefficients at every level, rather than only the approximation branch. This produces a full binary tree of subbands, giving you much finer control over the time-frequency tiling.
WPT vs. DWT
In the standard DWT, only the low-frequency (approximation) branch is further decomposed at each level. This creates a logarithmic frequency partition: fine resolution at low frequencies, coarse resolution at high frequencies. That's great when most signal energy sits at low frequencies, but it's limiting otherwise.
The WPT decomposes every node, producing a complete binary tree. You can then choose which nodes to keep, effectively selecting an arbitrary tiling of the time-frequency plane. This makes the WPT far more adaptive:
- For signals with significant high-frequency structure, the WPT can provide finer frequency resolution in those bands.
- For classification tasks, you can select the decomposition that best separates signal classes.
The trade-off is increased computation and the need for a basis selection step.
WPT Decomposition Tree
The WPT decomposition tree is a complete binary tree:
- The root node is the original signal.
- At each level, every node splits into two children via low-pass and high-pass filtering plus downsampling.
- Leaf nodes at depth $L$ represent the finest frequency partition, with $2^L$ equal-width subbands.
- Each node is indexed by $(j, p)$, where $j$ is the depth (scale) and $p$ is the frequency band index.
The full tree is highly redundant (it contains far more coefficients than the original signal). In practice, you prune the tree to select a non-redundant basis, which brings us to best basis selection.
Best Basis Selection in WPT
Best basis selection finds the optimal subtree (i.e., the set of non-overlapping nodes that covers the entire frequency axis) that minimizes a given cost function. This is what makes the WPT adaptive.
Common cost functions:
- Shannon entropy: $E(c) = -\sum_i |c_i|^2 \log |c_i|^2$. Minimizing entropy yields the sparsest representation.
- Log-energy entropy: $E(c) = \sum_i \log(c_i^2)$. Concentrates energy into fewer coefficients, useful for compression.
- Discriminant measures: Maximize class separability for pattern recognition tasks.
The Coifman-Wickerhauser algorithm solves this efficiently via dynamic programming:
- Compute the cost for every node in the full tree.
- Starting from the leaves, compare each parent's cost to the sum of its children's costs.
- If the parent's cost is lower (or equal), prune the children and keep the parent. Otherwise, keep the children.
- The surviving leaf nodes of the pruned tree form the best basis.
This runs in $O(N \log N)$ time, making it practical for real-time applications.
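A compact sketch of the full algorithm — tree construction plus bottom-up pruning — using Haar splits and the additive entropy cost (all names and the test signal are illustrative):

```python
import numpy as np

def haar_split(x):
    """Split a node into low-pass and high-pass children (Haar filters)."""
    return (x[0::2] + x[1::2]) / np.sqrt(2.0), (x[0::2] - x[1::2]) / np.sqrt(2.0)

def cost(c, eps=1e-12):
    """Additive Shannon-type entropy cost from Coifman-Wickerhauser."""
    p = c**2
    return float(-np.sum(p * np.log(p + eps)))

def best_basis(x, depth):
    """Build the full WPT tree, then prune bottom-up; returns (j, p) nodes."""
    nodes = {(0, 0): np.asarray(x, dtype=float)}
    for j in range(depth):                       # full binary tree of subbands
        for p in range(2 ** j):
            lo, hi = haar_split(nodes[(j, p)])
            nodes[(j + 1, 2 * p)] = lo
            nodes[(j + 1, 2 * p + 1)] = hi
    # Dynamic programming: keep a parent if it's no costlier than the
    # combined best bases of its two children.
    best_cost = {n: cost(c) for n, c in nodes.items()}
    best_set = {n: [n] for n in nodes}
    for j in range(depth - 1, -1, -1):
        for p in range(2 ** j):
            kids = best_cost[(j + 1, 2 * p)] + best_cost[(j + 1, 2 * p + 1)]
            if kids < best_cost[(j, p)]:
                best_cost[(j, p)] = kids
                best_set[(j, p)] = best_set[(j + 1, 2 * p)] + best_set[(j + 1, 2 * p + 1)]
    return best_set[(0, 0)]

# A high-frequency tone: the best basis may descend into the high-pass
# branch, which the plain DWT would never split.
t = np.arange(64)
basis = best_basis(np.sin(np.pi * 0.9 * t), depth=3)
print(basis)    # selected (level, band) nodes; they tile the frequency axis
```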

Wavelet Thresholding and Denoising
Wavelet denoising rests on a simple but powerful observation: when you transform a noisy signal into the wavelet domain, the signal energy concentrates in a few large coefficients, while additive white noise spreads roughly uniformly across all coefficients. By zeroing out the small coefficients (which are mostly noise) and keeping the large ones (which are mostly signal), you can recover a clean estimate of the original signal.
Soft vs. Hard Thresholding
Given a threshold $\lambda$:
- Hard thresholding keeps coefficients above the threshold unchanged and zeros out the rest: $\eta_H(w) = w$ if $|w| > \lambda$, else $\eta_H(w) = 0$.
This preserves coefficient magnitudes but creates discontinuities at $\pm\lambda$, which can introduce Gibbs-like artifacts in the reconstructed signal.
- Soft thresholding shrinks all coefficients toward zero by $\lambda$: $\eta_S(w) = \operatorname{sign}(w)\,\max(|w| - \lambda,\ 0)$.
The shrinkage produces a continuous mapping, yielding smoother reconstructions. Soft thresholding also has better theoretical properties: Donoho and Johnstone showed it is near-minimax optimal for a broad class of function spaces.
In practice, soft thresholding is the default choice unless you have a specific reason to preserve exact coefficient magnitudes.
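Both rules are one-liners in numpy:

```python
import numpy as np

def hard_threshold(w, lam):
    """Keep coefficients with |w| > lam unchanged, zero the rest."""
    return np.where(np.abs(w) > lam, w, 0.0)

def soft_threshold(w, lam):
    """Shrink every coefficient toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([-3.0, -1.2, -0.4, 0.3, 0.9, 2.5])
print(hard_threshold(w, 1.0))   # [-3, -1.2, 0, 0, 0, 2.5]: survivors unchanged
print(soft_threshold(w, 1.0))   # [-2, -0.2, 0, 0, 0, 1.5]: survivors shrunk
```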
VisuShrink and SureShrink Methods
These are the two classical approaches to choosing the threshold :
VisuShrink (Donoho & Johnstone, 1994) uses a universal threshold:

$$\lambda = \sigma \sqrt{2 \ln N}$$

where $\sigma$ is the noise standard deviation (often estimated from the finest-scale detail coefficients using the MAD estimator: $\hat{\sigma} = \operatorname{median}(|d_1|)/0.6745$) and $N$ is the signal length. This threshold is conservative: it's chosen so that, with high probability, all pure-noise coefficients fall below it. The downside is over-smoothing, since some signal coefficients also get killed.
SureShrink (Donoho & Johnstone, 1995) adapts the threshold to each subband by minimizing Stein's Unbiased Risk Estimate (SURE), which provides an unbiased estimate of the MSE without knowing the true signal. For each wavelet subband:
- Compute the SURE criterion as a function of .
- Choose the that minimizes SURE.
- If the subband's energy is very low (suggesting it's mostly noise), fall back to the universal VisuShrink threshold.
SureShrink generally outperforms VisuShrink because it adapts to the local signal-to-noise ratio in each subband.
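A minimal end-to-end VisuShrink sketch (hand-rolled Haar DWT, MAD noise estimate, soft thresholding; parameters like the level count are arbitrary choices):

```python
import numpy as np

def haar_dec(x, levels):
    """Multi-level Haar analysis: iterate sum/difference + downsampling."""
    a, ds = np.asarray(x, float), []
    for _ in range(levels):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        ds.append(d)
    return a, ds

def haar_rec(a, ds):
    """Multi-level Haar synthesis, from coarsest to finest."""
    for d in reversed(ds):
        x = np.empty(2 * len(a))
        x[0::2], x[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        a = x
    return a

def visushrink(x, levels=4):
    a, ds = haar_dec(x, levels)
    sigma = np.median(np.abs(ds[0])) / 0.6745       # MAD estimate from d_1
    lam = sigma * np.sqrt(2 * np.log(len(x)))       # universal threshold
    ds = [np.sign(d) * np.maximum(np.abs(d) - lam, 0) for d in ds]  # soft
    return haar_rec(a, ds)

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 1024)
clean = np.sin(2 * np.pi * 4 * t)
noisy = clean + 0.3 * rng.standard_normal(1024)
denoised = visushrink(noisy)
print(np.mean((noisy - clean) ** 2) > np.mean((denoised - clean) ** 2))  # True
```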
Wavelet-Based Noise Reduction Applications
Wavelet denoising is widely used across domains:
- Image denoising: Removing Gaussian, Poisson, or speckle noise while preserving edges and textures. Wavelet methods handle edge preservation well because edges produce large detail coefficients that survive thresholding.
- Audio and speech enhancement: Suppressing background noise, hum, or recording artifacts from speech signals.
- Biomedical signals: Cleaning ECG, EEG, and fMRI data to improve diagnostic accuracy. For example, wavelet denoising of ECG signals can remove baseline wander and muscle artifact while preserving the QRS complex morphology.
- Seismic data processing: Enhancing signal-to-noise ratio in seismic traces for better subsurface imaging.
- Financial time series: Separating trend and cyclical components from market noise for improved forecasting.
The common thread is that wavelet denoising works well whenever the signal of interest has a sparse wavelet representation and the noise does not.
Wavelet-Based Signal Compression
Wavelet compression exploits the same sparsity that makes denoising work. After transforming a signal into the wavelet domain, most of the energy is packed into a small number of large coefficients. You can discard or coarsely quantize the remaining small coefficients with minimal perceptual or reconstruction error, achieving high compression ratios.
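A toy illustration of this keep-the-largest-coefficients idea, using a hand-rolled Haar DWT (the 10% retention ratio and the test signal are arbitrary choices):

```python
import numpy as np

def haar_dec(x, levels):
    """Multi-level Haar DWT (same filter-bank sketch as earlier sections)."""
    a, ds = np.asarray(x, float), []
    for _ in range(levels):
        a, d = (a[0::2] + a[1::2]) / np.sqrt(2), (a[0::2] - a[1::2]) / np.sqrt(2)
        ds.append(d)
    return a, ds

def haar_rec(a, ds):
    for d in reversed(ds):
        x = np.empty(2 * len(a))
        x[0::2], x[1::2] = (a + d) / np.sqrt(2), (a - d) / np.sqrt(2)
        a = x
    return a

# Piecewise-smooth signal: most wavelet energy lands in few coefficients.
t = np.linspace(0, 1, 1024)
x = np.where(t < 0.5, np.sin(2 * np.pi * 3 * t), 0.5 + 0.1 * t)

a, ds = haar_dec(x, levels=5)
coeffs = np.concatenate([a] + ds[::-1])
keep = int(0.10 * len(coeffs))                  # keep the largest 10%
cutoff = np.sort(np.abs(coeffs))[-keep]
coeffs[np.abs(coeffs) < cutoff] = 0.0           # discard the small ones

# Split the coefficient vector back apart and reconstruct.
sizes = [len(a)] + [len(d) for d in ds[::-1]]
parts = np.split(coeffs, np.cumsum(sizes)[:-1])
x_hat = haar_rec(parts[0], parts[1:][::-1])
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(rel_err)   # small: most signal energy survives in 10% of coefficients
```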
Wavelet Transform in JPEG2000
JPEG2000 replaced the DCT-based JPEG with a DWT-based pipeline, and the improvement is substantial. The compression pipeline works as follows:
- Wavelet decomposition: The image undergoes a dyadic DWT, typically 5-6 levels deep. JPEG2000 uses the CDF 9/7 biorthogonal wavelet for lossy compression and the CDF 5/3 (Le Gall) wavelet for lossless compression.
- Quantization: Wavelet coefficients are scalar-quantized. For lossy compression, a uniform dead-zone quantizer reduces bit depth. For lossless mode, this step is skipped.
- Entropy coding: Quantized coefficients are encoded using EBCOT (Embedded Block Coding with Optimized Truncation), which applies context-adaptive binary arithmetic coding to each code-block independently.
- Rate-distortion optimization: The bitstream is organized so it can be truncated at any point to yield the best possible quality at that rate.
JPEG2000's advantages over JPEG include better compression at low bitrates (no blocking artifacts), quality and resolution scalability, and region-of-interest coding.
Embedded Zerotree Wavelet (EZW) Coding
EZW (Shapiro, 1993) was one of the first algorithms to exploit the cross-scale structure of wavelet coefficients for image compression.
The core insight is that if a wavelet coefficient at a coarse scale is insignificant (below threshold), its descendants at finer scales are very likely also insignificant. This parent-child relationship across scales forms a zerotree, which can be encoded with a single symbol instead of coding each zero individually.
EZW encoding steps:
- Compute the DWT of the image.
- Set the initial threshold $T_0 = 2^{\lfloor \log_2(\max |c|) \rfloor}$, where $c$ ranges over the wavelet coefficients.
- Dominant pass: Scan coefficients in a predefined order (typically low-frequency to high-frequency). For each coefficient, encode one of four symbols: positive significant, negative significant, isolated zero, or zerotree root.
- Subordinate pass: Refine previously significant coefficients by encoding the next most significant bit.
- Halve the threshold and repeat from step 3 until the target bitrate is reached.
The bitstream is naturally embedded: you can stop encoding at any point and still have a valid, progressively refined reconstruction.
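A simplified sketch of the dominant-pass symbol assignment in 1D. Note that real EZW tests the entire descendant tree for the zerotree decision, while this toy version checks only the immediate children:

```python
def classify(coarse, fine, T):
    """Assign dominant-pass symbols for one parent level (1D toy version).

    coarse[k]'s children are fine[2k] and fine[2k+1]; a full EZW coder
    would test every descendant, not just the immediate children.
    """
    symbols = []
    for k, c in enumerate(coarse):
        if c >= T:
            symbols.append("POS")     # positive significant
        elif c <= -T:
            symbols.append("NEG")     # negative significant
        elif max(abs(fine[2 * k]), abs(fine[2 * k + 1])) < T:
            symbols.append("ZTR")     # zerotree root: subtree insignificant
        else:
            symbols.append("IZ")      # isolated zero: a descendant is big
    return symbols

coarse = [9.0, -5.0, 0.5, 1.0]
fine = [1.0, 0.2, 0.1, -0.3, 0.2, 0.1, 6.0, 0.0]
print(classify(coarse, fine, T=4.0))  # ['POS', 'NEG', 'ZTR', 'IZ']
```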
Set Partitioning in Hierarchical Trees (SPIHT)
SPIHT (Said & Pearlman, 1996) builds on EZW's ideas but uses a more efficient set partitioning strategy that avoids explicit zerotree symbols.
SPIHT maintains three lists:
- LIS (List of Insignificant Sets): Sets of coefficients grouped by spatial orientation trees that haven't yet been found significant.
- LIP (List of Insignificant Pixels): Individual coefficients not yet significant.
- LSP (List of Significant Pixels): Coefficients already identified as significant.
The algorithm proceeds bitplane by bitplane, from the most significant bit to the least:
- Sorting pass: Test each entry in LIP and LIS against the current threshold. When a set in LIS becomes significant, partition it into smaller subsets or individual coefficients. Move newly significant coefficients to LSP.
- Refinement pass: Output the next bit of each coefficient already in LSP.
- Halve the threshold and repeat.
SPIHT consistently outperforms EZW at the same bitrate and produces an embedded bitstream with excellent rate-distortion performance. It requires no explicit entropy coding (the output is already near-optimal), though adding arithmetic coding can squeeze out a few more tenths of a dB.