📡Advanced Signal Processing Unit 8 Review

8.2 Basis pursuit and L1-norm minimization

Written by the Fiveable Content Team • Last updated August 2025

Basis pursuit definition

Basis pursuit solves a fundamental problem in sparse signal processing: given an underdetermined system Ax = b, where A is an m \times n matrix with m < n, how do you pick the "best" solution from the infinitely many that exist? Basis pursuit answers this by choosing the solution with the smallest L1-norm, which turns out to strongly favor sparse solutions.

Formally, the optimization problem is:

\min \|x\|_1 \quad \text{subject to} \quad Ax = b

where \|x\|_1 = \sum_{i=1}^{n} |x_i|.

This formulation is convex, which means you can solve it efficiently with standard optimization tools. That's a huge deal, because the "true" sparsity objective (minimizing the number of nonzero entries, i.e., the L0-norm) is NP-hard.
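A quick way to see why the L0 objective resists brute force: an exhaustive search must consider every candidate support set, of which there are \binom{n}{s} for s nonzeros out of n coefficients. A small illustration (s = 10 chosen arbitrarily):

```python
from math import comb

# Number of candidate supports an exhaustive L0 search would have
# to examine for s = 10 nonzeros, as the dimension n grows.
for n in (20, 100, 1000):
    print(n, comb(n, 10))
```

Even at n = 100 the count is in the tens of trillions, which is why the convex L1 relaxation matters.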

Sparse signal representation

A signal is sparse if it can be expressed as a linear combination of just a few basis functions chosen from a larger set called a dictionary. For example, a natural image might have millions of pixels but can be well-represented by only a few hundred wavelet coefficients.

This structure is what makes basis pursuit useful. If you know the signal is sparse in some dictionary, you can exploit that prior to recover it from far fewer measurements than the ambient dimension would suggest. Applications span audio compression, image processing, video coding, and biomedical signal analysis.

Underdetermined linear systems

When A has more columns than rows (m < n), the system Ax = b is underdetermined and has infinitely many solutions. Without additional constraints, there's no principled way to pick one.

Basis pursuit resolves this ambiguity by imposing a sparsity-promoting objective. The L1-norm acts as a convex relaxation of the L0-norm (which directly counts nonzeros), steering the solution toward vectors with few nonzero entries. This is a form of regularization: you're encoding your prior belief that the true signal is sparse.

L1-norm minimization problem

The L1-norm of a vector x \in \mathbb{R}^n is defined as:

\|x\|_1 = \sum_{i=1}^{n} |x_i|

Minimizing this subject to Ax = b has two key properties:

  • Sparsity promotion: The L1-norm penalty drives many coefficients exactly to zero, not just close to zero.
  • Convexity: The L1-norm is a convex function, and the constraint set \{x : Ax = b\} is an affine subspace. This makes the entire problem a convex program, solvable in polynomial time.

In practice, this problem can be reformulated as a linear program (LP), which means you can apply well-established LP solvers directly.

L1-norm vs L2-norm minimization

The L2-norm (least-squares) solution to Ax = b is the minimum-energy solution x^* = A^T(AA^T)^{-1}b. It's easy to compute but spreads energy across all coefficients, producing dense solutions. The L1-norm solution, by contrast, concentrates energy into a few large coefficients and zeros out the rest.

This distinction is the core reason L1 minimization is preferred for sparse recovery.
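This contrast is easy to reproduce numerically. A small NumPy sketch (dimensions and seed chosen arbitrarily) computes the minimum-L2-norm solution for a system whose true generator is 3-sparse:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 30
A = rng.standard_normal((m, n))

# Ground-truth sparse signal: 3 nonzeros out of 30 coefficients.
x_true = np.zeros(n)
x_true[[2, 11, 25]] = [1.5, -2.0, 1.0]
b = A @ x_true

# Minimum-L2-norm solution: x* = A^T (A A^T)^{-1} b.
x_l2 = A.T @ np.linalg.solve(A @ A.T, b)

print(np.count_nonzero(np.abs(x_true) > 1e-8))  # 3
print(np.count_nonzero(np.abs(x_l2) > 1e-8))    # generically all 30
print(np.allclose(A @ x_l2, b))                 # still satisfies Ax = b
```

The least-squares solution satisfies the constraints exactly but is fully dense, even though a 3-sparse solution exists.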

Sparsity-inducing property of L1-norm

The L1-norm's sparsity-inducing behavior comes from the shape of its penalty. The absolute value function |x_i| has a sharp corner at zero, which creates a strong "pull" toward exactly zero for small coefficients. The L2-norm penalty x_i^2 is smooth at zero and only shrinks coefficients toward zero without actually reaching it.

Think of it this way: with L2 regularization, you pay a quadratic cost for large coefficients, so the optimizer spreads the load. With L1, the cost is linear, so there's no penalty reduction from splitting one large coefficient into two smaller ones. The optimizer prefers to keep a few large entries and zero out the rest.
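The splitting argument can be checked directly: replace one coefficient of size 2 with two of size 1 and compare the penalties.

```python
# One coefficient of size 2 vs. the same energy split into two of size 1.
one_big = [2.0, 0.0]
two_small = [1.0, 1.0]

l1 = lambda v: sum(abs(t) for t in v)
l2sq = lambda v: sum(t * t for t in v)

print(l1(one_big), l1(two_small))      # 2.0 vs 2.0: L1 is indifferent to splitting
print(l2sq(one_big), l2sq(two_small))  # 4.0 vs 2.0: L2 strictly prefers to split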

Geometric interpretation of L1 and L2-norms

The geometric picture makes the sparsity property visually clear.

  • The L1-norm ball (the set of points where \|x\|_1 \leq t) is a diamond (or cross-polytope) with vertices on the coordinate axes.
  • The L2-norm ball (\|x\|_2 \leq t) is a sphere, rotationally symmetric with no preferred direction.

When you minimize a norm subject to the affine constraint Ax = b, you're "inflating" the norm ball until it first touches the constraint set. The diamond shape of the L1 ball means the first contact point is very likely to be at a vertex, which lies on a coordinate axis, meaning most coordinates are zero. The sphere has no such bias, so the contact point is generically dense.

Unique sparse solutions

Under the right conditions, L1 minimization recovers the unique sparsest solution. The two main sufficient conditions are:

  • Restricted Isometry Property (RIP): If A satisfies the RIP of order 2s with a sufficiently small constant (\delta_{2s} < \sqrt{2} - 1), then any s-sparse signal is exactly recovered by basis pursuit.
  • Nullspace property: If every vector in the nullspace of A is "spread out" (not concentrated on a small support), then L1 minimization yields the unique sparse solution.

L2 minimization provides no such guarantee. The minimum-norm least-squares solution is unique but almost never sparse.

Basis pursuit applications

Many real-world signals are sparse or compressible in an appropriate basis. Basis pursuit leverages this structure for efficient acquisition, compression, denoising, and feature extraction.

Compressed sensing

Compressed sensing (CS) is the flagship application. The key insight: if a signal is s-sparse in some basis, you can recover it from only m = O(s \log(n/s)) random linear measurements, far fewer than the n samples traditional Nyquist-rate sampling would require.

Basis pursuit is the standard recovery algorithm in CS. Practical applications include:

  • MRI acceleration: Acquiring fewer k-space samples and reconstructing via L1 minimization in a wavelet or total-variation basis, reducing scan times significantly.
  • Radar imaging: Recovering sparse target scenes from compressed measurements.
  • Sensor networks: Reducing communication overhead by having each sensor transmit compressed measurements.

Signal denoising

For denoising, you observe b = Ax + e where e is noise. The idea is to represent the signal in a sparsifying dictionary (wavelets, curvelets, etc.) and solve a relaxed version of basis pursuit that tolerates some residual error (see BPDN below).

The sparsity prior suppresses noise because noise is typically not sparse in any structured dictionary. By enforcing sparsity, you keep the signal components and discard the noise. This approach works well for audio, images, and biomedical signals like EEG and ECG.

Feature selection

In machine learning, you can treat feature selection as a sparse recovery problem. Given a feature matrix A and target vector b, solving the L1-minimization problem identifies which features (columns of A) are most relevant for predicting b.

This is closely related to LASSO regression (\min \|b - Ax\|_2^2 + \lambda\|x\|_1). Applications include gene expression analysis (selecting relevant genes from thousands of candidates), text classification, and sensor placement optimization.

Solving basis pursuit problems

Since basis pursuit is a convex program, several families of efficient algorithms can solve it. The right choice depends on problem scale, required accuracy, and available computational resources.

Convex optimization

Basis pursuit can be cast as a linear program by introducing auxiliary variables:

\min \mathbf{1}^T u \quad \text{subject to} \quad Ax = b, \; -u \leq x \leq u

This means any LP solver works. For moderate problem sizes, interior-point methods are the gold standard: they converge in O(\sqrt{n}) iterations, each requiring the solution of a linear system. They're accurate but memory-intensive for very large problems.
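The LP reformulation above can be sketched with SciPy's generic linprog solver (assuming SciPy is available; dimensions, seed, and the sparse generator are arbitrary). Since exact recovery depends on the instance, the sketch only checks feasibility and that the returned L1 norm is no larger than that of the known sparse solution:

```python
import numpy as np
from scipy.optimize import linprog  # assumes SciPy is installed

rng = np.random.default_rng(1)
m, n = 15, 30
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 7, 15]] = [1.0, -2.0, 0.5]
b = A @ x_true

# Stacked variable z = [x; u]: minimize 1^T u
# subject to A x = b and -u <= x <= u.
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([A, np.zeros((m, n))])
I = np.eye(n)
A_ub = np.vstack([np.hstack([I, -I]),    #  x - u <= 0
                  np.hstack([-I, -I])])  # -x - u <= 0
b_ub = np.zeros(2 * n)
bounds = [(None, None)] * n + [(0, None)] * n

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=bounds)
x_hat = res.x[:n]
print(np.sum(np.abs(x_hat)), np.sum(np.abs(x_true)))  # optimal L1 <= L1 of x_true
```

Because x_true is feasible, the LP optimum can never have a larger L1 norm than x_true; when the recovery conditions discussed later hold, the two coincide.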

Proximal gradient methods are another option, particularly for the unconstrained LASSO/BPDN formulations. They scale better to large problems since each iteration is cheap (a matrix-vector multiply plus a soft-thresholding step).


Iterative shrinkage-thresholding algorithms

ISTA is a workhorse for L1-regularized problems. Each iteration has two steps:

  1. Gradient step: Take a step in the direction of the negative gradient of the smooth part of the objective (the data fidelity term).
  2. Soft-thresholding: Apply the proximal operator of the L1-norm, which shrinks each coefficient toward zero and sets small coefficients exactly to zero.

The soft-thresholding operator is:

\text{soft}_\lambda(x_i) = \text{sign}(x_i) \max(|x_i| - \lambda, 0)

ISTA converges at a rate of O(1/k), where k is the iteration count. FISTA (Fast ISTA) adds a momentum/acceleration step inspired by Nesterov's method, improving the rate to O(1/k^2). This acceleration is significant in practice.
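The two ISTA steps can be sketched in a few lines of NumPy. This is an illustrative implementation for the LASSO objective 0.5\|Ax - b\|_2^2 + \lambda\|x\|_1, with the step size set from the spectral norm of A; the iteration count is a simple fixed budget rather than a convergence test:

```python
import numpy as np

def soft_threshold(v, lam):
    # Proximal operator of lam * ||.||_1: shrink toward zero,
    # setting entries smaller than lam exactly to zero.
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(A, b, lam, n_iter=500):
    # Minimize 0.5*||Ax - b||_2^2 + lam*||x||_1 via ISTA.
    L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)       # gradient step on the data-fidelity term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# The operator itself: with lam = 1, it maps 3 -> 2, -0.5 -> 0, -2 -> -1.
print(soft_threshold(np.array([3.0, -0.5, -2.0]), 1.0))
```

FISTA would wrap the same two steps with a momentum update on the iterates; the per-iteration cost is unchanged.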

Homotopy methods

Homotopy methods exploit the fact that the LASSO solution path is piecewise linear as the regularization parameter \lambda varies. Instead of solving for a single \lambda, these methods trace the entire solution path efficiently.

The algorithm works by:

  1. Starting at a large \lambda where the solution is all zeros.
  2. Decreasing \lambda and tracking which coefficients enter or leave the active set (become nonzero or return to zero).
  3. At each breakpoint, updating the solution using a simple linear algebra step.

LARS (Least Angle Regression) is the most well-known homotopy method. It computes the full regularization path at roughly the cost of a single least-squares solve, making it very efficient when you need solutions across multiple sparsity levels.

Basis pursuit extensions

The standard basis pursuit formulation assumes exact measurements and no structural constraints on the coefficients. Real problems often violate these assumptions, motivating several extensions.

Basis pursuit denoising

Basis Pursuit Denoising (BPDN) handles noisy measurements b = Ax + e with \|e\|_2 \leq \epsilon. The formulation becomes:

\min \|x\|_1 \quad \text{subject to} \quad \|Ax - b\|_2 \leq \epsilon

An equivalent Lagrangian form (which is the LASSO) is:

\min \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1

The parameter \lambda (or equivalently \epsilon) controls the trade-off between sparsity and data fidelity. Choosing \lambda well is critical: too large and you over-regularize (losing signal); too small and you fit the noise.

Basis pursuit with non-negativity constraints

Some applications require all coefficients to be non-negative (e.g., spectral unmixing, where mixing proportions can't be negative). The formulation adds the constraint x \geq 0:

\min \|x\|_1 \quad \text{subject to} \quad Ax = b, \; x \geq 0

With non-negativity, the L1-norm simplifies to \mathbf{1}^T x (since all entries are non-negative, absolute values are unnecessary). This can actually make the problem easier to solve. Specialized algorithms like non-negative ISTA and active-set methods handle this efficiently.

Group sparse basis pursuit

Standard basis pursuit treats each coefficient independently. Group sparsity assumes coefficients come in predefined groups, and entire groups tend to be active or inactive together.

The mixed \ell_1/\ell_2 norm replaces the L1-norm:

\sum_{g \in \mathcal{G}} \|x_g\|_2

where x_g is the subvector of coefficients in group g. This penalizes the number of active groups (via the outer sum) while allowing flexibility within each group (via the L2-norm).

Applications include multi-task learning (where tasks share features in groups), source localization (where spatial sources activate clusters of sensors), and structured variable selection.
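Evaluating the mixed norm is straightforward; a minimal sketch (the group structure here is chosen purely for illustration):

```python
import numpy as np

def group_l1_l2(x, groups):
    # Mixed l1/l2 norm: sum over groups of the L2 norm of each subvector.
    return sum(np.linalg.norm(x[g]) for g in groups)

# One active group out of three: only indices 0-1 carry energy.
x = np.array([3.0, 4.0, 0.0, 0.0, 0.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]
print(group_l1_l2(x, groups))  # 5.0 — one active group contributes ||(3,4)||_2
```

Note the contrast with the plain L1-norm (which would give 7.0 here): within the active group, the L2 norm does not penalize spreading energy across coefficients.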

Performance guarantees

Theoretical guarantees tell you when basis pursuit will succeed. These results connect properties of the measurement matrix A and the signal's sparsity level to recovery accuracy.

Restricted isometry property

The Restricted Isometry Property (RIP) of order s states that for all s-sparse vectors x:

(1 - \delta_s)\|x\|_2^2 \leq \|Ax\|_2^2 \leq (1 + \delta_s)\|x\|_2^2

where \delta_s \in [0, 1) is the restricted isometry constant. A small \delta_s means A approximately preserves distances between sparse vectors, so distinct sparse signals produce distinct measurements.

Random matrices (Gaussian, Bernoulli, subsampled Fourier) satisfy the RIP with high probability when m = O(s \log(n/s)). Verifying the RIP for a specific matrix is itself NP-hard, but the probabilistic guarantees for random matrices are strong enough for practical use.

Exact recovery conditions

The main exact recovery result: if A satisfies the RIP of order 2s with \delta_{2s} < \sqrt{2} - 1 \approx 0.4142, then basis pursuit exactly recovers every s-sparse signal.

An alternative condition is the nullspace property: basis pursuit recovers all s-sparse signals if and only if, for every nonzero vector h in the nullspace of A, the L1-norm of h restricted to any set of s indices is strictly less than the L1-norm on the remaining indices. The nullspace property is necessary and sufficient, while the RIP is only sufficient (but easier to establish probabilistically for random matrices).

Stability and robustness

Real signals are rarely exactly sparse, and measurements always contain some noise. Performance guarantees extend to these realistic settings:

  • Stability (noisy measurements): If b = Ax_0 + e with \|e\|_2 \leq \epsilon, then the BPDN solution \hat{x} satisfies \|\hat{x} - x_0\|_2 \leq C \epsilon for a constant C depending on \delta_{2s}.
  • Robustness (approximately sparse signals): If x_0 is not exactly s-sparse, the recovery error is bounded in terms of the best s-term approximation error \sigma_s(x_0)_1 = \min_{\|z\|_0 \leq s} \|x_0 - z\|_1.

These results together give the celebrated instance-optimal recovery guarantee, meaning basis pursuit performs nearly as well as an oracle that knows the support of the sparse signal.

Computational complexity

Practical deployment of basis pursuit requires understanding the computational costs involved, especially as problem dimensions grow.

NP-hardness of L0-norm minimization

The "ideal" sparse recovery problem minimizes the L0-norm (number of nonzero entries):

\min \|x\|_0 \quad \text{subject to} \quad Ax = b

This is NP-hard, meaning no known algorithm solves it in polynomial time for general instances. Even approximating the L0 solution within certain factors is NP-hard. This computational barrier is the primary motivation for using the L1-norm relaxation instead.

Polynomial-time solvability of L1-norm minimization

Because L1 minimization is a convex problem (and can be reformulated as an LP), it's solvable in polynomial time. Specific complexity bounds depend on the algorithm:

  • Interior-point methods: O(n^3) per iteration, O(\sqrt{n}) iterations, giving roughly O(n^{3.5}) total for an n-dimensional problem.
  • First-order methods (ISTA/FISTA): Each iteration costs O(mn) (one matrix-vector multiply). Convergence to \epsilon-accuracy takes O(1/\epsilon) iterations for ISTA and O(1/\sqrt{\epsilon}) for FISTA.
  • Homotopy/LARS: Roughly O(mns), where s is the sparsity level, making it very efficient for highly sparse solutions.

Large-scale optimization techniques

For problems with millions of variables, standard methods become impractical. Scalable approaches include:

  • Coordinate descent: Updates one variable at a time, exploiting sparsity to skip zero coefficients. Very efficient for LASSO-type problems.
  • Stochastic gradient methods: Process random subsets of measurements per iteration, reducing the per-iteration cost from O(mn) to O(m'n) with m' \ll m.
  • Operator splitting methods (ADMM, split Bregman): Decompose the problem into simpler subproblems that can be solved in parallel, well-suited for distributed computing.
  • Screening rules: Identify variables guaranteed to be zero at the optimum before solving, effectively reducing problem dimension.

These techniques have enabled basis pursuit to scale to real-world problems with millions of variables and constraints, such as large-scale MRI reconstruction and hyperspectral imaging.
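As an illustration of coordinate descent, a cyclic LASSO solver fits in a few lines (a sketch, not a tuned solver; the incremental residual update is what keeps each coordinate step cheap):

```python
import numpy as np

def lasso_cd(A, b, lam, n_iter=100):
    # Cyclic coordinate descent for 0.5*||Ax - b||_2^2 + lam*||x||_1.
    m, n = A.shape
    x = np.zeros(n)
    r = b.copy()                      # running residual b - A @ x
    col_sq = np.sum(A ** 2, axis=0)   # ||a_j||^2 for each column
    for _ in range(n_iter):
        for j in range(n):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual, with x_j's own
            # contribution added back in.
            rho = A[:, j] @ r + col_sq[j] * x[j]
            # 1-D LASSO subproblem has a closed-form soft-threshold solution.
            x_new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r += A[:, j] * (x[j] - x_new)  # update residual incrementally
            x[j] = x_new
    return x
```

Each coordinate update touches only one column of A, and coefficients thresholded to zero stay cheap to revisit, which is why this approach scales well for LASSO-type problems.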