Fiveable

📊Causal Inference Unit 7 Review


7.2 Bandwidth selection and local polynomial regression


Written by the Fiveable Content Team • Last updated August 2025

Bandwidth in local polynomial regression

Bandwidth selection controls how much data goes into each local fit in a local polynomial regression. Pick too narrow a window and your estimates jump around with noise; pick too wide and you blur over the very discontinuity you're trying to detect. In regression discontinuity (RD) designs, getting this right directly determines whether your treatment effect estimate is credible.

Importance of bandwidth selection

The bandwidth $h$ defines the size of the neighborhood around each point where data contributes to the local polynomial fit. Observations within the bandwidth receive positive kernel weight; observations outside receive zero (or near-zero) weight.

This matters because:

  • Too few observations in the window means high variance — your estimate is noisy and unstable across samples.
  • Too many observations means you're averaging over a wide range of the running variable, which pulls in data far from the cutoff and introduces bias.
  • In RD specifically, the treatment effect estimate is only valid at the cutoff, so you need enough data near the cutoff to be precise, but not so much that you're fitting the wrong functional form.

Bias-variance tradeoff

The core tension in bandwidth choice is between bias and variance, and the goal is to minimize mean squared error (MSE), which combines both:

$\text{MSE} = \text{Bias}^2 + \text{Variance}$

  • Smaller $h$: Uses fewer observations near the target point. The local polynomial closely tracks the true function (low bias), but the estimate is noisy (high variance).
  • Larger $h$: Uses more observations, stabilizing the estimate (low variance), but the polynomial must approximate the regression function over a wider range, which introduces systematic error (high bias).

The optimal bandwidth is the value of $h$ that minimizes MSE. It shrinks as sample size grows — with more data, you can afford a narrower window and still keep variance under control.
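The decomposition can be checked numerically. A minimal sketch in Python, using hypothetical draws of an estimator whose true target is 2.0 (the variance here is the population variance of the draws):

```python
# Numerical check that MSE = Bias^2 + Variance.
# The estimates are hypothetical draws of an estimator of a true value of 2.0.
true_value = 2.0
estimates = [1.8, 2.1, 2.4, 1.9, 2.3]

n = len(estimates)
mean_est = sum(estimates) / n
bias = mean_est - true_value
variance = sum((e - mean_est) ** 2 for e in estimates) / n  # population variance
mse = sum((e - true_value) ** 2 for e in estimates) / n

# mse and bias**2 + variance agree exactly (up to float rounding)
```
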

Smaller vs. larger bandwidths

Think of this as a spectrum:

  • Undersmoothing (small $h$): The fitted curve is wiggly and responsive to local features. Useful when the regression function has sharp turns or local structure, but sensitive to outliers and noise.
  • Oversmoothing (large $h$): The fitted curve is smooth and stable but may miss genuine local patterns — including, in RD, the jump at the cutoff itself.

In practice, you'll often see researchers report results across a range of bandwidths to show that the treatment effect estimate isn't driven by one particular choice. This is a key part of RD robustness checks.

Methods for bandwidth selection

No single method dominates in all settings. The three main approaches each handle the bias-variance tradeoff differently.

Rule-of-thumb approach

This is the simplest option. It assumes the regression function is smooth and the errors are roughly normal, then derives a closed-form bandwidth:

$h = c \cdot \sigma \cdot n^{-1/5}$

where $c$ is a constant (depending on the kernel), $\sigma$ is the standard deviation of the residuals, and $n$ is the sample size.

The $n^{-1/5}$ rate means the bandwidth shrinks slowly as you get more data. This method is fast to compute and gives a reasonable starting point, but it can perform poorly when the smoothness assumption is wrong — for instance, when the regression function has sharp local features near the cutoff.
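A minimal sketch of this formula in Python. The constant $c = 1.06$ is an assumption borrowed from Silverman's Gaussian-kernel rule, and the residuals are made up for illustration:

```python
import statistics

def rule_of_thumb_bandwidth(residuals, c=1.06):
    """h = c * sigma * n^(-1/5); c = 1.06 assumes a Gaussian kernel."""
    sigma = statistics.stdev(residuals)  # residual standard deviation
    n = len(residuals)
    return c * sigma * n ** (-1 / 5)

# Hypothetical residuals from a preliminary fit.
residuals = [0.3, -0.5, 0.1, 0.8, -0.2, -0.4, 0.6, -0.7, 0.2, -0.1]
h = rule_of_thumb_bandwidth(residuals)
```
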

Cross-validation techniques

Cross-validation lets the data choose the bandwidth without strong assumptions about the regression function.

  1. Choose a set of candidate bandwidths.
  2. For each candidate $h$, fit the local polynomial using all observations except one (leave-one-out) or except one fold (k-fold).
  3. Predict the held-out observation(s) and compute the prediction error.
  4. Repeat for every observation or fold.
  5. Select the $h$ that minimizes the average prediction error (typically MSE).

Leave-one-out cross-validation (LOOCV) uses every observation as its own validation set, which is thorough but computationally expensive for large datasets. K-fold cross-validation splits data into $k$ groups and is faster but introduces some randomness from the fold assignment.

Cross-validation is data-driven and flexible, but it optimizes global fit rather than estimation at the cutoff specifically. For RD applications, this distinction matters.
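The steps above can be sketched in pure Python. For brevity this sketch uses a local constant (Nadaraya-Watson) fit in place of a full local polynomial, with a Gaussian kernel and a made-up dataset and candidate grid:

```python
import math

def nw_predict(x0, xs, ys, h, skip=None):
    """Nadaraya-Watson (local constant) prediction at x0 with a Gaussian
    kernel, optionally leaving out the observation at index `skip`."""
    num = den = 0.0
    for i, (x, y) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    return num / den

def loocv_score(xs, ys, h):
    """Mean squared leave-one-out prediction error for bandwidth h."""
    errs = [(ys[i] - nw_predict(xs[i], xs, ys, h, skip=i)) ** 2
            for i in range(len(xs))]
    return sum(errs) / len(errs)

# Made-up noisy observations around an assumed smooth curve y = x^2.
xs = [i / 10 for i in range(21)]
ys = [x ** 2 + (0.05 if i % 2 else -0.05) for i, x in enumerate(xs)]

candidates = [0.05, 0.1, 0.2, 0.4, 0.8]
best_h = min(candidates, key=lambda h: loocv_score(xs, ys, h))
```
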


Plug-in methods

Plug-in methods work directly with the asymptotic mean integrated squared error (AMISE) formula, which expresses the optimal bandwidth in terms of unknown quantities — typically the second derivative of the regression function and the noise variance.

  1. Estimate the unknown quantities (e.g., fit a preliminary regression to estimate curvature).
  2. Plug those estimates into the AMISE formula.
  3. Solve for the bandwidth that minimizes the estimated AMISE.

Silverman's rule of thumb is a simple plug-in method. The direct plug-in (DPI) method is more sophisticated, iteratively refining the preliminary estimates. These methods are faster than cross-validation but only as good as the preliminary estimates they rely on. If the pilot estimate of curvature is poor, the resulting bandwidth will be too.

For RD designs specifically, the most widely used approach is the method of Imbens and Kalyanaraman (2012) and its refinement by Calonico, Cattaneo, and Titiunik (CCT, 2014). CCT's procedure selects an MSE-optimal bandwidth and then applies a bias correction with robust confidence intervals. If you're doing applied RD work, the rdrobust package (available in R and Stata) implements this directly.

Local polynomial regression

Local polynomial regression fits a polynomial of order $p$ in a neighborhood around each target point, weighting observations by a kernel function. Unlike global polynomial regression (which fits one polynomial to all the data), local polynomial regression adapts to the shape of the regression function at each location.

Advantages over kernel regression

Standard kernel regression (the Nadaraya-Watson estimator) fits a constant locally — it's essentially a weighted average. Local polynomial regression improves on this in several ways:

  • Captures curvature: A local linear or quadratic fit approximates the slope and curvature of the true function, not just its level.
  • Reduces boundary bias: Kernel regression is biased at the edges of the data because the kernel is asymmetric there. Local polynomials naturally adapt because the polynomial extrapolates into the sparse region.
  • Better theoretical properties: Local polynomial estimators achieve faster convergence rates and lower minimax MSE compared to kernel estimators of the same order.

These advantages are especially relevant in RD, where you're estimating the regression function right at the cutoff — a boundary point by construction.

Fitting local polynomials

To estimate the regression function at a point $x$:

  1. Assign weights to each observation $i$ using a kernel function $K\left(\frac{X_i - x}{h}\right)$, where $h$ is the bandwidth.

  2. Construct the local design matrix with polynomial terms: $1, (X_i - x), (X_i - x)^2, \ldots, (X_i - x)^p$.

  3. Solve the weighted least squares problem — minimize $\sum_{i=1}^{n} K\left(\frac{X_i - x}{h}\right) \left[Y_i - \beta_0 - \beta_1(X_i - x) - \cdots - \beta_p(X_i - x)^p\right]^2$.

  4. The estimate at $x$ is $\hat{m}(x) = \hat{\beta}_0$ — the intercept of the local fit.

The resulting estimator can be written as a linear smoother:

$\hat{m}(x) = \sum_{i=1}^{n} W_i(x) Y_i$

where $W_i(x)$ are the equivalent kernel weights that come out of the weighted least squares solution.
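For the local linear case ($p = 1$), the weighted least squares problem reduces to a 2x2 system that can be solved directly. A minimal sketch, assuming a Gaussian kernel and made-up test data:

```python
import math

def local_linear_fit(x0, xs, ys, h):
    """Local linear estimate of m(x0): weighted least squares of y on
    (x - x0) with Gaussian kernel weights. Returns the intercept beta0."""
    # Accumulate the weighted sums for the 2x2 normal equations.
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        u = x - x0
        w = math.exp(-0.5 * (u / h) ** 2)
        s0 += w; s1 += w * u; s2 += w * u * u
        t0 += w * y; t1 += w * u * y
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det  # beta0 = estimate at x0

# Exactly linear data: a local linear fit reproduces it regardless of weights.
xs = [i / 4 for i in range(9)]      # 0.0, 0.25, ..., 2.0
ys = [2 * x + 1 for x in xs]        # y = 2x + 1
est = local_linear_fit(1.0, xs, ys, h=0.5)
```

Because a local linear fit reproduces any exactly linear function, the fitted value at $x = 1$ recovers the true value $2 \cdot 1 + 1 = 3$.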

Order of polynomial fit

The polynomial order $p$ controls how flexible each local fit is:

  • $p = 0$ (local constant): Equivalent to kernel regression. Simple but prone to boundary bias.
  • $p = 1$ (local linear): The workhorse choice in RD designs. It handles boundary bias well and is relatively stable.
  • $p = 2$ (local quadratic): Captures curvature better but introduces more variance, especially with small samples.

Higher-order polynomials can approximate more complex local shapes, but they also increase variance and can produce erratic behavior near the boundaries. In most applied RD work, local linear regression is the default because it strikes a good balance: it corrects for boundary bias at the cutoff without overfitting.

A useful rule: the polynomial order and the bandwidth interact. If you increase $p$, you typically need a larger bandwidth to keep variance manageable.

Boundary bias in local polynomial regression

In RD designs, the cutoff is literally a boundary — you estimate the regression function separately on each side, so every estimate at the cutoff is a one-sided boundary estimate. This makes boundary bias a first-order concern, not just a technical nuisance.


Causes of boundary bias

Boundary bias occurs because the kernel window extends beyond the range of available data on one side:

  • At the cutoff in an RD design, observations exist only to the left (control) or only to the right (treated) of the point you're estimating.
  • The kernel is effectively truncated, making the weighted average asymmetric.
  • This asymmetry means the estimator systematically over- or under-estimates the true function value at the boundary.

The severity depends on the bandwidth (wider bandwidths worsen it), the kernel shape, and the curvature of the true regression function near the boundary.
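The effect is easy to see on noise-free data. In the sketch below (pure Python, Gaussian kernel, made-up linear data on $[0, 1]$), a local constant fit at the left boundary can only average observations to its right, so the estimate lands above the true value of 1.0:

```python
import math

def nw_at(x0, xs, ys, h):
    """Local constant (Nadaraya-Watson) estimate at x0, Gaussian kernel."""
    num = den = 0.0
    for x, y in zip(xs, ys):
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    return num / den

# Noise-free linear data on [0, 1]; the true value at the left boundary is 1.0.
xs = [i / 20 for i in range(21)]
ys = [1.0 + 2.0 * x for x in xs]     # y = 1 + 2x

est = nw_at(0.0, xs, ys, h=0.2)
# Every observation lies to the right of x = 0, so the weighted average
# is pulled upward: est > 1.0. This is boundary bias.
```
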

Techniques for bias correction

Several strategies reduce boundary bias:

  • Boundary kernels: Modified kernel functions (e.g., Epanechnikov boundary kernel, Gamma boundary kernel) that adjust their shape near boundaries to restore symmetry in the weighting. They work but require knowing where the boundaries are and can be tricky to implement.
  • Local linear regression: Automatically adapts at boundaries because the linear fit extrapolates naturally. This is the most common practical solution in RD.
  • Explicit bias correction: The CCT (2014) procedure estimates the leading bias term (which depends on the second derivative of the regression function) and subtracts it from the point estimate. The confidence intervals are then constructed to account for the remaining uncertainty from this correction.
  • Resampling methods: Jackknife and bootstrap procedures can estimate and correct for boundary bias, though they add computational cost.

Boundary kernels vs. local linear regression

Both approaches address boundary bias, but they work differently:

  • Boundary kernels fix the problem by reshaping the weights so the effective kernel is symmetric even at the boundary. They require explicit construction for each boundary region.
  • Local linear regression fixes the problem implicitly — the linear term in the local fit absorbs the first-order bias that causes boundary distortion.

In practice, local linear regression is far more common in RD applications because it handles boundary bias as a built-in feature rather than requiring a separate correction. Boundary kernels are more common in density estimation and other nonparametric settings.

Applications in causal inference

Local polynomial regression is not just a curve-fitting tool — it's central to several causal inference strategies.

Estimating treatment effects

In observational studies, local polynomial regression can estimate treatment effects nonparametrically:

  • Fit separate local polynomial regressions for treated and control groups as functions of covariates.
  • The difference in fitted values at each covariate value estimates the conditional average treatment effect (CATE).
  • Averaging over the appropriate population gives the ATE or ATT.

This approach avoids imposing a parametric functional form on the outcome-covariate relationship, reducing the risk of specification bias.
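A minimal sketch of this recipe, assuming a Gaussian kernel and made-up treated/control samples whose true CATE is 1.5 at every covariate value (the local linear fits recover it exactly here because the simulated functions are linear):

```python
import math

def local_linear_at(x0, pts, h):
    """Local linear WLS estimate of E[Y | X = x0], Gaussian kernel."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in pts:
        u = x - x0
        w = math.exp(-0.5 * (u / h) ** 2)
        s0 += w; s1 += w * u; s2 += w * u * u
        t0 += w * y; t1 += w * u * y
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)

# Hypothetical samples: control outcome 0.3x, treated outcome 0.3x + 1.5,
# so the true CATE is 1.5 at every x.
control = [(x / 10, 0.3 * (x / 10)) for x in range(0, 21)]
treated = [(x / 10, 0.3 * (x / 10) + 1.5) for x in range(0, 21)]

# Difference in fitted values over a grid estimates CATE(x).
x_grid = [0.5, 1.0, 1.5]
cate = [local_linear_at(x, treated, 0.3) - local_linear_at(x, control, 0.3)
        for x in x_grid]
```
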

Regression discontinuity designs

This is where bandwidth selection and local polynomial regression come together most directly. In a sharp RD design:

  1. Choose a bandwidth $h$ around the cutoff $c$ (using, e.g., the CCT optimal bandwidth selector).

  2. Restrict the sample to observations with running variable values in $[c - h, c + h]$.

  3. Fit a local polynomial (typically local linear) separately on each side of the cutoff.

  4. The treatment effect estimate is the difference in the two fitted values at the cutoff: $\hat{\tau} = \hat{m}_+(c) - \hat{m}_-(c)$.

The bandwidth determines how much data informs the estimate. Robustness checks typically vary the bandwidth (e.g., half and double the optimal) and the polynomial order to confirm the result isn't sensitive to these choices.
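The four steps can be sketched in pure Python. This is an illustration only, assuming a triangular kernel (a common choice in RD software) and simulated linear data with a true jump of 2.0 at the cutoff; real applications should use a dedicated implementation such as rdrobust:

```python
import math

def local_linear_at(x0, pts, h):
    """Local linear WLS estimate at x0 with a triangular kernel,
    which gives zero weight outside the bandwidth."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in pts:
        u = x - x0
        k = max(0.0, 1 - abs(u) / h)   # triangular kernel weight
        s0 += k; s1 += k * u; s2 += k * u * u
        t0 += k * y; t1 += k * u * y
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det

def sharp_rd_estimate(data, cutoff, h):
    """tau_hat = m_+(c) - m_-(c): separate local linear fits on each side,
    restricted to the window [c - h, c + h]."""
    left = [(x, y) for x, y in data if cutoff - h <= x < cutoff]
    right = [(x, y) for x, y in data if cutoff <= x <= cutoff + h]
    return local_linear_at(cutoff, right, h) - local_linear_at(cutoff, left, h)

# Simulated sharp RD: outcome jumps by 2.0 at cutoff c = 0, linear on each side.
data = [(x / 20, 0.5 * (x / 20) + (2.0 if x >= 0 else 0.0))
        for x in range(-20, 21)]
tau = sharp_rd_estimate(data, cutoff=0.0, h=0.5)
```

Because the simulated regression function is linear on each side, the local linear fits are exact and the estimate recovers the true jump of 2.0.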

Generalized propensity score estimation

For continuous or multi-valued treatments, the generalized propensity score (GPS) extends the standard propensity score framework. Local polynomial regression can estimate the GPS by modeling the conditional density of treatment given covariates.

Once estimated, the GPS is used to:

  • Balance covariates across treatment levels.
  • Estimate dose-response functions showing how outcomes vary with treatment intensity.

Local polynomial regression is well-suited here because the relationship between treatment assignment and covariates may be nonlinear, and a flexible estimator avoids the misspecification risk of parametric models.