Overview of inverse probability weighting
Inverse probability weighting (IPW) is a technique for estimating causal effects from observational data. The core problem it solves: in observational studies, the people who get treated differ systematically from those who don't, so a naive comparison of outcomes is confounded. IPW fixes this by reweighting observations so that the confounders are balanced across treatment groups, effectively creating a pseudo-population that mimics what you'd see in a randomized experiment.
The weight assigned to each observation is the inverse of the probability that the person received the treatment they actually got, given their covariates. If a treated person had a low probability of being treated (based on their characteristics), they get a large weight because they represent many similar people who were not treated. This upweighting of "unlikely" observations is what drives the balance.
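A toy numeric sketch of this rule (the propensity scores here are made up for illustration):

```python
import numpy as np

# Hypothetical propensity scores P(treated | covariates) and actual
# treatment indicators for five people.
propensity = np.array([0.9, 0.5, 0.1, 0.8, 0.2])
treated = np.array([1, 1, 1, 0, 0])

# Weight = 1 / P(received the treatment actually received):
# treated get 1 / e(X), untreated get 1 / (1 - e(X)).
weights = np.where(treated == 1, 1 / propensity, 1 / (1 - propensity))

# The treated person with propensity 0.1 was "unlikely" to be treated,
# so they get weight 10: they stand in for ten similar untreated people.
print(weights)
```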
Motivation for the weighting approach
In a randomized experiment, treatment is assigned independently of confounders, so you can compare outcomes directly. Observational data doesn't give you that luxury. Confounders influence who gets treated, which biases simple comparisons.
IPW tackles this by asking: given someone's covariates, how likely were they to receive the treatment they got? Then it weights each person inversely by that probability. The result is a weighted sample where the distribution of confounders looks roughly the same in the treated and untreated groups. You're not removing confounded individuals from the data; you're re-balancing the entire sample through weights.
Assumptions of IPW
Positivity assumption
Every individual, regardless of their covariate values, must have a non-zero probability of receiving each treatment level. Formally, for every treatment level a and every covariate value x that occurs in the population:

0 < P(A = a | X = x) < 1
If some region of the covariate space has zero (or near-zero) probability of a particular treatment, the corresponding weights blow up. This is called a positivity violation, and it leads to extreme weights and unstable estimates. Practical near-violations (propensity scores very close to 0 or 1) cause similar problems even when strict positivity technically holds.
Exchangeability assumption
Also called the no-unmeasured-confounders assumption. It requires that, conditional on the observed covariates X, treatment assignment is independent of the potential outcomes:

(Y^1, Y^0) ⊥ A | X
This means every variable that jointly affects both treatment and outcome must be measured and included in the propensity score model. If an unmeasured confounder exists, the weights won't balance it, and your causal estimate will be biased. This assumption is untestable from the data alone, which is why subject-matter knowledge is so important when specifying the model.
Estimating weights
Propensity score models
The propensity score is the probability of receiving treatment given covariates:

e(X) = P(A = 1 | X)
For binary treatments, this is most commonly estimated with logistic regression, where treatment status is the outcome and the confounders are predictors. Variable selection matters a lot here. You should include true confounders (variables that affect both treatment and outcome): omitting an important confounder introduces bias. Including variables that predict only the outcome (not treatment) can actually improve efficiency, but including instruments (variables that affect treatment but not the outcome) can increase variance.
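A minimal sketch of this step on simulated data, fitting the logistic model by Newton-Raphson in plain NumPy rather than a statistics package (the data-generating values are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated data: one confounder x drives treatment assignment.
x = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-(0.5 * x - 0.2)))   # true propensity
a = rng.binomial(1, p_treat)

def fit_logistic(X, y, n_iter=25):
    """Logistic regression via Newton-Raphson; X includes an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    return beta

X = np.column_stack([np.ones(n), x])
beta_hat = fit_logistic(X, a)
e_hat = 1 / (1 + np.exp(-X @ beta_hat))   # estimated propensity scores
print(beta_hat)   # should be close to the true (-0.2, 0.5)
```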
Stabilized vs. unstabilized weights
Unstabilized weights for individual i:

w_i = 1 / P(A = A_i | X = X_i)

A treated person gets weight 1 / e(X_i), and an untreated person gets weight 1 / (1 - e(X_i)). These weights can be highly variable, especially when propensity scores are near 0 or 1.
Stabilized weights replace the numerator of 1 with the marginal probability of the treatment actually received:

sw_i = P(A = A_i) / P(A = A_i | X = X_i)
Stabilized weights balance confounders just as unstabilized weights do in expectation, but with lower variance. They also produce a pseudo-population closer in size to the original sample. In practice, stabilized weights are preferred because they yield more efficient and stable estimates.
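A small simulated comparison of the two weight types (the propensity scores are drawn uniformly for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Hypothetical propensity scores and resulting treatment assignments.
e = rng.uniform(0.05, 0.95, size=n)
a = rng.binomial(1, e)

# Unstabilized: 1 / P(A = a_i | X_i).
w_unstab = np.where(a == 1, 1 / e, 1 / (1 - e))

# Stabilized: numerator is the marginal probability P(A = a_i).
p_treated = a.mean()
w_stab = np.where(a == 1, p_treated / e, (1 - p_treated) / (1 - e))

print(w_unstab.var(), w_stab.var())   # stabilized variance is much smaller
print(w_unstab.sum(), w_stab.sum())   # stabilized pseudo-population is about n
```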
Fitting outcome models
Weighted regression
Once you have the weights, you fit a regression model for the outcome with each observation weighted by its IPW. For continuous outcomes, use weighted least squares. For binary outcomes, use weighted logistic regression.
The treatment variable should be included as a predictor. If residual imbalance remains after weighting (checked via balance diagnostics), you may also include additional covariates in the outcome model, though the primary confounding adjustment comes from the weights.

Estimating causal effects
The coefficient on the treatment variable in the weighted model estimates the average treatment effect (ATE) in the pseudo-population. For a binary treatment, this is the difference in expected outcomes between treated and untreated groups after reweighting.
Standard errors from a naive weighted regression are typically too small because they ignore the uncertainty in estimating the weights. Use robust (sandwich) standard errors to get valid confidence intervals. Bootstrapping is another option that accounts for the full estimation process, including the propensity score step.
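A compact end-to-end sketch on simulated data with a known effect of 2.0, using the true propensity score for simplicity and HC0 sandwich standard errors:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000

# Simulated data: x confounds treatment and outcome; the true ATE is 2.0.
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))                      # true propensity score
t = rng.binomial(1, e)
y = 2.0 * t + 1.5 * x + rng.normal(size=n)

# Stabilized IPW weights (true propensity used here for simplicity).
w = np.where(t == 1, t.mean() / e, (1 - t.mean()) / (1 - e))

# Weighted least squares of y on (1, treatment).
X = np.column_stack([np.ones(n), t])
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))

# Robust (HC0 sandwich) standard errors: bread @ meat @ bread.
resid = y - X @ beta
u = (w * resid)[:, None] * X                  # per-observation score terms
bread = np.linalg.inv(XtWX)
cov = bread @ (u.T @ u) @ bread
se = np.sqrt(np.diag(cov))
print(beta[1], se[1])   # treatment coefficient near 2.0, with its robust SE
```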
Assessing covariate balance
Standardized mean differences
After applying weights, you need to verify that confounders are actually balanced. The standardized mean difference (SMD) for each covariate compares the weighted means between treatment groups, scaled by the pooled standard deviation.
- SMD close to 0 indicates good balance.
- The conventional threshold is SMD < 0.1 for adequate balance.
- Check SMDs for every confounder, not just a few. A Love plot (SMDs before and after weighting, plotted side by side) is a useful visual tool.
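A sketch of a weighted SMD check; the helper `weighted_smd` is invented for illustration, and the simulated data uses a known propensity score so the weighting should balance the confounder:

```python
import numpy as np

def weighted_smd(x, a, w):
    """Weighted standardized mean difference for one covariate."""
    def wmean(v, wt):
        return np.sum(wt * v) / np.sum(wt)
    def wvar(v, wt):
        m = wmean(v, wt)
        return np.sum(wt * (v - m) ** 2) / np.sum(wt)
    m1 = wmean(x[a == 1], w[a == 1])
    m0 = wmean(x[a == 0], w[a == 0])
    # Pooled SD from the two weighted group variances.
    s = np.sqrt((wvar(x[a == 1], w[a == 1]) + wvar(x[a == 0], w[a == 0])) / 2)
    return (m1 - m0) / s

rng = np.random.default_rng(3)
n = 5000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))          # true propensity score
a = rng.binomial(1, e)
w = np.where(a == 1, 1 / e, 1 / (1 - e))

smd_raw = weighted_smd(x, a, np.ones(n))   # before weighting: imbalanced
smd_ipw = weighted_smd(x, a, w)            # after weighting: near zero
print(smd_raw, smd_ipw)
```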
Variance ratios
Balancing means isn't enough; you should also check that the distributions have similar spread. The variance ratio (VR) divides the weighted variance of a covariate in the treated group by its weighted variance in the untreated group.
- VR close to 1 indicates good balance.
- Values far from 1 suggest the weighting hasn't adequately balanced higher-order features of the distribution.
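A matching sketch for the variance ratio (the helper name `weighted_variance_ratio` is invented for illustration; the example uses random assignment, so the ratio should land near 1):

```python
import numpy as np

def weighted_variance_ratio(x, a, w):
    """Weighted variance of x in the treated group over the untreated group."""
    def wvar(v, wt):
        m = np.sum(wt * v) / np.sum(wt)
        return np.sum(wt * (v - m) ** 2) / np.sum(wt)
    return wvar(x[a == 1], w[a == 1]) / wvar(x[a == 0], w[a == 0])

rng = np.random.default_rng(4)
n = 1000
x = rng.normal(size=n)
a = rng.binomial(1, 0.5, size=n)   # random assignment: spreads should match
vr = weighted_variance_ratio(x, a, np.ones(n))
print(vr)   # close to 1 when the groups have similar spread
```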
If substantial imbalances persist, revisit the propensity score model. Consider adding interaction terms, nonlinear terms, or using a more flexible estimation method (e.g., generalized boosted models).
Comparison to other methods
IPW vs. matching
Both IPW and matching use propensity scores to address confounding, but they do so differently. Matching pairs (or groups) treated and untreated individuals with similar propensity scores and discards unmatched units. IPW keeps all observations and reweights them instead.
- Matching is often more intuitive to explain and can be easier to diagnose visually.
- IPW retains the full sample, which can improve efficiency.
- Matching results can be sensitive to the choice of algorithm (nearest-neighbor, caliper width, with/without replacement).
- IPW results can be sensitive to extreme propensity scores.
IPW vs. stratification
Stratification divides the sample into subgroups (often quintiles) based on propensity scores and estimates treatment effects within each stratum. The overall ATE is a weighted average across strata.
- Stratification is more robust to propensity score model misspecification because it only requires approximate ranking, not exact probability estimates.
- However, stratification can leave residual confounding within strata if the strata are too coarse (e.g., using only 5 strata for a complex confounder structure).
- IPW generally provides more precise estimates when the propensity score model is well-specified.
Limitations of IPW
Sensitivity to model misspecification
IPW's validity depends on the propensity score model being correctly specified. If you omit a confounder or get the functional form wrong (e.g., assuming linearity when the true relationship is nonlinear), the weights won't properly balance the confounders, and your causal estimate will be biased.
Sensitivity analyses can help. Try varying the set of included confounders, adding interaction or polynomial terms, or using flexible machine learning methods for propensity score estimation. If results change substantially, that's a warning sign.

Instability with extreme weights
When propensity scores are near 0 or 1, the resulting weights become very large, and a small number of observations can dominate the analysis. This inflates variance and makes estimates unreliable.
Common remedies:
- Weight truncation: cap weights at a chosen percentile (e.g., the 99th percentile) so no single observation has outsized influence.
- Weight trimming: exclude observations with propensity scores below or above certain thresholds (e.g., drop anyone with e(X) < α or e(X) > 1 - α for some small α).
- Stabilized weights: as discussed above, these naturally reduce weight variability.
Both truncation and trimming introduce some bias because they effectively change the target population. Use them judiciously and report the sensitivity of your results to the chosen threshold.
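A short sketch of both remedies on simulated weights (the 99th-percentile cap and the trimming threshold α = 0.05 are illustrative choices, not recommendations):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10000
e = rng.uniform(0.001, 0.999, size=n)   # propensity scores, some extreme
a = rng.binomial(1, e)
w = np.where(a == 1, 1 / e, 1 / (1 - e))

# Truncation: cap weights at the 99th percentile.
cap = np.percentile(w, 99)
w_trunc = np.minimum(w, cap)

# Trimming: drop units with extreme propensity scores.
alpha = 0.05
keep = (e > alpha) & (e < 1 - alpha)
w_trim = w[keep]

# After trimming at alpha, no weight can exceed 1 / alpha.
print(w.max(), w_trunc.max(), w_trim.max())
```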
Applications of IPW
Time-varying treatments
IPW extends naturally to settings where treatment changes over time. At each time point, you estimate the probability of the treatment actually received, conditional on the full covariate and treatment history up to that point. The weight for each individual is the product of the inverse probabilities across all time points.
This is where IPW really shines relative to standard regression, because time-varying confounders that are also affected by prior treatment create a problem that conventional adjustment can't handle without introducing bias.
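A minimal sketch of the cumulative weight construction, with made-up per-time-point treatment probabilities standing in for fitted model output (in a real analysis each probability would come from a model conditional on history):

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 1000, 3

# Hypothetical P(A_t = 1 | history) for each person at each time point,
# and the treatments actually received.
p = rng.uniform(0.2, 0.8, size=(n, T))
a = rng.binomial(1, p)

# Probability of the treatment actually received at each time point.
p_received = np.where(a == 1, p, 1 - p)

# Cumulative weight: product of inverse probabilities over all time points.
w = np.prod(1 / p_received, axis=1)
print(w.min(), w.max())
```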
Marginal structural models
Marginal structural models (MSMs) are the modeling framework most commonly paired with IPW for time-varying treatments. An MSM specifies the marginal mean of the potential outcome as a function of the treatment history, without conditioning on the time-varying confounders.
The key steps:
- Estimate the probability of treatment at each time point, given covariate and treatment history.
- Construct stabilized weights as the cumulative product of time-point-specific weights.
- Fit a weighted outcome model (e.g., weighted GEE or weighted pooled logistic regression) with the treatment history as predictors.
MSMs also typically incorporate inverse probability of censoring weights to handle loss to follow-up, multiplying treatment weights by censoring weights.
Simulation studies of IPW
Evaluating bias reduction
Simulation studies let you test IPW under controlled conditions where the true causal effect is known. You generate data with a specified causal structure, apply IPW, and compare the estimated effect to the truth.
This approach is valuable for understanding how IPW performs when assumptions are met versus violated. For example, you can simulate data with an unmeasured confounder and quantify how much bias remains, or introduce near-positivity violations and observe the resulting instability.
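A small simulation in this spirit: the true ATE is 1.0, a single confounder drives both treatment and outcome, and a naive difference in means is compared to a Horvitz-Thompson IPW estimate that uses the true propensity score:

```python
import numpy as np

rng = np.random.default_rng(7)

def one_run(n=2000, true_ate=1.0):
    x = rng.normal(size=n)
    e = 1 / (1 + np.exp(-x))         # true propensity score
    a = rng.binomial(1, e)
    y = true_ate * a + x + rng.normal(size=n)
    # Naive difference in means (confounded by x).
    naive = y[a == 1].mean() - y[a == 0].mean()
    # Horvitz-Thompson IPW estimate using the true propensity.
    ipw = np.mean(a * y / e) - np.mean((1 - a) * y / (1 - e))
    return naive, ipw

results = np.array([one_run() for _ in range(200)])
bias_naive = results[:, 0].mean() - 1.0
bias_ipw = results[:, 1].mean() - 1.0
print(bias_naive, bias_ipw)   # naive is biased upward; IPW bias is near zero
```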
Comparing to other estimators
Simulations also allow head-to-head comparisons of IPW against matching, stratification, g-computation, and doubly robust methods. The usual metrics are bias, variance, mean squared error (MSE), and confidence interval coverage.
A consistent finding across many simulation studies: IPW performs well when the propensity score model is correct but can be outperformed by doubly robust methods when there's a risk of model misspecification. No single method dominates in all scenarios, which is why understanding the tradeoffs matters.
Extensions of IPW
Doubly robust estimation
Doubly robust (DR) estimators combine a propensity score model with an outcome model. The key property: the estimate is consistent if either model is correctly specified (though not necessarily both). This gives you two chances to get it right.
A common DR estimator is the augmented IPW (AIPW) estimator, which starts with an IPW estimate and adds a correction term based on the outcome model. If the propensity score model is wrong but the outcome model is right, the correction term removes the bias, and vice versa. DR estimators also tend to be more efficient than IPW alone when both models are at least approximately correct.
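A sketch of the AIPW estimator on simulated data, with linear outcome models fit per arm and the true propensity score used for simplicity:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4000
x = rng.normal(size=n)
e = 1 / (1 + np.exp(-x))               # true propensity score
a = rng.binomial(1, e)
y = 2.0 * a + x + rng.normal(size=n)   # true ATE = 2.0

def linfit(xv, yv):
    """Least-squares fit of yv on (1, xv)."""
    X = np.column_stack([np.ones(len(xv)), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0]

# Outcome models fit separately in each treatment arm.
b1 = linfit(x[a == 1], y[a == 1])
b0 = linfit(x[a == 0], y[a == 0])
m1 = b1[0] + b1[1] * x   # predicted outcome under treatment, for everyone
m0 = b0[0] + b0[1] * x   # predicted outcome under control

# AIPW: outcome-model estimate plus an IPW correction on the residuals.
aipw = np.mean(m1 - m0
               + a * (y - m1) / e
               - (1 - a) * (y - m0) / (1 - e))
print(aipw)   # close to the true ATE of 2.0
```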
Targeted maximum likelihood estimation
Targeted maximum likelihood estimation (TMLE) is another doubly robust approach with some additional advantages. The procedure works in three stages:
- Fit an initial outcome model (e.g., using machine learning methods like Super Learner).
- Update the initial model using a "clever covariate" derived from the propensity score. This targeting step optimizes the estimate specifically for the causal parameter of interest.
- Average the targeted predictions to obtain the ATE estimate.
TMLE is doubly robust and, under regularity conditions, achieves the semiparametric efficiency bound. It also respects the bounds of the outcome (e.g., predicted probabilities stay between 0 and 1), which AIPW does not always guarantee. These properties make TMLE a strong default choice for causal inference in observational studies.
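A simplified TMLE sketch for a binary outcome on simulated data, using a plain logistic regression in place of Super Learner and the true propensity score in place of an estimated one:

```python
import numpy as np

def expit(z):
    return 1 / (1 + np.exp(-z))

def logit(p):
    return np.log(p / (1 - p))

def fit_logistic(X, yv, offset=0.0, n_iter=50):
    """Logistic regression via Newton-Raphson, with an optional offset."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = expit(offset + X @ beta)
        W = p * (1 - p) + 1e-8
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (yv - p))
    return beta

rng = np.random.default_rng(9)
n = 5000
x = rng.normal(size=n)
g = expit(0.5 * x)                       # true propensity score
a = rng.binomial(1, g)
y = rng.binomial(1, expit(-1 + a + x))   # binary outcome

# Stage 1: initial outcome model, logistic regression of y on (1, a, x).
X = np.column_stack([np.ones(n), a, x])
beta = fit_logistic(X, y)
Q1 = expit(np.column_stack([np.ones(n), np.ones(n), x]) @ beta)   # under a=1
Q0 = expit(np.column_stack([np.ones(n), np.zeros(n), x]) @ beta)  # under a=0
QA = np.where(a == 1, Q1, Q0)

# Stage 2: targeting step with the "clever covariate" H(A, X).
H = a / g - (1 - a) / (1 - g)
eps = fit_logistic(H[:, None], y, offset=logit(QA))[0]

# Stage 3: targeted predictions, averaged to get the ATE.
Q1_star = expit(logit(Q1) + eps / g)
Q0_star = expit(logit(Q0) - eps / (1 - g))
ate_tmle = np.mean(Q1_star - Q0_star)
print(ate_tmle)   # risk difference; predictions stay inside (0, 1) by design
```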