Definition of propensity scores
The propensity score is the probability that a subject receives treatment, conditional on their observed baseline characteristics. In notation, for subject i with covariates X_i:

e(X_i) = Pr(T_i = 1 | X_i)

where T_i = 1 indicates treatment and T_i = 0 indicates control.
In a randomized experiment, every subject has a known probability of being assigned to treatment. Observational studies lack this design, so the propensity score estimates that probability from the data. The core idea comes from Rosenbaum and Rubin (1983): if two subjects have the same propensity score, their observed covariates are, on average, balanced across treatment groups. This means you can condition on a single scalar (the propensity score) instead of trying to match or adjust on many covariates simultaneously.
Propensity scores only balance on observed covariates. They cannot fix confounding from variables you didn't measure.
Estimating propensity scores
Logistic regression for propensity score estimation
The most common approach is logistic regression with treatment assignment as the outcome and observed covariates as predictors:

logit(Pr(T_i = 1 | X_i)) = β0 + β1 X_i1 + ... + βp X_ip
The predicted probabilities from this model are your estimated propensity scores. Logistic regression is straightforward to implement and lets you inspect how each covariate relates to treatment assignment through the estimated coefficients.
One thing to keep in mind: the goal here is prediction of treatment assignment, not inference about individual coefficients. You're not testing whether a covariate "significantly" predicts treatment. You just want the best possible estimate of each subject's treatment probability.
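As a minimal sketch (assuming numpy and scikit-learn are available; the covariates, sample size, and simulated assignment mechanism are all illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Illustrative baseline covariates
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([age, severity])

# Simulated treatment assignment that depends on the covariates
logit = -0.05 * (age - 50) + 0.8 * severity
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression with treatment as the outcome; the fitted
# probabilities are the estimated propensity scores
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
```

Note that the outcome variable never appears: the propensity score model only sees treatment and covariates.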
Machine learning methods for propensity score estimation
Methods like random forests, gradient boosting (e.g., GBM, XGBoost), and other flexible algorithms can also estimate propensity scores. These approaches:
- Capture non-linear relationships and interactions automatically
- Can handle high-dimensional covariate spaces
- May outperform logistic regression when the true relationship between covariates and treatment is complex
The tradeoff is reduced interpretability and the risk of overfitting, which can produce extreme propensity scores near 0 or 1. Regardless of the estimation method, what ultimately matters is whether the resulting scores achieve good covariate balance.
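A sketch of the same idea with gradient boosting (scikit-learn assumed; the non-linear assignment mechanism is simulated for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))

# Simulated assignment with an interaction and a quadratic term,
# which a main-effects logistic model would miss
logit = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2 - 0.5
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
ps = gbm.fit(X, treat).predict_proba(X)[:, 1]

# Flexible learners can overfit toward 0 or 1, so inspect the tails
share_extreme = np.mean((ps < 0.05) | (ps > 0.95))
```

Checking the share of extreme scores, as in the last line, is one quick diagnostic for the overfitting risk mentioned above.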
Assessing propensity score balance
Estimating propensity scores is not the end goal. You need to verify that conditioning on them actually balances covariates between groups.
Visual assessment of propensity score balance
- Plot the distribution of estimated propensity scores for treated and control groups using histograms, density plots, or box plots
- Look for overlap (common support): both groups should have propensity scores across a similar range. Regions where one group has scores but the other doesn't indicate positivity violations
- After applying your propensity score method (matching, weighting, etc.), re-plot covariate distributions to confirm they've become more similar
Statistical assessment of propensity score balance
The standardized mean difference (SMD) is the preferred balance metric. For a covariate X:

SMD = (x̄_treated − x̄_control) / √((s²_treated + s²_control) / 2)

where x̄ and s² denote each group's sample mean and variance of X.
Calculate SMDs before and after propensity score adjustment to quantify improvement. A common rule of thumb is that SMDs below 0.1 indicate adequate balance, though this threshold isn't absolute.
Avoid relying on p-values from t-tests or chi-square tests for balance checking. These tests are sensitive to sample size: with large samples, trivial imbalances appear "significant," while with small samples, real imbalances can be missed. SMDs are sample-size-independent and more informative.
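The SMD calculation is short enough to sketch directly (numpy assumed; the two simulated groups are illustrative):

```python
import numpy as np

def smd(x_treated, x_control):
    # Pooled standard deviation of the two groups
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd

rng = np.random.default_rng(2)
x_t = rng.normal(1.0, 1.0, 300)  # treated group, shifted mean
x_c = rng.normal(0.0, 1.0, 300)  # control group
imbalanced = smd(x_t, x_c)       # well above the 0.1 rule of thumb
```

In practice you would compute this for every covariate, before and after adjustment, and compare against the 0.1 threshold.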

Propensity score methods
Once you have estimated propensity scores, there are four main ways to use them.
Matching on propensity scores
Match each treated subject to one or more control subjects with similar propensity scores. Common approaches:
- Nearest neighbor matching: pair each treated subject with the control subject whose propensity score is closest
- Caliper matching: same as nearest neighbor, but only allow matches within a specified distance (e.g., 0.2 standard deviations of the logit propensity score)
- Optimal matching: minimize the total distance across all matched pairs simultaneously
After matching, you estimate the treatment effect by comparing outcomes between matched treated and control subjects. Matching typically discards unmatched controls (and sometimes unmatched treated subjects), which reduces sample size; the resulting estimate usually targets the ATT (average treatment effect on the treated).
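Greedy nearest-neighbor matching with an optional caliper can be sketched in a few lines (numpy assumed; the scores are illustrative):

```python
import numpy as np

def nn_match(ps_treated, ps_control, caliper=None):
    # Greedy 1:1 nearest-neighbor matching without replacement:
    # each treated subject takes the closest still-available control
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        dists = [abs(ps_control[j] - p) for j in available]
        k = int(np.argmin(dists))
        if caliper is None or dists[k] <= caliper:
            pairs.append((i, available[k]))
            available.pop(k)
    return pairs

ps_t = np.array([0.6, 0.4, 0.8])          # treated subjects' scores
ps_c = np.array([0.35, 0.62, 0.5, 0.81])  # control subjects' scores
pairs = nn_match(ps_t, ps_c)  # [(0, 1), (1, 0), (2, 3)]
```

Greedy matching depends on the order of treated subjects; optimal matching, as noted above, avoids that by minimizing total distance jointly.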
Stratification on propensity scores
Divide subjects into strata (often 5 to 10) based on propensity score quantiles. Within each stratum, treated and control subjects should have roughly similar covariate distributions. You then:
- Estimate the treatment effect within each stratum
- Combine stratum-specific estimates into an overall effect (typically a weighted average, with weights proportional to stratum size)
Stratification retains all subjects, but balance within strata is only approximate. Using more strata improves balance but reduces the sample size per stratum.
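A sketch of stratification on propensity score quintiles (numpy assumed; the data are simulated with a true effect of 2 purely for illustration):

```python
import numpy as np

def stratified_effect(ps, treat, outcome, n_strata=5):
    # Stratum boundaries at propensity score quantiles
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, n_strata - 1)
    effects, sizes = [], []
    for s in range(n_strata):
        m = strata == s
        t, c = m & (treat == 1), m & (treat == 0)
        if t.sum() == 0 or c.sum() == 0:
            continue  # skip strata missing one of the groups
        effects.append(outcome[t].mean() - outcome[c].mean())
        sizes.append(m.sum())
    # Overall effect: stratum estimates weighted by stratum size
    return float(np.average(effects, weights=sizes))

# Illustrative simulation: true effect of 2, assignment driven by x
rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

est = stratified_effect(ps, treat, outcome)
```

With five strata the estimate lands near the true effect but retains some residual within-stratum confounding, consistent with the "balance within strata is only approximate" caveat.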
Inverse probability weighting (IPW)
IPW reweights subjects to create a pseudo-population where treatment is independent of observed covariates. The weights are:
- Treated subjects: w_i = 1 / e(X_i)
- Control subjects: w_i = 1 / (1 − e(X_i))
These weights target the ATE (average treatment effect). If you want the ATT instead, use weights of 1 for treated subjects and e(X_i) / (1 − e(X_i)) for controls.
IPW retains all subjects and is computationally simple, but it's sensitive to extreme propensity scores. A subject with e(X_i) = 0.01 gets a weight of 100, which can make estimates unstable. Stabilized weights (multiplying by the marginal probability of treatment) and weight trimming help address this.
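A sketch of a normalized (Hajek-style) IPW estimator with score clipping (numpy assumed; the data are simulated with a true effect of 2, and the clipping threshold is an illustrative choice):

```python
import numpy as np

def ipw_ate(ps, treat, outcome, eps=0.01):
    ps = np.clip(ps, eps, 1 - eps)   # guard against extreme weights
    w1 = treat / ps                  # ATE weights for treated subjects
    w0 = (1 - treat) / (1 - ps)      # ATE weights for controls
    mu1 = np.sum(w1 * outcome) / np.sum(w1)
    mu0 = np.sum(w0 * outcome) / np.sum(w0)
    return float(mu1 - mu0)

# Illustrative simulation: true effect of 2, confounded by x
rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

naive = outcome[treat == 1].mean() - outcome[treat == 0].mean()  # biased upward
est = ipw_ate(ps, treat, outcome)  # approximately recovers the true effect
```

Normalizing the weights within each group (dividing by their sum) is itself a mild stabilization compared with the raw Horvitz-Thompson form.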
Propensity score as a covariate
Include the estimated propensity score directly as a covariate in a regression model for the outcome. This is the simplest approach, but it relies heavily on correctly specifying the outcome model and tends to be less effective at reducing bias than matching or weighting. It's generally considered the weakest of the four methods.
Advantages vs. disadvantages of propensity score methods
Advantages:
- Reduce a high-dimensional covariate set to a single balancing score
- Allow transparent, pre-outcome assessment of covariate balance (you can check balance before ever looking at the outcome)
- Separate the "design" phase (achieving balance) from the "analysis" phase (estimating the effect), reducing reliance on outcome model assumptions
Disadvantages:
- Only balance on observed covariates; unmeasured confounding remains a threat
- Require correct specification of the propensity score model
- Can reduce effective sample size (matching may discard subjects; extreme weights in IPW inflate variance)

Propensity scores vs. regression adjustment
Regression adjustment models the outcome directly as a function of treatment and covariates. Propensity score methods instead model the treatment assignment mechanism. Key differences:
- Propensity scores let you assess balance before analyzing outcomes. With regression, you're adjusting and estimating simultaneously, so you can't easily tell whether the adjustment "worked."
- Propensity score methods can be more robust when the outcome model is misspecified, since they don't require you to get the outcome-covariate relationship right.
- Regression adjustment can be more statistically efficient when the outcome model is correctly specified, because it uses information about the outcome-covariate relationship.
- The two approaches can be combined ("doubly robust" estimation): if either the propensity score model or the outcome model is correct, the treatment effect estimate is consistent.
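The doubly robust idea can be sketched as an augmented IPW (AIPW) estimator (numpy assumed; mu1 and mu0 stand for outcome-model predictions under treatment and control, and the simulation with a true effect of 2 is illustrative):

```python
import numpy as np

def aipw_ate(ps, treat, outcome, mu1, mu0):
    # Outcome-model predictions plus inverse-probability corrections
    a = mu1 + treat * (outcome - mu1) / ps
    b = mu0 + (1 - treat) * (outcome - mu0) / (1 - ps)
    return float(np.mean(a - b))

# Illustrative simulation: true effect of 2
rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

est_both = aipw_ate(ps, treat, outcome, mu1=2 + x, mu0=x)  # both models correct
est_bad_outcome = aipw_ate(ps, treat, outcome,             # outcome model wrong,
                           mu1=np.zeros(n), mu0=np.zeros(n))  # PS model correct
```

Even with a deliberately wrong outcome model, the estimate stays consistent because the propensity score model is correct; that asymmetry is the "doubly robust" property.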
Propensity scores with multiple treatments
When there are more than two treatment levels (e.g., three different drugs), you estimate a generalized propensity score: the probability of receiving each specific treatment level given observed covariates. For treatment level k:

e_k(X) = Pr(T = k | X)
These can be estimated using multinomial logistic regression or a series of binary models. Generalized propensity scores support matching, stratification, or weighting across multiple treatment groups and enable pairwise treatment comparisons or dose-response estimation.
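A sketch using multinomial logistic regression (scikit-learn assumed; the three-level assignment mechanism is simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 600
X = rng.normal(size=(n, 2))

# Simulated assignment to three treatment levels depending on X
logits = X @ np.array([[1.0, -1.0, 0.0],
                       [0.0, 1.0, -1.0]])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
T = np.array([rng.choice(3, p=p) for p in probs])

# gps[i, k] estimates Pr(T = k | X_i) for treatment level k
gps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)
```

Each row of the resulting matrix is a full set of generalized propensity scores for one subject, summing to 1 across treatment levels.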
Propensity scores with time-varying treatments
When treatment changes over time (e.g., a patient starts and stops medication across multiple visits), standard propensity scores don't suffice. Instead, you estimate time-varying propensity scores at each time point, conditional on the covariate and treatment history up to that point.
These are most commonly used with inverse probability of treatment weighting (IPTW) to handle time-varying confounding, especially when past treatment affects future confounders. This connects to marginal structural models, where the weights create a pseudo-population free of time-varying confounding.
Propensity score methods in practice
Selecting variables for propensity score estimation
Variable selection matters a lot. The guiding principles:
- Include all variables that are confounders (associated with both treatment and outcome)
- Include variables strongly associated with the outcome, even if weakly related to treatment. These improve precision without introducing bias.
- Exclude variables that are consequences of treatment (post-treatment variables), as conditioning on them can introduce bias
- Exclude instrumental variables (strongly predict treatment but have no direct effect on outcome), as including them can increase variance without reducing bias
- Use causal diagrams (DAGs) and subject matter knowledge to guide these decisions
Specifying the propensity score model
- Start with a simple model using main effects of all selected covariates
- Estimate propensity scores and check covariate balance
- If balance is inadequate, add interaction terms, quadratic terms, or other non-linear transformations
- Re-check balance after each modification
- Use cross-validation if employing machine learning methods to avoid overfitting
The iterative nature of this process is a feature, not a bug. You keep refining until balance is satisfactory.
Checking propensity score overlap
Overlap (also called positivity or common support) means that for every combination of covariate values, there's a non-zero probability of receiving either treatment or control. In practice:
- Plot propensity score distributions for both groups and look for regions where they overlap
- If some subjects have propensity scores very near 0 or 1, they have near-certain treatment assignment, making causal comparisons unreliable in that region
- Consider trimming subjects with extreme scores (e.g., dropping those with scores below 0.05 or above 0.95) to improve overlap and reduce sensitivity to model misspecification
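Trimming reduces to a boolean mask on the estimated scores; a sketch (numpy assumed; the 0.05/0.95 cutoffs follow the rule of thumb above, and the score values are illustrative):

```python
import numpy as np

def overlap_mask(ps, lo=0.05, hi=0.95):
    # True for subjects whose estimated score lies inside [lo, hi]
    return (ps >= lo) & (ps <= hi)

ps = np.array([0.01, 0.20, 0.50, 0.97, 0.90])
keep = overlap_mask(ps)  # drops the 0.01 and 0.97 subjects
```

You would then apply this mask to the treatment, outcome, and covariate arrays before matching or weighting.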
Sensitivity analysis for propensity score methods
Because propensity scores can't address unmeasured confounding, sensitivity analysis is critical:
- Compare results across different propensity score methods (matching, stratification, IPW) to see if conclusions are robust
- Vary the propensity score model specification (different covariates, functional forms) and check stability of estimates
- Use formal sensitivity analysis frameworks (e.g., Rosenbaum bounds) to quantify how strong unmeasured confounding would need to be to change your conclusions
- Report the range of estimates under different assumptions to honestly convey uncertainty