Definition of propensity scores
The propensity score is the probability that a subject receives treatment, conditional on their observed baseline characteristics. In notation, for subject i with covariates X_i:

e(X_i) = Pr(T_i = 1 | X_i)

where T_i = 1 indicates treatment and T_i = 0 indicates control.
In a randomized experiment, every subject has a known probability of being assigned to treatment. Observational studies lack this design, so the propensity score estimates that probability from the data. The core idea comes from Rosenbaum and Rubin (1983): if two subjects have the same propensity score, their observed covariates are, on average, balanced across treatment groups. This means you can condition on a single scalar (the propensity score) instead of trying to match or adjust on many covariates simultaneously.
Propensity scores only balance on observed covariates. They cannot fix confounding from variables you didn't measure.
Estimating propensity scores
Logistic regression for propensity score estimation
The most common approach is logistic regression with treatment assignment as the outcome and observed covariates as predictors:

logit(Pr(T_i = 1 | X_i)) = β0 + β1 X_i1 + ... + βp X_ip
The predicted probabilities from this model are your estimated propensity scores. Logistic regression is straightforward to implement and lets you inspect how each covariate relates to treatment assignment through the estimated coefficients.
One thing to keep in mind: the goal here is prediction of treatment assignment, not inference about individual coefficients. You're not testing whether a covariate "significantly" predicts treatment. You just want the best possible estimate of each subject's treatment probability.
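As a minimal sketch (assuming numpy and scikit-learn are available; the covariates, sample size, and simulated assignment mechanism are all illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Illustrative baseline covariates
age = rng.normal(50, 10, n)
severity = rng.normal(0, 1, n)
X = np.column_stack([age, severity])

# Simulated treatment assignment that depends on the covariates
logit = -0.05 * (age - 50) + 0.8 * severity
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Logistic regression with treatment as the outcome; the fitted
# probabilities are the estimated propensity scores
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
```

Note that the outcome variable never appears: the propensity score model only sees treatment and covariates.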
Machine learning methods for propensity score estimation
Methods like random forests, gradient boosting (e.g., GBM, XGBoost), and other flexible algorithms can also estimate propensity scores. These approaches:
- Capture non-linear relationships and interactions automatically
- Can handle high-dimensional covariate spaces
- May outperform logistic regression when the true relationship between covariates and treatment is complex
The tradeoff is reduced interpretability and the risk of overfitting, which can produce extreme propensity scores near 0 or 1. Regardless of the estimation method, what ultimately matters is whether the resulting scores achieve good covariate balance.
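A sketch of the same idea with gradient boosting (scikit-learn assumed; the non-linear assignment mechanism is simulated for illustration):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 1000
X = rng.normal(size=(n, 3))

# Simulated assignment with an interaction and a quadratic term,
# which a main-effects logistic model would miss
logit = X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2 - 0.5
treat = rng.binomial(1, 1 / (1 + np.exp(-logit)))

gbm = GradientBoostingClassifier(n_estimators=100, max_depth=2, random_state=0)
ps = gbm.fit(X, treat).predict_proba(X)[:, 1]

# Flexible learners can overfit toward 0 or 1, so inspect the tails
share_extreme = np.mean((ps < 0.05) | (ps > 0.95))
```

Checking the share of extreme scores, as in the last line, is one quick diagnostic for the overfitting risk mentioned above.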
Assessing propensity score balance
Estimating propensity scores is not the end goal. You need to verify that conditioning on them actually balances covariates between groups.
Visual assessment of propensity score balance
- Plot the distribution of estimated propensity scores for treated and control groups using histograms, density plots, or box plots
- Look for overlap (common support): both groups should have propensity scores across a similar range. Regions where one group has scores but the other doesn't indicate positivity violations
- After applying your propensity score method (matching, weighting, etc.), re-plot covariate distributions to confirm they've become more similar
Statistical assessment of propensity score balance
The standardized mean difference (SMD) is the preferred balance metric. For a covariate X:

SMD = (x̄_treated − x̄_control) / √((s²_treated + s²_control) / 2)

where x̄ and s² denote each group's sample mean and variance of X.
Calculate SMDs before and after propensity score adjustment to quantify improvement. A common rule of thumb is that SMDs below 0.1 indicate adequate balance, though this threshold isn't absolute.
Avoid relying on p-values from t-tests or chi-square tests for balance checking. These tests are sensitive to sample size: with large samples, trivial imbalances appear "significant," while with small samples, real imbalances can be missed. SMDs are sample-size-independent and more informative.
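The SMD calculation is short enough to sketch directly (numpy assumed; the two simulated groups are illustrative):

```python
import numpy as np

def smd(x_treated, x_control):
    # Pooled standard deviation of the two groups
    pooled_sd = np.sqrt((np.var(x_treated, ddof=1) + np.var(x_control, ddof=1)) / 2)
    return (np.mean(x_treated) - np.mean(x_control)) / pooled_sd

rng = np.random.default_rng(2)
x_t = rng.normal(1.0, 1.0, 300)  # treated group, shifted mean
x_c = rng.normal(0.0, 1.0, 300)  # control group
imbalanced = smd(x_t, x_c)       # well above the 0.1 rule of thumb
```

In practice you would compute this for every covariate, before and after adjustment, and compare against the 0.1 threshold.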

Propensity score methods
Once you have estimated propensity scores, there are four main ways to use them.
Matching on propensity scores
Match each treated subject to one or more control subjects with similar propensity scores. Common approaches:
- Nearest neighbor matching: pair each treated subject with the control subject whose propensity score is closest
- Caliper matching: same as nearest neighbor, but only allow matches within a specified distance (e.g., 0.2 standard deviations of the logit propensity score)
- Optimal matching: minimize the total distance across all matched pairs simultaneously
After matching, you estimate the treatment effect by comparing outcomes between matched treated and control subjects. Matching typically discards unmatched controls (and sometimes unmatched treated subjects), which reduces sample size; the resulting estimate usually targets the ATT (average treatment effect on the treated).
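Greedy nearest-neighbor matching with an optional caliper can be sketched in a few lines (numpy assumed; the scores are illustrative):

```python
import numpy as np

def nn_match(ps_treated, ps_control, caliper=None):
    # Greedy 1:1 nearest-neighbor matching without replacement:
    # each treated subject takes the closest still-available control
    available = list(range(len(ps_control)))
    pairs = []
    for i, p in enumerate(ps_treated):
        dists = [abs(ps_control[j] - p) for j in available]
        k = int(np.argmin(dists))
        if caliper is None or dists[k] <= caliper:
            pairs.append((i, available[k]))
            available.pop(k)
    return pairs

ps_t = np.array([0.6, 0.4, 0.8])          # treated subjects' scores
ps_c = np.array([0.35, 0.62, 0.5, 0.81])  # control subjects' scores
pairs = nn_match(ps_t, ps_c)  # [(0, 1), (1, 0), (2, 3)]
```

Greedy matching depends on the order of treated subjects; optimal matching, as noted above, avoids that by minimizing total distance jointly.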
Stratification on propensity scores
Divide subjects into strata (often 5 to 10) based on propensity score quantiles. Within each stratum, treated and control subjects should have roughly similar covariate distributions. You then:
- Estimate the treatment effect within each stratum
- Combine stratum-specific estimates into an overall effect (typically a weighted average, with weights proportional to stratum size)
Stratification retains all subjects, but balance within strata is only approximate. Using more strata improves balance but reduces the sample size per stratum.
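A sketch of stratification on propensity score quintiles (numpy assumed; the data are simulated with a true effect of 2 purely for illustration):

```python
import numpy as np

def stratified_effect(ps, treat, outcome, n_strata=5):
    # Stratum boundaries at propensity score quantiles
    edges = np.quantile(ps, np.linspace(0, 1, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, ps, side="right") - 1, 0, n_strata - 1)
    effects, sizes = [], []
    for s in range(n_strata):
        m = strata == s
        t, c = m & (treat == 1), m & (treat == 0)
        if t.sum() == 0 or c.sum() == 0:
            continue  # skip strata missing one of the groups
        effects.append(outcome[t].mean() - outcome[c].mean())
        sizes.append(m.sum())
    # Overall effect: stratum estimates weighted by stratum size
    return float(np.average(effects, weights=sizes))

# Illustrative simulation: true effect of 2, assignment driven by x
rng = np.random.default_rng(3)
n = 4000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

est = stratified_effect(ps, treat, outcome)
```

With five strata the estimate lands near the true effect but retains some residual within-stratum confounding, consistent with the "balance within strata is only approximate" caveat.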
Inverse probability weighting (IPW)
IPW reweights subjects to create a pseudo-population where treatment is independent of observed covariates. The weights are:
- Treated subjects: w_i = 1 / e(X_i)
- Control subjects: w_i = 1 / (1 − e(X_i))
These weights target the ATE (average treatment effect). If you want the ATT instead, use weights of 1 for treated subjects and e(X_i) / (1 − e(X_i)) for controls.
IPW retains all subjects and is computationally simple, but it's sensitive to extreme propensity scores. A subject with e(X_i) = 0.01 gets a weight of 100, which can make estimates unstable. Stabilized weights (multiplying by the marginal probability of treatment) and weight trimming help address this.
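A sketch of a normalized (Hajek-style) IPW estimator with score clipping (numpy assumed; the data are simulated with a true effect of 2, and the clipping threshold is an illustrative choice):

```python
import numpy as np

def ipw_ate(ps, treat, outcome, eps=0.01):
    ps = np.clip(ps, eps, 1 - eps)   # guard against extreme weights
    w1 = treat / ps                  # ATE weights for treated subjects
    w0 = (1 - treat) / (1 - ps)      # ATE weights for controls
    mu1 = np.sum(w1 * outcome) / np.sum(w1)
    mu0 = np.sum(w0 * outcome) / np.sum(w0)
    return float(mu1 - mu0)

# Illustrative simulation: true effect of 2, confounded by x
rng = np.random.default_rng(4)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

naive = outcome[treat == 1].mean() - outcome[treat == 0].mean()  # biased upward
est = ipw_ate(ps, treat, outcome)  # approximately recovers the true effect
```

Normalizing the weights within each group (dividing by their sum) is itself a mild stabilization compared with the raw Horvitz-Thompson form.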
Propensity score as a covariate
Include the estimated propensity score directly as a covariate in a regression model for the outcome. This is the simplest approach, but it relies heavily on correctly specifying the outcome model and tends to be less effective at reducing bias than matching or weighting. It's generally considered the weakest of the four methods.
Advantages vs. disadvantages of propensity score methods
Advantages:
- Reduce a high-dimensional covariate set to a single balancing score
- Allow transparent, pre-outcome assessment of covariate balance (you can check balance before ever looking at the outcome)
- Separate the "design" phase (achieving balance) from the "analysis" phase (estimating the effect), reducing reliance on outcome model assumptions
Disadvantages:
- Only balance on observed covariates; unmeasured confounding remains a threat
- Require correct specification of the propensity score model
- Can reduce effective sample size (matching may discard subjects; extreme weights in IPW inflate variance)

Propensity scores vs. regression adjustment
Regression adjustment models the outcome directly as a function of treatment and covariates. Propensity score methods instead model the treatment assignment mechanism. Key differences:
- Propensity scores let you assess balance before analyzing outcomes. With regression, you're adjusting and estimating simultaneously, so you can't easily tell whether the adjustment "worked."
- Propensity score methods can be more robust when the outcome model is misspecified, since they don't require you to get the outcome-covariate relationship right.
- Regression adjustment can be more statistically efficient when the outcome model is correctly specified, because it uses information about the outcome-covariate relationship.
- The two approaches can be combined ("doubly robust" estimation): if either the propensity score model or the outcome model is correct, the treatment effect estimate is consistent.
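The doubly robust idea can be sketched as an augmented IPW (AIPW) estimator (numpy assumed; mu1 and mu0 stand for outcome-model predictions under treatment and control, and the simulation with a true effect of 2 is illustrative):

```python
import numpy as np

def aipw_ate(ps, treat, outcome, mu1, mu0):
    # Outcome-model predictions plus inverse-probability corrections
    a = mu1 + treat * (outcome - mu1) / ps
    b = mu0 + (1 - treat) * (outcome - mu0) / (1 - ps)
    return float(np.mean(a - b))

# Illustrative simulation: true effect of 2
rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)
ps = 1 / (1 + np.exp(-x))
treat = rng.binomial(1, ps)
outcome = 2 * treat + x + rng.normal(scale=0.5, size=n)

est_both = aipw_ate(ps, treat, outcome, mu1=2 + x, mu0=x)  # both models correct
est_bad_outcome = aipw_ate(ps, treat, outcome,             # outcome model wrong,
                           mu1=np.zeros(n), mu0=np.zeros(n))  # PS model correct
```

Even with a deliberately wrong outcome model, the estimate stays consistent because the propensity score model is correct; that asymmetry is the "doubly robust" property.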
Propensity scores with multiple treatments
When there are more than two treatment levels (e.g., three different drugs), you estimate a generalized propensity score: the probability of receiving each specific treatment level given observed covariates. For treatment level k:

e_k(X) = Pr(T = k | X)
These can be estimated using multinomial logistic regression or a series of binary models. Generalized propensity scores support matching, stratification, or weighting across multiple treatment groups and enable pairwise treatment comparisons or dose-response estimation.
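A sketch using multinomial logistic regression (scikit-learn assumed; the three-level assignment mechanism is simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 600
X = rng.normal(size=(n, 2))

# Simulated assignment to three treatment levels depending on X
logits = X @ np.array([[1.0, -1.0, 0.0],
                       [0.0, 1.0, -1.0]])
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
T = np.array([rng.choice(3, p=p) for p in probs])

# gps[i, k] estimates Pr(T = k | X_i) for treatment level k
gps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)
```

Each row of the resulting matrix is a full set of generalized propensity scores for one subject, summing to 1 across treatment levels.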
Propensity scores with time-varying treatments
When treatment changes over time (e.g., a patient starts and stops medication across multiple visits), standard propensity scores don't suffice. Instead, you estimate time-varying propensity scores at each time point, conditional on the covariate and treatment history up to that point.
These are most commonly used with inverse probability of treatment weighting (IPTW) to handle time-varying confounding, especially when past treatment affects future confounders. This connects to marginal structural models, where the weights create a pseudo-population free of time-varying confounding.
Propensity score methods in practice
Selecting variables for propensity score estimation
Variable selection matters a lot. The guiding principles:
- Include all variables that are confounders (associated with both treatment and outcome)
- Include variables strongly associated with the outcome, even if weakly related to treatment. These improve precision without introducing bias.
- Exclude variables that are consequences of treatment (post-treatment variables), as conditioning on them can introduce bias
- Exclude instrumental variables (strongly predict treatment but have no direct effect on outcome), as including them can increase variance without reducing bias
- Use causal diagrams (DAGs) and subject matter knowledge to guide these decisions
Specifying the propensity score model
- Start with a simple model using main effects of all selected covariates
- Estimate propensity scores and check covariate balance
- If balance is inadequate, add interaction terms, quadratic terms, or other non-linear transformations
- Re-check balance after each modification
- Use cross-validation if employing machine learning methods to avoid overfitting
The iterative nature of this process is a feature, not a bug. You keep refining until balance is satisfactory.
Checking propensity score overlap
Overlap (also called positivity or common support) means that for every combination of covariate values, there's a non-zero probability of receiving either treatment or control. In practice:
- Plot propensity score distributions for both groups and look for regions where they overlap
- If some subjects have propensity scores very near 0 or 1, they have near-certain treatment assignment, making causal comparisons unreliable in that region
- Consider trimming subjects with extreme scores (e.g., dropping those with scores below 0.05 or above 0.95) to improve overlap and reduce sensitivity to model misspecification
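Trimming reduces to a boolean mask on the estimated scores; a sketch (numpy assumed; the 0.05/0.95 cutoffs follow the rule of thumb above, and the score values are illustrative):

```python
import numpy as np

def overlap_mask(ps, lo=0.05, hi=0.95):
    # True for subjects whose estimated score lies inside [lo, hi]
    return (ps >= lo) & (ps <= hi)

ps = np.array([0.01, 0.20, 0.50, 0.97, 0.90])
keep = overlap_mask(ps)  # drops the 0.01 and 0.97 subjects
```

You would then apply this mask to the treatment, outcome, and covariate arrays before matching or weighting.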
Sensitivity analysis for propensity score methods
Because propensity scores can't address unmeasured confounding, sensitivity analysis is critical:
- Compare results across different propensity score methods (matching, stratification, IPW) to see if conclusions are robust
- Vary the propensity score model specification (different covariates, functional forms) and check stability of estimates
- Use formal sensitivity analysis frameworks (e.g., Rosenbaum bounds) to quantify how strong unmeasured confounding would need to be to change your conclusions
- Report the range of estimates under different assumptions to honestly convey uncertainty