Score-based algorithms overview
Score-based algorithms estimate treatment effects by balancing observed covariates between treatment and control groups. They aim to mimic a randomized controlled trial by creating pseudo-populations where treatment assignment is independent of observed covariates. The three main approaches are propensity score matching, inverse probability weighting, and doubly robust estimation.
Each method uses a different strategy to handle confounding, but they all rely on the same core assumption: that you've measured all the relevant confounders. Understanding when and how to apply each one, and how to check whether it's working, is what this section covers.
Propensity score matching
Propensity score calculation
The propensity score is the probability of receiving treatment given the observed covariates: e(x) = P(T = 1 | X = x), where T is the binary treatment indicator and X is the vector of observed covariates.
You typically estimate this with logistic regression, where treatment assignment is the dependent variable and the covariates are predictors. The key idea is dimensionality reduction: instead of trying to match on dozens of covariates simultaneously, you collapse them into a single number (the propensity score) and match on that.
Once you have estimated propensity scores, you pair treated and control units that have similar scores. The goal is a matched sample where the distribution of observed covariates looks roughly the same across both groups.
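As a minimal sketch of the estimation step, here is logistic-regression-based propensity scoring on synthetic data (the data-generating process and scikit-learn model choice are illustrative, not prescriptive):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic example: two covariates influence treatment assignment.
n = 1000
X = rng.normal(size=(n, 2))
logits = 0.8 * X[:, 0] - 0.5 * X[:, 1]
T = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Fit logistic regression of treatment on covariates; the predicted
# probability of T=1 is the estimated propensity score e(x).
ps_model = LogisticRegression().fit(X, T)
e_hat = ps_model.predict_proba(X)[:, 1]

print(e_hat[:3])  # one score per unit, each in (0, 1)
```

Each unit's dozens of covariates are now summarized by the single number `e_hat`, which is what gets matched on.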
Matching methods
- Nearest neighbor matching pairs each treated unit with the control unit whose propensity score is closest (1:1 matching). Simple and widely used, but can produce poor matches when the nearest neighbor is still far away.
- Caliper matching adds a constraint: matched pairs must have propensity scores within a specified maximum distance (the caliper width). This prevents bad matches at the cost of potentially discarding some treated units.
- Matching with replacement allows a single control unit to serve as the match for multiple treated units. This improves match quality, especially when good controls are scarce, but means some control units carry more weight in the analysis.
- Optimal matching minimizes the total distance across all matched pairs simultaneously, rather than greedily pairing one unit at a time. It tends to produce better overall balance than nearest neighbor matching.
Advantages vs limitations
Propensity score matching is intuitive and produces a matched dataset that's easy to analyze and interpret. It can effectively reduce confounding bias when the relevant covariates are measured.
The downsides are real, though. Matching often discards unmatched units, shrinking your sample and reducing statistical power. If there's limited overlap in propensity scores between groups (poor common support), you may lose a large fraction of your data. Most critically, matching cannot fix bias from unmeasured confounders. The entire approach rests on the no unmeasured confounding assumption.
Inverse probability weighting
Propensity score weighting
Inverse probability weighting (IPW) takes a different approach than matching. Instead of pairing units, it reweights the entire sample so that the observed covariate distributions are balanced across treatment groups.
Each unit receives a weight based on the inverse of its probability of receiving the treatment it actually got: w = 1 / e(X) for treated units and w = 1 / (1 - e(X)) for control units.
The intuition: units that received an unlikely treatment assignment (e.g., a treated unit with a low propensity score) get upweighted because they carry more information about what would have happened under the other condition. The result is a weighted pseudo-population where covariates are balanced.
Stabilized vs unstabilized weights
Unstabilized weights can become very large when propensity scores are close to 0 or 1. A single unit with a propensity score of 0.01 gets a weight of 100, which inflates variance and makes estimates unstable.
Stabilized weights address this by multiplying by the marginal probability of treatment: sw = P(T = 1) / e(X) for treated units and sw = P(T = 0) / (1 - e(X)) for controls.
Stabilized weights have a mean closer to 1 and lower variance, which translates to more efficient and stable estimates. In practice, you should almost always prefer stabilized weights.
Marginal structural models
Marginal structural models (MSMs) extend IPW to settings with time-varying treatments and time-varying confounders. This is the scenario where standard regression adjustment breaks down: if a confounder at time t is affected by treatment at time t - 1, conditioning on it introduces bias, but ignoring it also introduces bias.
MSMs solve this by modeling the counterfactual outcome as a function of the full treatment history, using IPW to adjust for the time-varying confounding at each time point. This makes them especially useful in longitudinal studies (e.g., estimating the effect of sustained medication use over time).
Doubly robust estimation
Outcome regression
Outcome regression directly models the relationship between covariates and the outcome, separately for treated and control groups. You fit a model (linear regression for continuous outcomes, logistic for binary), then use it to predict each unit's potential outcomes under both treatment conditions.
The treatment effect estimate comes from comparing these predicted potential outcomes. The weakness: if your outcome model is misspecified, your estimates will be biased.
Propensity score modeling
In the doubly robust framework, you also estimate propensity scores (typically via logistic regression), just as you would for IPW. These scores are used to construct weights or to adjust the outcome model for the treatment assignment mechanism.
Combining outcome and propensity models
The doubly robust estimator combines both models. The critical property is this: the estimator is consistent if either the outcome model or the propensity score model is correctly specified. You don't need both to be right, just one.
This "double protection" is why the method is attractive. In practice, of course, having both models approximately correct yields the best performance. Doubly robust estimators also tend to be more efficient than IPW alone when the outcome model is reasonably well-specified.
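One common doubly robust estimator is AIPW (augmented IPW). As a sketch on synthetic data with a known true effect of 2.0, using scikit-learn for both component models (all names and the data-generating process are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(2)
n = 5000
X = rng.normal(size=(n, 2))

# Data-generating process with a true average treatment effect of 2.0.
e_true = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))
T = rng.binomial(1, e_true)
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

# Model 1: propensity scores (clipped away from 0 and 1 for stability).
e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.01, 0.99)

# Model 2: outcome regressions fit separately in each treatment group.
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW: outcome-model predictions plus an IPW correction for residuals.
ate = np.mean(
    mu1 - mu0
    + T * (Y - mu1) / e_hat
    - (1 - T) * (Y - mu0) / (1 - e_hat)
)
print(ate)  # close to the true effect of 2.0
```

The IPW correction terms vanish in expectation when the outcome model is right, and the outcome-model terms cancel the propensity errors when the propensity model is right, which is where the "either one suffices" property comes from.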
Assessing covariate balance
After applying any score-based method, you need to verify that it actually worked. Balance checking is not optional.
Standardized mean differences
The standardized mean difference (SMD) for a covariate is the difference in group means divided by the pooled standard deviation. SMDs are scale-free, so you can compare balance across covariates measured in different units.
Rules of thumb:
- SMD < 0.1: good balance
- SMD between 0.1 and 0.2: borderline
- SMD > 0.2: meaningful imbalance that likely needs to be addressed
Always check SMDs after matching or weighting, not just before.
Variance ratios
Variance ratios compare the spread of a covariate across groups. A ratio near 1 means the variability is similar; ratios far from 1 indicate that one group has a much wider or narrower distribution. This matters most for continuous covariates, where two groups could have identical means but very different distributions.
Graphical diagnostics
Plots like side-by-side boxplots, overlaid density curves, or quantile-quantile plots let you visually inspect the full distribution of each covariate across groups. These are especially valuable for catching non-linear imbalances (e.g., different distributional shapes) that summary statistics like SMDs would miss.
Sensitivity analysis
Unobserved confounding
Every score-based method assumes no unmeasured confounding. Since this assumption is untestable, sensitivity analysis asks: how strong would an unobserved confounder need to be to change our conclusions?
If the answer is "implausibly strong," you can be more confident in your results. If even a modest confounder could flip the findings, the evidence is fragile.

Rosenbaum bounds
Rosenbaum bounds are the standard sensitivity analysis tool for matched studies. The approach works by introducing a parameter Γ that represents the degree of hidden bias: the maximum factor by which two units with identical observed covariates may differ in their odds of receiving treatment. At Γ = 1, there's no hidden bias. As Γ increases, you're allowing for stronger unmeasured confounding.
You then ask: at what value of Γ does the treatment effect become statistically insignificant? A study that remains significant at large values of Γ is more robust than one that breaks down just above Γ = 1.
Simulation-based approaches
An alternative is to directly simulate hypothetical unobserved confounders. You specify a plausible confounder's relationship with both treatment and outcome, then re-estimate the treatment effect including this simulated variable. By varying the confounder's strength and prevalence, you map out how sensitive your results are across different confounding scenarios.
This approach is more flexible than Rosenbaum bounds and can be applied to any score-based method, not just matched designs.
Extensions and variations
Matching with replacement
Allowing control units to be reused across multiple matches improves match quality when good controls are scarce. The trade-off is that your effective sample size shrinks (since some controls are counted multiple times), and you need to use variance estimators that account for the non-independence this creates.
Caliper matching
The caliper width controls the bias-variance trade-off. A narrow caliper ensures high-quality matches but may discard many units, reducing power and potentially limiting generalizability. A common starting point is a caliper of 0.2 standard deviations of the logit of the propensity score, though this should be tuned based on the data.
Coarsened exact matching
Coarsened exact matching (CEM) takes a fundamentally different approach. Instead of estimating propensity scores, you coarsen each covariate into bins (e.g., age groups instead of exact ages), then require exact matches on the coarsened values.
CEM guarantees balance on the coarsened covariates by construction, handles both continuous and categorical variables, and avoids model dependence in the matching step. The downside is that fine-grained coarsening with many covariates can leave few or no exact matches, especially in smaller samples.
Practical considerations
Sample size and overlap
Score-based methods need adequate sample size and sufficient overlap (common support) in propensity score distributions between groups. Check overlap by plotting the propensity score distributions for treated and control units. If there are regions where one group has scores but the other doesn't, units in those regions can't be reliably compared.
When overlap is poor, consider trimming the sample to the region of common support, or switching to methods like CEM that handle limited overlap differently.
Missing data handling
Missing covariate data complicates propensity score estimation. Your approach should depend on the missing data mechanism:
- Missing completely at random (MCAR): Complete case analysis is unbiased but wasteful.
- Missing at random (MAR): Multiple imputation is generally preferred. Estimate propensity scores within each imputed dataset, then combine results using Rubin's rules.
- Missing not at random (MNAR): No standard fix exists. Sensitivity analysis is essential.
Using missing data indicators as covariates is sometimes done in practice, but this approach can introduce bias and should be used cautiously.
Software implementations
Several well-maintained packages support these methods:
- R: MatchIt (matching), WeightIt (weighting), twang (generalized boosted models for propensity scores), CBPS (covariate balancing propensity scores)
- Stata: teffects (built-in treatment effects), psmatch2 (propensity score matching), ipw (inverse probability weighting)
- Python: DoWhy (causal inference framework), causalinference (propensity score methods)
Each package has different defaults and capabilities, so check the documentation for your specific use case rather than relying on default settings.