📊 Causal Inference

Key Concepts of Quasi-Experimental Designs


Why This Matters

When randomized controlled trials aren't possible due to ethical constraints, cost, or practical limitations, quasi-experimental designs become your primary toolkit for establishing causality. You need to identify when each design is appropriate, what assumptions must hold, and how threats to validity differ across methods. These techniques are the workhorses behind policy evaluation, program assessment, and empirical research in economics, public health, and the social sciences.

Causal inference is fundamentally about ruling out alternative explanations. Each method addresses confounding in a different way: some exploit timing, others leverage cutoffs, and still others rely on external variation. Don't just memorize definitions. Know what identifying assumption each design requires and what would cause it to fail.


Designs That Exploit Timing

These methods leverage the structure of when interventions occur to separate causal effects from pre-existing trends. If you can observe the same units (or comparable units) before and after treatment, you can difference out confounding factors that don't change over time.

Difference-in-Differences (DiD)

DiD compares changes over time between a treatment group and a control group. You're not comparing levels at a single point; you're comparing the change in each group across periods. The treatment effect is the difference between those two changes.

  • Parallel trends assumption: In the absence of treatment, both groups would have followed the same trajectory. This is the critical identifying assumption, and it's what makes or breaks a DiD analysis.
  • Policy evaluation standard: DiD is ideal for assessing interventions like minimum wage changes or policy rollouts where randomization is impossible. Card and Krueger's 1994 minimum wage study is a classic example, comparing employment changes in New Jersey (which raised its minimum wage) to neighboring Pennsylvania.
  • Two-way fixed effects: In practice, DiD is often implemented with unit and time fixed effects in a regression framework, especially when treatment rolls out to different units at different times (staggered adoption). Recent econometrics literature has shown that staggered DiD with two-way fixed effects can produce biased estimates when treatment effects vary over time, so be aware of newer estimators (Callaway & Sant'Anna, Sun & Abraham) designed to address this.
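The 2×2 DiD logic can be sketched in a few lines. This is a minimal illustration on simulated data (the group means, trend, and true effect of 3.0 are invented for the example); parallel trends hold by construction, so the estimator should recover the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

pre_control  = 10 + rng.normal(0, 1, n)            # control group, pre-period
post_control = 12 + rng.normal(0, 1, n)            # control, post (common trend of +2)
pre_treat    = 15 + rng.normal(0, 1, n)            # treated, pre (level differs, that's fine)
post_treat   = 15 + 2 + 3.0 + rng.normal(0, 1, n)  # treated, post: trend + true effect 3.0

# DiD estimate: difference of the within-group changes
did = (post_treat.mean() - pre_treat.mean()) - (post_control.mean() - pre_control.mean())
print(round(did, 2))  # close to the true effect of 3.0
```

Note that the level gap between groups (15 vs. 10) never biases the estimate; only a violation of parallel trends would.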

Interrupted Time Series (ITS)

ITS analyzes trends across many time points before and after an intervention, looking for changes in the level or slope of the outcome series at the moment of intervention.

  • Single-group design: ITS doesn't require a control group, which makes it useful when no comparison group exists. You're essentially using the pre-intervention trend as the counterfactual.
  • Data demands: You need a sufficient number of pre- and post-intervention observations (a common recommendation is at least 8 time points on each side) to model the trend reliably.
  • History threat: Any other event occurring at the same time as the intervention can bias results. This is the primary threat to validity, since there's no control group to help rule out concurrent changes.
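ITS is usually implemented as a segmented regression: one trend line before the intervention, with terms for a level shift and a slope change after it. A minimal sketch on simulated data (the intervention time, +4 level shift, and +0.5 slope change are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(20)                          # 20 time points, intervention at t = 10
post = (t >= 10).astype(float)             # 1 after the intervention
t_since = np.where(t >= 10, t - 10, 0)     # time elapsed since intervention

# Simulated series: baseline 2.0, pre-trend 0.3, level jump +4.0, slope change +0.5
y = 2.0 + 0.3 * t + 4.0 * post + 0.5 * t_since + rng.normal(0, 0.3, 20)

# Design matrix: intercept, pre-trend, level change, slope change
X = np.column_stack([np.ones_like(t, dtype=float), t, post, t_since])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
level_change, slope_change = beta[2], beta[3]
print(level_change, slope_change)  # near 4.0 and 0.5 respectively
```

The counterfactual here is pure extrapolation of the pre-trend, which is exactly why a concurrent event (the history threat) is so dangerous.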

Compare: DiD vs. ITS. Both exploit timing, but DiD requires a control group to difference out time trends, while ITS relies on extrapolating pre-intervention trends. If you have a strong comparison group, DiD is preferred. If you have rich time-series data but no credible control group, ITS is your fallback.


Designs That Exploit Thresholds and Cutoffs

These methods identify causal effects by comparing units just above and below an arbitrary threshold. The logic: units very near the cutoff are essentially randomly assigned to treatment, creating local randomization.

Regression Discontinuity Design (RDD)

RDD exploits a cutoff rule where treatment is assigned based on whether a running variable (test score, age, income) falls above or below a threshold.

  • Sharp vs. fuzzy: In a sharp RDD, crossing the cutoff perfectly determines treatment (everyone above gets treated, everyone below doesn't). In a fuzzy RDD, crossing the cutoff only changes the probability of treatment, and you use the cutoff as an instrument. Fuzzy RDD is essentially an IV strategy applied at the threshold.
  • Local average treatment effect (LATE): Estimates are valid only for units near the cutoff, which limits external validity. You can't generalize the effect to units far from the threshold.
  • Manipulation threat: If units can precisely control their running variable to sort around the cutoff, the design fails. Always test for bunching (a density test, such as the McCrary test) to check whether units are piling up on one side.
  • Bandwidth choice: Results can be sensitive to how wide a window you use around the cutoff. Narrower bandwidths increase internal validity but reduce statistical power. Report results across multiple bandwidths to show robustness.
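A sharp RDD estimate is just the gap between two local fits at the cutoff. A minimal sketch on simulated data (running variable with cutoff at 0, true jump of 2.0, and a bandwidth of 0.3, all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
x = rng.uniform(-1, 1, n)                 # running variable, cutoff at 0
treated = (x >= 0).astype(float)          # sharp assignment rule
y = 1.0 + 1.5 * x + 2.0 * treated + rng.normal(0, 0.5, n)  # true discontinuity: 2.0

h = 0.3                                   # bandwidth (report several in practice)

def fit_at_cutoff(mask):
    """Local linear fit on one side; returns the fitted value at x = 0."""
    X = np.column_stack([np.ones(mask.sum()), x[mask]])
    b, *_ = np.linalg.lstsq(X, y[mask], rcond=None)
    return b[0]

left  = fit_at_cutoff((x < 0) & (x > -h))
right = fit_at_cutoff((x >= 0) & (x < h))
rdd_estimate = right - left
print(round(rdd_estimate, 2))  # close to the true jump of 2.0
```

Rerunning this with several values of `h` is the robustness check described above; a real analysis would also run a density (McCrary-style) test on `x` near the cutoff.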

Instrumental Variables (IV)

IV uses an instrument, a source of external variation that affects treatment assignment but has no direct effect on the outcome.

  • Two key conditions: The instrument must be (1) relevant (correlated with treatment, testable via the first-stage F-statistic) and (2) exogenous, meaning it satisfies the exclusion restriction (it affects the outcome only through its effect on treatment, not directly). The exclusion restriction cannot be directly tested and must be argued on substantive grounds.
  • Addresses endogeneity: IV solves problems of reverse causality and omitted variable bias when valid instruments exist. The classic example is using quarter of birth as an instrument for years of schooling (Angrist & Krueger, 1991), though even this instrument has been debated.
  • Weak instruments problem: If the instrument is only weakly correlated with treatment, IV estimates become unreliable and biased toward OLS. A first-stage F-statistic well above 10 (some recent guidance suggests above ~100 for certain tests) is the standard diagnostic.
  • LATE interpretation: With heterogeneous treatment effects, IV estimates the effect for compliers only: the subpopulation whose treatment status is actually changed by the instrument. This is another form of local average treatment effect.
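With a single binary instrument, the IV estimate reduces to the Wald ratio cov(z, y) / cov(z, d). The sketch below simulates an unobserved confounder so that OLS is biased upward while IV recovers the true effect (all parameter values are invented; exclusion holds by construction here, which is exactly what you cannot verify in real data):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
u = rng.normal(0, 1, n)                        # unobserved confounder
z = rng.integers(0, 2, n).astype(float)        # instrument: as-if random assignment
d = 0.5 * z + 0.8 * u + rng.normal(0, 1, n)    # treatment driven by z AND u (endogenous)
y = 1.0 * d + 1.0 * u + rng.normal(0, 1, n)    # true causal effect of d on y is 1.0

# Wald / 2SLS estimate with one instrument: cov(z, y) / cov(z, d)
iv = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

# Naive OLS slope, biased upward because u moves both d and y
ols = np.cov(d, y)[0, 1] / np.var(d)
print(round(iv, 2), round(ols, 2))  # IV near 1.0, OLS noticeably larger
```

Shrinking the 0.5 coefficient on `z` toward zero turns this into a weak-instruments example: the first stage collapses and the ratio becomes unstable.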

Compare: RDD vs. IV. Both provide causal estimates without randomization, but RDD exploits a known assignment rule while IV exploits external variation. RDD gives you a clear visual test (plot the discontinuity); IV validity is harder to verify, since the exclusion restriction cannot be directly tested. Note that fuzzy RDD is actually a special case of IV, where the instrument is the indicator for being above the cutoff.


Designs That Construct Comparison Groups

When natural comparison groups don't exist, these methods create them statistically. The goal is to balance observable characteristics between treated and control units to approximate what randomization would achieve.

Propensity Score Matching (PSM)

PSM matches treated and control units based on their propensity score, the estimated probability of receiving treatment given observed covariates. This collapses a high-dimensional covariate space into a single number.

  • Selection on observables (conditional independence): PSM assumes that after conditioning on the propensity score, treatment assignment is independent of potential outcomes. Put differently, there are no unobserved confounders. This is a strong and untestable assumption.
  • Common support requirement: Matching only works where treated and control units have overlapping propensity score distributions. If treated units have propensity scores of 0.7–0.9 but controls top out at 0.5, you can't match those treated units. Always check for overlap and trim or discard units outside the region of common support.
  • Balance checking: After matching, verify that covariates are balanced between groups using standardized mean differences. If balance isn't achieved, the propensity score model may be misspecified.
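The core matching step can be sketched with one confounder. To stay dependency-free, this example matches on the *true* propensity score (in practice you would estimate it, e.g. with logistic regression); all simulated values, including the true effect of 2.0, are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4000
x = rng.normal(0, 1, n)                        # confounder: raises both P(treat) and y
pscore = 1 / (1 + np.exp(-x))                  # true propensity score
d = rng.uniform(size=n) < pscore               # treatment assignment
y = 2.0 * d + 1.5 * x + rng.normal(0, 1, n)    # true treatment effect is 2.0

naive = y[d].mean() - y[~d].mean()             # biased: treated units have higher x

# For each treated unit, find the control with the nearest propensity score
controls = np.flatnonzero(~d)
nearest = np.abs(pscore[~d][None, :] - pscore[d][:, None]).argmin(axis=1)
att = (y[d] - y[controls[nearest]]).mean()     # ATT from matched pairs
print(round(naive, 2), round(att, 2))          # naive is inflated; ATT near 2.0
```

After this step, a real analysis would compute standardized mean differences on `x` across matched groups to verify balance.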

Matching Methods (General)

Beyond propensity scores, several matching techniques pair treated and control units directly on covariates.

  • Nearest neighbor matching finds the closest control unit(s) for each treated unit based on a distance metric. Caliper matching imposes a maximum allowable distance to prevent poor matches. Exact matching requires identical values on specified covariates, which works well with discrete variables but becomes infeasible with many continuous ones.
  • Coarsened exact matching (CEM) is a newer approach that temporarily coarsens continuous variables into bins and then matches exactly within those bins, offering a good balance between feasibility and match quality.
  • Fundamental limitation: Like PSM, all matching methods only control for observed characteristics. Hidden confounders remain a threat. This is the key difference from methods like IV or DiD, which can address certain types of unobserved confounding.
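CEM in particular is easy to sketch: coarsen the covariate into bins, compare treated and control means within each bin that has both kinds of units, and weight by treated counts for an ATT. The setup below is simulated (one confounder, 0.5-wide bins, true effect 1.5, all invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 6000
x = rng.normal(0, 1, n)                              # continuous confounder
d = rng.uniform(size=n) < 1 / (1 + np.exp(-x))       # treatment more likely at high x
y = 1.5 * d + 2.0 * x + rng.normal(0, 1, n)          # true treatment effect is 1.5

bins = np.digitize(x, np.linspace(-3, 3, 13))        # coarsen x into 0.5-wide bins
effects, weights = [], []
for b in np.unique(bins):
    in_bin = bins == b
    t, c = in_bin & d, in_bin & ~d
    if t.sum() > 0 and c.sum() > 0:                  # keep only bins with common support
        effects.append(y[t].mean() - y[c].mean())    # within-bin contrast
        weights.append(t.sum())                      # weight by treated count (ATT)
cem_att = np.average(effects, weights=weights)
print(round(cem_att, 2))  # near the true effect of 1.5
```

Bins containing only treated or only control units are dropped, which is CEM's built-in enforcement of common support.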

Synthetic Control Method

The synthetic control method constructs a weighted combination of untreated units to create a counterfactual that matches the treated unit's pre-treatment trajectory.

  • Ideal for single-unit studies: When only one state, country, or organization receives treatment, traditional methods with large samples don't apply. Synthetic control fills this gap. Abadie, Diamond, and Hainmueller's (2010) study of California's tobacco control program is the canonical example.
  • Donor pool quality: Control units must be unaffected by the treatment (no spillovers) and similar enough to the treated unit to form a credible synthetic match. A poor donor pool produces a poor counterfactual.
  • Inference through placebo tests: Since you have only one treated unit, standard statistical inference doesn't apply. Instead, you run the same analysis on each control unit as if it were treated (in-space placebos) and compare the treated unit's effect to this distribution of placebo effects.
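The mechanics can be sketched with non-negative least squares: fit donor weights on the pre-treatment period only, then read the effect off the post-period gaps. The donor trajectories, weights, and -3.0 effect below are invented for illustration, and NNLS plus renormalization is a deliberate simplification of the real method, which solves a constrained optimization (weights non-negative and summing to one, often matching covariates too):

```python
import numpy as np
from scipy.optimize import nnls

T_pre, T_post, J = 12, 5, 5
t = np.arange(T_pre + T_post, dtype=float)
# Smooth, distinct donor trajectories (illustrative, not real data)
donors = np.column_stack(
    [0.5 * t + 2 * j + 10 * np.sin(0.3 * (j + 1) * t) for j in range(J)]
)

rng = np.random.default_rng(6)
true_w = np.array([0.5, 0.3, 0.2, 0.0, 0.0])
treated = donors @ true_w + rng.normal(0, 0.05, T_pre + T_post)
treated[T_pre:] += -3.0                        # treatment effect kicks in after period 12

w, _ = nnls(donors[:T_pre], treated[:T_pre])   # fit weights on pre-period ONLY
w = w / w.sum()                                # renormalize so weights sum to one
gap = treated - donors @ w                     # treated minus its synthetic control
effect = gap[T_pre:].mean()
print(round(effect, 1))                        # near the true effect of -3.0
```

For inference, you would rerun this loop treating each donor as if it were the treated unit and compare the real gap to that placebo distribution.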

Compare: PSM vs. Synthetic Control. Both construct comparison groups, but PSM matches individual units one-to-one (or one-to-many), while synthetic control creates a weighted composite of multiple units. Use PSM when you have many treated and control units; use synthetic control when you're studying a single case, like one state's policy change.


Designs That Exploit Natural Variation

These approaches leverage real-world events that create quasi-random variation in treatment exposure. Nature or policy inadvertently runs the experiment for you.

Natural Experiments

A natural experiment occurs when some exogenous shock, like a lottery, a natural disaster, or an unexpected policy change, creates variation in treatment that is plausibly unrelated to potential outcomes.

  • "As-if random" variation: The claim is that the event affected some groups but not others in a way that mimics random assignment. The Vietnam draft lottery is a textbook example: draft eligibility was determined by birthday, creating random variation in military service.
  • Credibility depends on context: You must argue convincingly that the variation is unrelated to potential outcomes. This is often contested, and the strength of a natural experiment hinges on how persuasive this argument is.
  • Foundation for other methods: Natural experiments often provide the instruments for IV analyses or the treatment variation for DiD designs. The natural experiment is the source of variation; the econometric method is how you formalize the analysis.

Fixed Effects Models

Fixed effects models control for time-invariant unobserved confounders by examining only within-unit variation over time. If a confounder doesn't change across periods (like a person's innate ability or a country's geography), fixed effects eliminate it.

  • Panel data requirement: You need repeated observations of the same units (individuals, firms, countries) across multiple time periods.
  • Entity and time fixed effects: Entity (unit) fixed effects absorb all stable differences between units. Time fixed effects absorb shocks common to all units in a given period. Including both is standard in many applications.
  • Key limitation: Fixed effects cannot address time-varying confounders, factors that change over time and correlate with both treatment and outcome. For example, if a state adopts a new policy and experiences an economic boom simultaneously, entity fixed effects won't separate those two influences.
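The within (fixed-effects) estimator amounts to demeaning each unit's data and running OLS on the deviations. A minimal sketch on simulated panel data, where a time-invariant trait drives both treatment and outcome (all values invented; true effect 1.0):

```python
import numpy as np

rng = np.random.default_rng(7)
N, T = 500, 6                                    # 500 units observed over 6 periods
a = rng.normal(0, 1, N)                          # time-invariant confounder (e.g. ability)
d = a[:, None] + rng.normal(0, 1, (N, T))        # treatment correlates with a
y = 1.0 * d + 2.0 * a[:, None] + rng.normal(0, 1, (N, T))  # true effect of d is 1.0

# Pooled OLS slope: biased upward because a moves both d and y
pooled = np.cov(d.ravel(), y.ravel())[0, 1] / np.var(d.ravel())

# Within transformation: demean by unit, wiping out anything constant over time
d_w = d - d.mean(axis=1, keepdims=True)
y_w = y - y.mean(axis=1, keepdims=True)
within = (d_w * y_w).sum() / (d_w ** 2).sum()    # fixed-effects estimate
print(round(pooled, 2), round(within, 2))        # pooled inflated; within near 1.0
```

If the confounder were instead time-varying (e.g. `a` drifting each period), the demeaning would no longer remove it, which is the key limitation noted above.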

Compare: Natural Experiments vs. Fixed Effects. Natural experiments identify causal effects through external variation, while fixed effects control for stable confounders through within-unit comparisons. Natural experiments are about finding plausibly exogenous variation; fixed effects are about removing a specific class of unobservables. They're often used together: a natural experiment provides the variation, and fixed effects clean up remaining confounding.


Qualitative and Mixed Approaches

Not all causal inference is quantitative. These methods provide depth and context that statistical approaches may miss.

Comparative Case Studies

Comparative case studies involve in-depth analysis of a small number of cases to understand how and why causal processes operate, not just whether effects exist.

  • Process tracing: This technique follows the causal chain step-by-step, identifying mechanisms and ruling out alternative explanations by examining the sequence of events within a case.
  • Most-similar and most-different designs: Selecting cases that are alike in many respects but differ on the treatment (most-similar) or that differ widely but share the outcome (most-different) helps isolate the causal factor of interest.
  • Hypothesis generation: Case studies are particularly valuable early in research when theory is underdeveloped. Findings can motivate subsequent quantitative work by identifying plausible mechanisms to test at scale.

Compare: Comparative Case Studies vs. Quantitative Quasi-Experiments. Case studies prioritize mechanistic understanding within specific contexts, while quantitative methods prioritize estimating average effects across populations. Use case studies to understand why an effect occurs; use quantitative methods to estimate how large it is. The strongest research programs often combine both.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Exploits timing/trends | DiD, Interrupted Time Series |
| Exploits cutoffs/thresholds | RDD, IV |
| Constructs comparison groups | PSM, Matching Methods, Synthetic Control |
| Controls for unobservables | Fixed Effects, IV, DiD |
| Single-unit or small-N studies | Synthetic Control, Comparative Case Studies |
| Requires parallel trends | DiD |
| Requires valid instrument | IV |
| Selection on observables only | PSM, Matching Methods |

Self-Check Questions

  1. Both DiD and ITS exploit timing to identify causal effects. What is the key difference in their data requirements, and when would you choose one over the other?

  2. A researcher wants to estimate the effect of a scholarship program awarded to students scoring above 80 on an entrance exam. Which design is most appropriate, and what threat to validity should they test for?

  3. Compare propensity score matching and instrumental variables: which assumption is stronger, and why might a researcher prefer IV despite its stricter requirements?

  4. You're asked to evaluate a smoking ban implemented in one state. You have data on multiple states over 10 years. Which two methods could you use, and what are the tradeoffs between them?

  5. FRQ-style: A policy analyst claims that fixed effects models "solve" the problem of omitted variable bias. Explain why this claim is only partially correct, identifying what types of confounders fixed effects can and cannot address.