📊Causal Inference Unit 3 Review

3.5 Analysis of randomized experiments

Written by the Fiveable Content Team • Last updated August 2025

Fundamentals of Randomized Experiments

Randomized experiments are the gold standard for causal inference because they let you isolate the effect of a treatment from everything else. By randomly assigning units to treatment and control groups, you make those groups comparable on average across every characteristic, whether you measured it or not.

A well-designed experiment requires four things: a clearly defined population of interest, a precisely specified treatment, an adequate sample size, and a valid randomization procedure for assigning units to conditions.

Benefits of Randomization

Why does randomization matter so much? Because it balances both observed and unobserved confounders between groups on average. No observational method can do that.

  • Creates comparable groups, enabling unbiased estimation of causal effects
  • Provides a basis for statistical inference: you can quantify the uncertainty around your treatment effect estimates using known probability distributions
  • Strengthens the credibility of findings by minimizing bias and confounding, which is why randomized experiments carry more weight than observational studies in most policy and scientific debates

Estimating Causal Effects

Average Treatment Effect (ATE)

The average treatment effect (ATE) is the most common estimand in a randomized experiment. It captures the average difference in outcomes if everyone were treated versus if no one were treated:

ATE = E[Y(1)] - E[Y(0)]

Here, Y(1) and Y(0) are the potential outcomes under treatment and control. You never observe both for the same unit (the fundamental problem of causal inference), but randomization lets you estimate the ATE by comparing group averages.

You can estimate the ATE using a simple difference in sample means or through regression analysis, both covered below.

Intention-to-Treat (ITT) Analysis

Intention-to-treat (ITT) analysis estimates the effect of being assigned to treatment, regardless of whether units actually received or complied with it.

  • Preserves the benefits of randomization because you analyze everyone according to their original assignment
  • Yields a conservative estimate of the treatment effect, since noncompliers dilute the measured impact
  • Reflects real-world effectiveness: in practice, not everyone follows through on an assigned treatment, so the ITT tells you what to expect from a policy of offering the treatment

Validity in Randomized Experiments

Internal vs. External Validity

These two concepts address different questions about your experiment's conclusions.

Internal validity asks: Did the treatment actually cause the observed outcome difference within this study? Randomization directly strengthens internal validity by eliminating confounding.

External validity asks: Do these findings generalize to other populations, settings, or times? A perfectly randomized experiment can still have limited external validity if the sample is unrepresentative or the study conditions are artificial. For example, a drug trial conducted only on young, healthy volunteers may not generalize to elderly patients with comorbidities.

Analyzing Completely Randomized Designs

Difference in Means

In a completely randomized design, units are assigned to treatment or control without any stratification or matching. The simplest estimator for the ATE is the difference in sample means:

\hat{ATE} = \bar{Y}_1 - \bar{Y}_0

This estimator is unbiased under successful randomization. You construct confidence intervals and hypothesis tests using the standard error of the difference, which depends on the variance within each group and the sample sizes.
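Computed by hand, this estimator and a conservative (Neyman-style) standard error take only a few lines. The data below are made up for illustration:

```python
import numpy as np

def diff_in_means(y_treat, y_ctrl):
    """Difference-in-means estimate of the ATE, with a conservative
    standard error and a normal-approximation 95% CI."""
    y_treat = np.asarray(y_treat, dtype=float)
    y_ctrl = np.asarray(y_ctrl, dtype=float)
    ate_hat = y_treat.mean() - y_ctrl.mean()
    # SE combines the within-group variances scaled by the group sizes
    se = np.sqrt(y_treat.var(ddof=1) / len(y_treat)
                 + y_ctrl.var(ddof=1) / len(y_ctrl))
    ci = (ate_hat - 1.96 * se, ate_hat + 1.96 * se)
    return ate_hat, se, ci

# Hypothetical outcomes from a tiny experiment
treated = [5.1, 6.3, 4.8, 7.0, 5.9]
control = [3.2, 4.1, 2.9, 3.8, 4.4]
ate_hat, se, ci = diff_in_means(treated, control)   # ate_hat ≈ 2.14
```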

Regression Analysis

Regression offers a more flexible way to estimate the ATE. The basic model is:

Y_i = \beta_0 + \beta_1 T_i + \epsilon_i

Here T_i is a binary treatment indicator (1 = treated, 0 = control), and β_1 is the ATE. Without covariates, this gives you exactly the same answer as the difference in means.

Why bother with regression, then? Two reasons:

  • Precision gains. Including pre-treatment covariates (e.g., baseline scores) reduces residual variance, which shrinks your standard errors and tightens confidence intervals.
  • Heterogeneous effects. Adding interaction terms between T_i and covariates lets you estimate how the treatment effect varies across subgroups.
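A quick simulation makes both points concrete (numpy only; the data-generating process is invented, with a true ATE of 2). Regressing Y on a constant and T reproduces the difference in means exactly, and adding a predictive baseline covariate targets the same estimand while soaking up residual variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                  # pre-treatment covariate (baseline score)
t = rng.integers(0, 2, size=n)          # random assignment
y = 1.0 + 2.0 * t + 1.5 * x + rng.normal(size=n)   # true ATE = 2

# OLS of Y on [1, T]: the coefficient on T equals the difference in means
X_simple = np.column_stack([np.ones(n), t])
beta_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Adding the covariate estimates the same ATE with less residual variance
X_adj = np.column_stack([np.ones(n), t, x])
beta_adj, *_ = np.linalg.lstsq(X_adj, y, rcond=None)
```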

Analyzing Stratified and Matched Designs

Conditional Average Treatment Effect

Stratified and matched designs group units into strata or pairs based on observed covariates before randomization occurs within each group. This guarantees balance on those covariates.

The conditional average treatment effect (CATE) is the average treatment effect within a specific stratum. You estimate it by computing the difference in means (or running a regression) within each stratum separately.

To get the overall ATE, take a weighted average of the CATEs across strata, where the weights are the proportion of units in each stratum.
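A minimal sketch of that two-step recipe (within-stratum differences in means, then a stratum-share-weighted average), with made-up numbers:

```python
import numpy as np

def stratified_ate(y, t, strata):
    """Overall ATE: weighted average of within-stratum CATEs,
    weighted by each stratum's share of the sample."""
    y, t, strata = (np.asarray(a) for a in (y, t, strata))
    ate = 0.0
    for s in np.unique(strata):
        m = strata == s
        cate = y[m & (t == 1)].mean() - y[m & (t == 0)].mean()
        ate += m.mean() * cate          # m.mean() = share of units in stratum s
    return ate

# Two strata, randomized within each; the CATEs are 3 and 4, equal shares
y = [4, 6, 1, 3, 10, 12, 6, 8]
t = [1, 1, 0, 0, 1, 1, 0, 0]
strata = [0, 0, 0, 0, 1, 1, 1, 1]
ate = stratified_ate(y, t, strata)      # 0.5 * 3 + 0.5 * 4 = 3.5
```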

Regression with Strata Indicators

You can handle stratification in a single regression by including strata fixed effects:

Y_i = \beta_0 + \beta_1 T_i + \sum_{s=1}^{S-1} \gamma_s I_{is} + \epsilon_i

The I_{is} terms are indicator variables for each stratum s, and β_1 gives you the pooled treatment effect across strata (assuming a constant effect). If you want to allow the treatment effect to differ by stratum, interact T_i with the strata indicators.
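Sketched with simulated data (numpy only; the treatment effect is set to 2 and held constant across strata, so the pooled coefficient should recover it), the fixed-effects regression is just OLS on hand-built dummy columns:

```python
import numpy as np

rng = np.random.default_rng(1)
n, S = 600, 3
strata = rng.integers(0, S, size=n)                # stratum labels 0..S-1
t = rng.integers(0, 2, size=n)                     # random assignment
y = 2.0 * t + 1.0 * strata + rng.normal(size=n)    # constant effect of 2

# Design matrix: intercept, treatment, and S-1 stratum dummies
dummies = np.column_stack([(strata == s).astype(float) for s in range(1, S)])
X = np.column_stack([np.ones(n), t, dummies])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # beta[1] is the pooled effect
```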

Noncompliance in Experiments

One-Sided vs. Two-Sided Noncompliance

Noncompliance happens when units don't follow their assigned treatment condition.

  • One-sided noncompliance: Some units assigned to treatment don't take it, but no one in the control group receives the treatment. Common in trials where the treatment is optional (e.g., a training program people can skip).
  • Two-sided noncompliance: Units in both groups deviate. Some assigned to treatment don't take it, and some assigned to control obtain the treatment on their own.

Noncompliance is a problem because if the people who comply differ systematically from those who don't, a naive comparison of actual treatment recipients vs. non-recipients is confounded.

Complier Average Causal Effect (CACE)

The CACE (also called the local average treatment effect, or LATE) targets the treatment effect specifically among compliers: units who would take the treatment if assigned to it and would not take it if assigned to control.

CACE = E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0]

Here D(1) and D(0) are potential treatment-receipt indicators under assignment to treatment and control, respectively. Estimating the CACE requires two key assumptions beyond random assignment:

  • Exclusion restriction: Assignment affects outcomes only through its effect on actual treatment receipt.
  • Monotonicity: No unit is a "defier" who would take the treatment only when assigned to control and refuse it when assigned to treatment.

Instrumental Variables Estimation

Instrumental variables (IV) estimation uses the randomized assignment as an instrument to recover the CACE. The estimator is straightforward:

\hat{CACE} = \frac{\hat{ITT}_Y}{\hat{ITT}_D}

The numerator is the ITT effect on the outcome (reduced-form effect), and the denominator is the ITT effect on treatment receipt (the first-stage effect, i.e., the compliance rate difference between groups).

This works because dividing by the compliance rate "scales up" the diluted ITT to reflect the effect among those who actually responded to the assignment. The required assumptions are: random assignment, exclusion restriction, monotonicity, and SUTVA (stable unit treatment value assumption, meaning one unit's outcome isn't affected by another's assignment).
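The ratio estimator is simple enough to write directly. The toy data below are invented and show one-sided noncompliance with a compliance rate of 0.5, so the diluted ITT gets scaled up by a factor of two:

```python
import numpy as np

def wald_cace(y, d, z):
    """Wald / IV estimator: ITT effect on the outcome divided by the
    ITT effect on treatment receipt."""
    y, d, z = (np.asarray(a) for a in (y, d, z))
    itt_y = y[z == 1].mean() - y[z == 0].mean()    # reduced form
    itt_d = d[z == 1].mean() - d[z == 0].mean()    # first stage
    return itt_y / itt_d

z = np.array([1, 1, 1, 1, 0, 0, 0, 0])   # random assignment
d = np.array([1, 1, 0, 0, 0, 0, 0, 0])   # actual receipt (one-sided)
y = np.array([5, 7, 2, 4, 3, 3, 2, 4])   # outcomes
cace = wald_cace(y, d, z)                # ITT_Y = 1.5, ITT_D = 0.5, CACE = 3.0
```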

Dealing with Missing Data

Inverse Probability Weighting

Inverse probability weighting (IPW) handles missing outcome data by upweighting observed units that resemble the missing ones.

  1. Estimate the probability of being observed for each unit, typically using logistic regression with treatment assignment and baseline covariates as predictors.
  2. Assign each observed unit a weight equal to the inverse of its estimated probability of being observed.
  3. Compute the weighted difference in means (or weighted regression) using these weights.

The intuition: if a certain type of unit is underrepresented among the observed data because similar units tend to drop out, IPW gives those remaining units more influence to compensate. This produces unbiased estimates as long as the model for the observation probability is correctly specified.
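Here is a minimal sketch of those three steps. To stay self-contained it replaces the logistic regression in step 1 with observation rates computed within cells of a single binary covariate, which amounts to a fully saturated model for the observation probability; all data are made up:

```python
import numpy as np

def ipw_diff_in_means(y, t, x, observed):
    y, t, x, obs = (np.asarray(a) for a in (y, t, x, observed))
    # Step 1: P(observed) within each (treatment, covariate) cell,
    # standing in for the logistic regression described above.
    p = np.empty(len(y))
    for ti in (0, 1):
        for xi in np.unique(x):
            cell = (t == ti) & (x == xi)
            p[cell] = obs[cell].mean()
    w = 1.0 / p                          # Step 2: inverse-probability weights
    m = obs.astype(bool)
    # Step 3: weighted difference in means over the observed units
    mean1 = np.average(y[m & (t == 1)], weights=w[m & (t == 1)])
    mean0 = np.average(y[m & (t == 0)], weights=w[m & (t == 0)])
    return mean1 - mean0

# Units with x == 1 drop out at rate 0.5; their observed peers get weight 2
y = [10, 10, 20, 20, 20, 20, 5, 5, 15, 15, 15, 15]
t = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
x = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
observed = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
ate_ipw = ipw_diff_in_means(y, t, x, observed)   # recovers 100/6 - 70/6 = 5.0
```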


Multiple Imputation

Multiple imputation (MI) takes a different approach: instead of reweighting, it fills in the missing values.

  1. Fit a model to the observed data and draw multiple plausible values for each missing observation from the model's posterior predictive distribution.
  2. Create several complete datasets (typically 5-20), each with different imputed values.
  3. Analyze each dataset separately using your standard method.
  4. Combine the results across datasets using Rubin's rules, which pool the point estimates and adjust the standard errors to reflect both within-imputation and between-imputation variability.

MI assumes data are missing at random (MAR): the probability of missingness depends only on observed variables, not on the missing values themselves. This is a weaker assumption than "missing completely at random" but still requires careful thought about what drives attrition.
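A compact sketch of the four steps for the simplest possible target, a mean with missing outcomes. Step 1 here draws imputations from a normal distribution fit to the observed values; a full implementation would also propagate uncertainty in the fitted parameters and typically condition on covariates. Everything below is simulated:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Step 4: Rubin's rules for pooling M point estimates and variances."""
    estimates = np.asarray(estimates)
    variances = np.asarray(variances)
    M = len(estimates)
    qbar = estimates.mean()              # pooled point estimate
    within = variances.mean()            # average within-imputation variance
    between = estimates.var(ddof=1)      # between-imputation variance
    return qbar, np.sqrt(within + (1 + 1 / M) * between)

rng = np.random.default_rng(42)
y = rng.normal(10, 2, size=200)          # true mean is 10
missing = rng.random(200) < 0.3          # ~30% of outcomes lost to attrition
y_obs = y[~missing]

ests, vars_ = [], []
for _ in range(20):                      # Steps 1-2: 20 completed datasets
    y_imp = y.copy()
    y_imp[missing] = rng.normal(y_obs.mean(), y_obs.std(ddof=1), missing.sum())
    ests.append(y_imp.mean())            # Step 3: analyze each dataset
    vars_.append(y_imp.var(ddof=1) / len(y_imp))

est, se = rubin_pool(ests, vars_)
```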

Subgroup Analysis and Heterogeneity

Interaction Terms in Regression

To test whether a treatment works differently for different types of people, add an interaction term to your regression:

Y_i = \beta_0 + \beta_1 T_i + \beta_2 X_i + \beta_3 (T_i \times X_i) + \epsilon_i

Here X_i is a subgroup indicator (e.g., male vs. female), and β_3 captures the difference in treatment effects between subgroups. A statistically significant β_3 suggests treatment effect heterogeneity: the treatment doesn't affect everyone equally.

The treatment effect for the baseline subgroup (X_i = 0) is β_1, and for the other subgroup (X_i = 1) it's β_1 + β_3.
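A simulated check (numpy only; the data-generating process is invented, with subgroup effects of 1 and 3, so the interaction coefficient should be near 2):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1000
t = rng.integers(0, 2, size=n)          # treatment indicator
x = rng.integers(0, 2, size=n)          # subgroup indicator
# True treatment effect: 1.0 when x == 0, 3.0 when x == 1
y = 0.5 + 1.0 * t + 0.2 * x + 2.0 * t * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), t, x, t * x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
effect_x0 = beta[1]                     # effect in the baseline subgroup
effect_x1 = beta[1] + beta[3]           # effect in the other subgroup
```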

Dangers of Post-Hoc Subgroup Analysis

Subgroup analyses that weren't planned before data collection are risky:

  • Multiple comparisons problem. Testing many subgroups inflates the chance of finding a "significant" difference by chance. With 20 subgroups tested at α = 0.05, you'd expect about one false positive even if the treatment effect is truly constant.
  • Overfitting to noise. Post-hoc subgroups are often suggested by patterns in the data, which means you're partly "discovering" random variation.
  • Lower credibility. Reviewers and policymakers rightly treat post-hoc findings as exploratory, not confirmatory.

Pre-specified subgroup analyses (registered before data collection) are far more credible. Post-hoc findings should be replicated in independent studies before being taken seriously.

Statistical Power and Sample Size

Minimum Detectable Effect Size

Statistical power is the probability that your study will detect a true treatment effect when one exists. The minimum detectable effect size (MDES) is the smallest effect your study can reliably detect at a given significance level and power.

The MDES depends on:

  • Sample size (larger n → smaller MDES)
  • Outcome variance (less noisy outcomes → smaller MDES)
  • Significance level (α, the Type I error rate)
  • Desired power (typically 0.80 or higher; power = 1 - β, where β is the Type II error rate)
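For the common case of a two-arm trial with equal arms and a two-sided test, the standard normal-approximation formula ties these four ingredients together. A sketch using only the standard library:

```python
from statistics import NormalDist

def mdes(n_per_arm, sigma, alpha=0.05, power=0.80):
    """Minimum detectable effect, normal approximation:
    (z_{1-alpha/2} + z_{power}) * sigma * sqrt(2 / n_per_arm)."""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(power)) * sigma * (2 / n_per_arm) ** 0.5

# Quadrupling the sample halves the MDES: it scales as 1 / sqrt(n)
small = mdes(100, sigma=1.0)     # ≈ 0.40 standard deviations
large = mdes(400, sigma=1.0)     # ≈ 0.20
```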

Factors Influencing Power

Several design choices affect your power:

  • Increase sample size. The most direct way to boost power.
  • Reduce outcome variance. Including pre-treatment covariates in your analysis absorbs some outcome variability, effectively increasing power without adding participants.
  • Use stratified or matched designs. These reduce the variance of the treatment effect estimator compared to a completely randomized design, improving power.
  • Accept a higher significance level. Moving from α = 0.01 to α = 0.05 increases power but also increases the false positive rate, so this is a tradeoff.

Practical Considerations

Randomized experiments must follow core ethical principles:

  • Respect for persons: Participants must give informed consent, meaning they understand the procedures, risks, and benefits and voluntarily agree to participate.
  • Beneficence: The study should maximize potential benefits and minimize harm.
  • Justice: The burdens and benefits of research should be distributed fairly.

Equipoise is a prerequisite for ethical randomization: there must be genuine uncertainty about which treatment is better. If strong evidence already favors one arm, it's unethical to randomize people away from it. Additional scrutiny applies when working with vulnerable populations or when participants receive limited direct benefit.

Generalizability of Results

Even a well-executed experiment may not generalize beyond its specific context. Factors that limit external validity include:

  • Narrow inclusion/exclusion criteria that produce an unrepresentative sample
  • Artificial study settings that differ from real-world implementation
  • Specific populations or geographic contexts that may not reflect the broader target

Replication across different settings, populations, and implementation conditions is the strongest way to establish that a treatment effect is robust and generalizable. When designing an experiment, think carefully about who your target population is and whether your study sample and conditions reflect it.