📊Causal Inference Unit 6 Review


6.1 Instrumental variables (IV) assumptions

Written by the Fiveable Content Team • Last updated August 2025

Definition of Instrumental Variables

Instrumental variables (IVs) let you estimate causal effects when unmeasured confounding makes standard regression unreliable. The core idea: find a variable that nudges the treatment but has no other connection to the outcome. That variable becomes your instrument.

For an IV to work, four assumptions must hold: relevance, the exclusion restriction, exchangeability, and monotonicity. If any of these breaks down, your causal estimate can be biased or uninterpretable. The rest of this guide walks through each assumption, what happens when they fail, and how to think about choosing and interpreting IVs.

Relevance Assumption

Instrument Correlated with Treatment

The instrument must actually predict the treatment. If the instrument has no association with treatment assignment, it provides no useful variation to exploit, and the whole IV strategy falls apart.

The strength of this correlation directly affects the precision of your estimates. A weak instrument (only weakly correlated with treatment) inflates standard errors and can introduce finite-sample bias toward the confounded OLS estimate.

Strength of Instrument

You can assess instrument strength using the F-statistic from the first-stage regression (regressing the treatment on the instrument). The classic rule of thumb: an F-statistic above 10 suggests the instrument is strong enough, though more recent work by Stock and Yogo provides more nuanced thresholds.

The partial R² from that first-stage regression is another useful diagnostic. It tells you how much of the treatment variation the instrument explains after accounting for other covariates.
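As a sketch of these diagnostics, the snippet below simulates a binary instrument and computes the first-stage R² and F-statistic by hand (the data-generating values are illustrative assumptions, not from this guide):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: Z is a binary instrument, X is the treatment
Z = rng.binomial(1, 0.5, n)
X = 0.4 * Z + rng.normal(size=n)   # assumed first-stage coefficient of 0.4

# First-stage regression of X on Z (with intercept)
Zmat = np.column_stack([np.ones(n), Z])
beta, *_ = np.linalg.lstsq(Zmat, X, rcond=None)
resid = X - Zmat @ beta

# R^2 and the F-statistic for a single instrument (1 and n-2 df)
rss = resid @ resid
tss = ((X - X.mean()) ** 2).sum()
r2 = 1 - rss / tss
f_stat = (r2 / 1) / ((1 - r2) / (n - 2))
print(f"first-stage R^2 = {r2:.3f}, F = {f_stat:.1f}")
```

With a single instrument, this F-statistic is just the squared first-stage t-statistic; comparing it to the rule-of-thumb threshold of 10 is the usual quick check.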

With weak instruments, you'll see:

  • Large standard errors and wide confidence intervals
  • IV estimates that are biased toward the OLS estimate in finite samples
  • Unreliable inference (confidence intervals with poor coverage)

Exclusion Restriction

Instrument Affects Outcome Only Through Treatment

The exclusion restriction states that the instrument Z affects the outcome Y only through the treatment X. In DAG terms, there should be no direct arrow from Z to Y and no backdoor path from Z to Y that bypasses X.

This is the assumption that does the heaviest lifting in an IV analysis, and it is untestable with data alone. You have to argue for it on substantive grounds.

Violations of Exclusion Restriction

If the instrument directly affects the outcome, the IV estimate conflates the causal effect of treatment with the instrument's direct effect. Suppose Z is a physician's prescribing preference (used as an instrument for medication use). If that preference also correlates with other aspects of care the physician provides, the exclusion restriction is violated.

When the exclusion restriction fails, the IV estimate is biased, and the direction of bias depends on the sign and magnitude of the direct effect. Careful reasoning about the causal structure, not statistical tests, is the primary way to evaluate this assumption.
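To see the mechanics, here is a small simulation (all parameter values are hypothetical) in which the instrument has a direct effect on the outcome. The simple ratio estimator then recovers the true effect plus a bias term, approximately the direct effect divided by the first-stage coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

beta = 1.0    # true treatment effect (assumed for the simulation)
gamma = 0.2   # direct effect of Z on Y: an exclusion violation
pi = 0.5      # first-stage effect of Z on X

Z = rng.normal(size=n)
U = rng.normal(size=n)                 # unmeasured confounder of X and Y
X = pi * Z + U + rng.normal(size=n)
Y = beta * X + gamma * Z + U + rng.normal(size=n)

# IV ratio estimator: cov(Z, Y) / cov(Z, X)
iv_est = np.cov(Z, Y)[0, 1] / np.cov(Z, X)[0, 1]
print(f"IV estimate = {iv_est:.3f}  (true beta = {beta}, bias = gamma/pi = {gamma / pi})")
```

The estimate lands near beta + gamma/pi = 1.4 rather than the true 1.0, showing how even a modest direct effect, amplified by a moderate first stage, distorts the result.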

Exchangeability

Instrument Independent of Potential Outcomes

Exchangeability requires that the instrument Z is independent of the potential outcomes Y^a. Put differently, the instrument should be "as good as randomly assigned" with respect to the outcome. This rules out any common cause of Z and Y that doesn't go through X.

This ensures that different levels of the instrument correspond to groups that are comparable in their potential outcomes, so the variation the instrument creates is unconfounded.


Instrument Independent of Measured Confounders

One way to partially check this: examine whether measured covariates are balanced across levels of the instrument. If people with Z = 1 look systematically different from those with Z = 0 on observed characteristics, that's a red flag.

Balance checks on measured covariates are helpful but not definitive. Passing them is necessary but not sufficient for exchangeability.
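A balance check of this kind can be sketched with standardized mean differences (the covariates here are made up for illustration; the 0.1 threshold is a common convention, not a formal test):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical measured covariates; Z is assigned independently of them here
Z = rng.binomial(1, 0.5, n)
covariates = {
    "age": rng.normal(50, 10, n),
    "baseline_score": rng.normal(0, 1, n),
}

# Standardized mean difference across instrument levels for each covariate
smds = {}
for name, x in covariates.items():
    x1, x0 = x[Z == 1], x[Z == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    smds[name] = (x1.mean() - x0.mean()) / pooled_sd
    print(f"{name}: SMD = {smds[name]:+.3f}")  # |SMD| > 0.1 is a common red flag
```

Small SMDs on measured covariates are reassuring but, as noted above, say nothing about balance on unmeasured confounders.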

Instrument Independent of Unmeasured Confounders

The harder part is that the instrument must also be independent of unmeasured confounders affecting both treatment and outcome. This is untestable by definition, since you can't check balance on variables you haven't measured.

The plausibility of this assumption rests entirely on your understanding of the data-generating process. Natural experiments (draft lotteries, geographic distance to a facility, Mendelian randomization) are popular IV strategies precisely because randomization or quasi-randomization makes this assumption more credible.

Monotonicity Assumption

Concept of Defiers

To understand monotonicity, you need the four compliance types defined by how individuals respond to the instrument:

  • Compliers: take treatment when Z = 1, don't take it when Z = 0
  • Always-takers: take treatment regardless of Z
  • Never-takers: refuse treatment regardless of Z
  • Defiers: do the opposite of what the instrument encourages (take treatment when Z = 0, refuse when Z = 1)

Always-takers and never-takers don't contribute to the IV estimate because their treatment status doesn't change with the instrument. Compliers are the group whose behavior the instrument actually shifts.

Absence of Defiers

Monotonicity assumes there are no defiers in the population. The instrument must push everyone in the same direction (or not at all). Nobody systematically does the opposite of what the instrument encourages.

Why does this matter? If defiers exist, their treatment effects get mixed in with the compliers' effects but with the wrong sign, making the IV estimate uninterpretable. With monotonicity satisfied, the IV estimate has a clean interpretation as the local average treatment effect (LATE) for compliers.

In practice, monotonicity is most plausible when the instrument operates through a simple, one-directional mechanism. For example, being randomly assigned an encouragement to exercise is unlikely to cause anyone to exercise less.
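The compliance logic above can be illustrated with a simulation (compliance shares and effect sizes are invented for illustration). With no defiers, always-takers and never-takers cancel out of the ratio, which recovers the compliers' effect rather than the population average:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Hypothetical compliance types: 60% compliers, 20% always-, 20% never-takers
u = rng.uniform(size=n)
complier = u < 0.6
always = (u >= 0.6) & (u < 0.8)

Z = rng.binomial(1, 0.5, n)
X = np.where(always, 1, np.where(complier, Z, 0))   # no defiers: monotonicity holds

# Heterogeneous effects: compliers gain 2.0, always-takers 5.0 (never-takers untreated)
effect = np.where(complier, 2.0, np.where(always, 5.0, 1.0))
Y = effect * X + rng.normal(size=n)

# Ratio of mean differences recovers the compliers' effect (the LATE), ~2.0,
# not the population-average effect
wald = (Y[Z == 1].mean() - Y[Z == 0].mean()) / (X[Z == 1].mean() - X[Z == 0].mean())
print(f"IV (Wald) estimate = {wald:.2f}")
```

Always-takers contribute the same outcome in both instrument arms, so their larger effect drops out of the numerator; only the compliers' behavior varies with Z.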

Consequences of Violated Assumptions

Bias in IV Estimates

Each assumption violation produces different problems:

  • Weak instruments (relevance violation): bias toward the confounded OLS estimate, unreliable inference
  • Exclusion restriction violation: bias that reflects the instrument's direct effect on the outcome
  • Exchangeability violation: bias from confounding between the instrument and the outcome, similar in spirit to omitted variable bias in OLS
  • Monotonicity violation: the LATE interpretation breaks down, and the estimate may not correspond to any well-defined causal quantity

The severity of bias depends on how badly the assumption is violated. Small violations may produce small bias, but you generally can't know the magnitude without strong assumptions.

Sensitivity Analyses for Violations

Since most IV assumptions are untestable, sensitivity analyses help you assess how fragile your conclusions are:

  • For the exclusion restriction, you can ask: how large would a direct effect of Z on Y need to be to nullify or reverse the IV estimate?
  • For exchangeability, you can simulate unmeasured confounders of varying strength and see how the estimate shifts.
  • For instrument strength, you can report weak-instrument-robust confidence intervals (e.g., Anderson-Rubin confidence sets) that remain valid even with weak instruments.

These analyses don't prove your assumptions hold, but they show whether your conclusions depend on the assumptions holding exactly or whether they're robust to plausible violations.
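One simple version of the exclusion-restriction sensitivity question can be sketched as follows: subtract a hypothesized direct effect gamma of Z on Y from the reduced-form covariance and re-form the IV ratio. The data and the true effect of 1.0 are simulated assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Simulated data with a valid instrument and true treatment effect 1.0 (assumed)
Z = rng.normal(size=n)
U = rng.normal(size=n)
X = 0.5 * Z + U + rng.normal(size=n)
Y = 1.0 * X + U + rng.normal(size=n)

cov_zy = np.cov(Z, Y)[0, 1]
cov_zx = np.cov(Z, X)[0, 1]
var_z = Z.var(ddof=1)

# How would the estimate change if Z had a direct effect gamma on Y?
for gamma in [0.0, 0.1, 0.25, 0.5]:
    adjusted = (cov_zy - gamma * var_z) / cov_zx
    print(f"gamma = {gamma:.2f} -> adjusted IV estimate = {adjusted:.3f}")
```

Here a direct effect of about 0.5 (equal to the first-stage coefficient) would be needed to drive the estimate to zero; reporting that breakeven value lets readers judge whether such a violation is plausible.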


Considerations for Choosing Instruments

Subject Matter Knowledge

Good instruments come from understanding the causal structure of your problem. You need a convincing story for why the proposed instrument satisfies all four assumptions. Common sources of instruments include:

  • Policy changes or natural experiments (e.g., a lottery that determines eligibility)
  • Geographic or institutional variation (e.g., distance to a treatment facility)
  • Genetic variants (Mendelian randomization)

The best instruments have a clear, well-understood mechanism linking them to treatment and a compelling argument for why they don't affect the outcome through other channels.

Data-Driven Approaches

Statistical methods can help screen for candidate instruments by identifying variables that strongly predict treatment. Machine learning approaches can be useful here, particularly for constructing instruments from many weak predictors.

However, data-driven methods can only assess relevance. They cannot tell you whether the exclusion restriction or exchangeability holds. A variable that predicts treatment perfectly but violates the exclusion restriction is worse than useless. Always pair data-driven screening with substantive reasoning.

Interpretation of IV Estimates

Local Average Treatment Effect (LATE)

Under the four assumptions, the IV estimand is the LATE: the average treatment effect among compliers. This is not the average treatment effect for the whole population.

Compliers are the people whose treatment status actually changes in response to the instrument. If the instrument is a draft lottery for military service, compliers are those who serve when drafted but wouldn't volunteer otherwise. The LATE tells you the causal effect of service for that specific group.

Generalizability of IV Estimates

The LATE applies to compliers, and compliers may differ from always-takers, never-takers, and the general population. This limits external validity.

Key questions to ask:

  • Who are the compliers? Can you characterize them using observed covariates?
  • Are compliers likely to have larger or smaller treatment effects than the general population?
  • Does the specific instrument define a complier population that's relevant to the policy question you care about?

Different instruments for the same treatment can identify different complier populations and therefore produce different LATEs. This isn't a contradiction; it reflects genuine heterogeneity in treatment effects across subgroups.

Extensions of IV Methods

Multiple Instruments

When multiple instruments are available for the same treatment, you can:

  • Combine them to improve efficiency (tighter estimates)
  • Run overidentification tests (e.g., Sargan/Hansen test) to check whether the instruments produce consistent estimates. If they don't, at least one instrument likely violates the exclusion restriction.

Using multiple instruments requires that each instrument individually satisfies the IV assumptions. It also raises questions about whether the LATEs identified by different instruments are the same, which depends on treatment effect homogeneity.
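A minimal 2SLS sketch with two instruments (simulated data, with both instruments assumed valid) shows the efficiency idea: the first stage projects the treatment onto both instruments at once, and the second stage regresses the outcome on the fitted values:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000

# Two assumed-valid instruments, one treatment, true effect 1.0
Z1 = rng.normal(size=n)
Z2 = rng.normal(size=n)
U = rng.normal(size=n)                       # unmeasured confounder
X = 0.4 * Z1 + 0.3 * Z2 + U + rng.normal(size=n)
Y = 1.0 * X + U + rng.normal(size=n)

# First stage: project X onto both instruments (plus intercept)
W = np.column_stack([np.ones(n), Z1, Z2])
X_hat = W @ np.linalg.lstsq(W, X, rcond=None)[0]

# Second stage: regress Y on the fitted values
D = np.column_stack([np.ones(n), X_hat])
beta_2sls = np.linalg.lstsq(D, Y, rcond=None)[0][1]
print(f"2SLS estimate = {beta_2sls:.3f}")
```

Note that the standard errors printed by a naive second-stage regression like this would be wrong; proper 2SLS software corrects them, which is one reason to prefer a dedicated implementation in real analyses.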

Nonlinear Models with IV

Standard IV (two-stage least squares, or 2SLS) assumes a linear relationship between treatment and outcome. Extensions exist for nonlinear settings:

  • Binary outcomes can be handled with IV probit or bivariate probit models
  • Nonlinear first stages can be accommodated, though care is needed to avoid the "forbidden regression" (plugging fitted values from a nonlinear first stage into a linear second stage)

These extensions require additional modeling assumptions (correct functional form, distributional assumptions) on top of the standard IV assumptions, so they demand extra scrutiny.