Structural causal models (SCMs)
Structural causal models (SCMs) combine graphs and equations to represent how variables influence each other in a system. They let you move beyond correlation to estimate causal effects, evaluate interventions, and reason about counterfactuals. An SCM encodes your assumptions about causal structure, then gives you the mathematical machinery to work with those assumptions rigorously.
Definition of SCMs
An SCM is a mathematical model describing the causal relationships between variables in a system. Formally, it consists of three pieces:
- A set of variables (both observed and unobserved)
- A directed acyclic graph (DAG) representing the causal structure, where edges indicate direct causal influences
- A set of structural equations specifying the functional relationship between each variable and its direct causes
Together, these components also imply a joint probability distribution over the variables. The DAG tells you the qualitative story (what causes what), and the structural equations tell you the quantitative story (how much, and in what way).
Endogenous vs. exogenous variables
SCMs distinguish between two types of variables:
- Endogenous variables are determined by other variables within the model. They have at least one incoming edge in the DAG.
- Exogenous variables are determined by factors outside the model. No other variable in the SCM causes them, so they appear as root nodes (no incoming edges) in the DAG.
Exogenous variables are typically assumed to be independently distributed. They represent the "inputs" to the system that the model takes as given rather than explaining.
Structural equations
Structural equations specify how each endogenous variable is generated from its direct causes plus an error term. The error term captures unobserved factors or randomness.
For example, if Y is caused by X:
Y = f(X, U)
Here f is some function, X is the direct cause, and U is the error term. A critical point: structural equations are not the same as regression equations. A regression equation describes a statistical association, while a structural equation claims a causal mechanism. Changing X in a structural equation changes Y; changing X in a regression equation doesn't necessarily mean anything causal.
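As a minimal sketch of a structural equation in code (the variables, coefficients, and noise distributions here are illustrative, not from the text), each variable is generated as a function of its parents plus its own independent noise term:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Exogenous noise terms (the U's): independent by assumption.
u_x = rng.normal(size=n)
u_y = rng.normal(size=n)

# Structural equations: each variable is a function of its
# parents plus its own noise term.
x = u_x                # X has no parents: X := U_X
y = 2.0 * x + u_y      # Y := f(X, U_Y) = 2*X + U_Y

# Because the model is correctly specified, the regression slope
# of Y on X recovers the structural coefficient (≈ 2.0).
slope = np.cov(x, y)[0, 1] / np.var(x)
```

Here the regression and structural coefficients happen to coincide only because X is exogenous; once confounding enters (see the intervention examples later), the two diverge.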
Directed acyclic graphs (DAGs)
DAGs are the graphical backbone of an SCM:
- Nodes represent variables
- Directed edges (arrows) represent direct causal influences
- The absence of an edge between two nodes means there is no direct causal relationship between them
The "acyclic" part means there are no directed cycles. You can't follow the arrows from a variable and loop back to it. This rules out feedback loops (though extensions exist for cyclic systems, covered later).
Causal Markov condition
The causal Markov condition is a core assumption linking the DAG to the probability distribution. It states:
A variable is independent of all its non-descendants, given its parents (direct causes) in the DAG.
This assumption allows you to factorize the joint distribution according to the DAG structure. Each variable's conditional distribution depends only on its parents.
For example, in the chain X → Y → Z, the causal Markov condition tells you that X and Z are independent given Y. Once you know Y, learning X gives you no additional information about Z.
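The chain example can be checked by simulation. In this sketch (linear-Gaussian equations with illustrative coefficients), X and Z are strongly correlated marginally, but their partial correlation given Y, computed by regressing Y out of both and correlating the residuals, is near zero:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Chain X -> Y -> Z (coefficients illustrative).
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)
z = 0.8 * y + rng.normal(size=n)

# Marginally, X and Z are clearly dependent...
r_xz = np.corrcoef(x, z)[0, 1]

# ...but regressing Y out of both and correlating the residuals
# (the partial correlation given Y) removes the dependence.
x_res = x - (np.cov(y, x)[0, 1] / np.var(y)) * y
z_res = z - (np.cov(y, z)[0, 1] / np.var(y)) * y
r_xz_given_y = np.corrcoef(x_res, z_res)[0, 1]   # ≈ 0
```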
Causal sufficiency
Causal sufficiency assumes that all common causes of the observed variables are included in the model. In other words, there are no unmeasured confounders affecting multiple observed variables.
This is a strong assumption. In practice, it often doesn't hold perfectly, and violations can lead to biased causal effect estimates. When you suspect unmeasured confounding, you'll need techniques like sensitivity analysis or instrumental variables (discussed below).
Causal faithfulness
Causal faithfulness assumes that the only independencies in the data are those implied by the DAG. There are no "accidental" independencies caused by parameters perfectly canceling each other out.
For example, if two causal paths from X to Y have effects that exactly cancel (one positive, one negative, same magnitude), X and Y would appear independent even though causal paths exist. Faithfulness rules this out. Such exact cancellations are considered rare in practice, but they can occur.
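A faithfulness violation is easy to construct by hand. In this sketch (all coefficients chosen for illustration), X affects Y through a direct path of +1 and through a mediator path of (-1) × 1 = -1, so the two contributions cancel and the observed correlation is near zero despite real causal influence:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

x = rng.normal(size=n)
# Indirect path X -> M -> Y with total effect (-1) * 1 = -1.
m = -1.0 * x + rng.normal(size=n)
# Direct path X -> Y with effect +1: the paths cancel exactly.
y = 1.0 * x + 1.0 * m + rng.normal(size=n)

# X causally affects Y along both paths, yet the correlation is ~0,
# so an independence test would wrongly suggest no X-Y edge.
r_xy = np.corrcoef(x, y)[0, 1]
```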
Representing interventions with SCMs
Interventions are actions that force variables to take specific values, overriding their natural causes. SCMs give you a precise way to represent and reason about interventions, which is what separates causal reasoning from purely statistical reasoning.
Interventional distributions
An interventional distribution is the probability distribution of variables after an intervention. It's written using the do-operator:
P(Y | do(X = x))
This reads as "the probability of Y when we set X to value x." This is fundamentally different from the conditional distribution P(Y | X = x), which is what you'd observe by filtering data. The do-operator represents actively manipulating X, not passively observing it.
Graph mutilation
To derive an interventional distribution, you modify the DAG through graph mutilation:
- Identify the variable X being intervened on
- Remove all incoming edges to X (since the intervention overrides X's natural causes)
- Set X to the intervention value x
- The resulting "mutilated graph" represents the causal structure under the intervention
The rest of the DAG stays the same. Variables downstream of X are still affected by X, but X itself is no longer influenced by its former parents.
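Graph mutilation can be sketched directly in a simulator. In this hypothetical confounded model (Z causes both X and Y; all coefficients illustrative), intervening means replacing X's structural equation with a constant while leaving every other equation, including Z's, untouched:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

def sample(do_x=None):
    # Confounded model: Z -> X, Z -> Y, X -> Y.
    z = rng.normal(size=n)
    u_x = rng.normal(size=n)
    u_y = rng.normal(size=n)
    if do_x is None:
        x = 1.0 * z + u_x        # X's natural structural equation
    else:
        x = np.full(n, do_x)     # mutilation: X's incoming edge is cut
    y = 2.0 * x + 1.0 * z + u_y  # downstream equation is unchanged
    return x, y, z

# do(X = 1) vs do(X = 0): Z keeps its natural distribution, so the
# mean difference recovers the true structural effect of X on Y.
_, y_do1, _ = sample(do_x=1.0)
_, y_do0, _ = sample(do_x=0.0)
effect = y_do1.mean() - y_do0.mean()   # ≈ 2.0
```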
Truncated factorization
Truncated factorization is the mathematical counterpart of graph mutilation. Under the original DAG, the joint distribution factorizes as:
P(x1, ..., xn) = ∏i P(xi | pa(xi))
where pa(xi) are the parents of Xi. When you intervene on a variable Xi, setting it to xi, you drop the factor P(xi | pa(xi)) from the product (since Xi is now fixed, not generated by its parents) and substitute Xi = xi everywhere else. The result is the interventional distribution over the remaining variables.
Identification of causal effects
Identification asks: can you compute a causal effect from observational data alone, given your assumed DAG? If yes, the effect is "identified." If not, you need additional data or assumptions. SCMs provide graphical criteria to answer this question.
Back-door criterion
The back-door criterion is the most commonly used identification tool. A set of variables Z satisfies the back-door criterion relative to (X, Y) if:
- Z blocks all back-door paths from X to Y (paths that enter X through an incoming arrow)
- No variable in Z is a descendant of X
If such a Z exists, you can identify the causal effect by adjusting for Z:
P(y | do(x)) = Σz P(y | x, z) P(z)
For example, in the DAG with edges X ← Z → Y and X → Y, the confounder Z creates a back-door path X ← Z → Y. Conditioning on Z blocks it, satisfying the criterion.
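The adjustment formula can be computed by direct counting in a simulated binary model (the structure X ← Z → Y, X → Y with made-up probabilities). The naive contrast of Y across X values is biased by the confounder, while back-door adjustment recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500_000

# Binary DAG: Z -> X, Z -> Y, X -> Y (probabilities illustrative).
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.2))
y = rng.binomial(1, 0.2 + 0.3 * x + 0.3 * z)   # true effect of X: 0.3

# Naive (confounded) contrast: mixes in Z's influence.
naive = y[x == 1].mean() - y[x == 0].mean()

# Back-door adjustment: P(y | do(x)) = sum_z P(y | x, z) P(z).
def adjusted(x_val):
    total = 0.0
    for z_val in (0, 1):
        p_z = (z == z_val).mean()
        cell = (x == x_val) & (z == z_val)
        total += y[cell].mean() * p_z
    return total

ate = adjusted(1) - adjusted(0)   # ≈ 0.3, the true causal effect
```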
Front-door criterion
The front-door criterion applies when no set of variables satisfies the back-door criterion (e.g., because the confounder is unmeasured). A set M satisfies the front-door criterion relative to (X, Y) if:
- M intercepts all directed paths from X to Y
- There are no unblocked back-door paths from X to M
- All back-door paths from M to Y are blocked by X
A classic example: X → M → Y with an unmeasured confounder U affecting both X and Y. You can't adjust for U directly, but M (the mediator) satisfies the front-door criterion, letting you identify the causal effect through a two-step adjustment.
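The two-step adjustment can be sketched on a simulated binary model (all probabilities made up for illustration). The confounder u is generated but never used by the estimator; the front-door formula works from (x, m, y) alone:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500_000

# UNMEASURED confounder u of X and Y; X affects Y only through M.
u = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, 0.3 + 0.4 * u)
m = rng.binomial(1, 0.2 + 0.5 * x)            # M depends on X only
y = rng.binomial(1, 0.1 + 0.4 * m + 0.3 * u)  # Y depends on M and u

# Front-door formula, using the observed variables only:
#   P(y | do(x)) = sum_m P(m | x) * sum_x' P(y | m, x') P(x')
def frontdoor(x_val):
    total = 0.0
    for m_val in (0, 1):
        p_m_given_x = (m[x == x_val] == m_val).mean()
        inner = 0.0
        for x_prime in (0, 1):
            cell = (m == m_val) & (x == x_prime)
            inner += y[cell].mean() * (x == x_prime).mean()
        total += p_m_given_x * inner
    return total

# True effect: X shifts P(m=1) by 0.5, and M shifts P(y=1) by 0.4,
# so the interventional contrast is 0.5 * 0.4 = 0.2.
ate = frontdoor(1) - frontdoor(0)
```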
Instrumental variables
Instrumental variables (IVs) provide another identification strategy when unmeasured confounders are present. A variable Z is a valid instrument for the effect of X on Y if:
- Z is associated with X (relevance)
- Z does not affect Y except through X (exclusion restriction)
- Z is independent of all confounders of the X-Y relationship (independence)
A well-known example: estimating the effect of education on income. A person's quarter of birth affects years of education (through compulsory schooling laws) but has no direct effect on income, making it a candidate instrument.
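A simple IV (Wald) estimator can be demonstrated on a simulated linear model with a hidden confounder (coefficients illustrative). The naive regression of Y on X is biased upward by the confounder, while the ratio of covariances through the instrument recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000

u = rng.normal(size=n)   # unmeasured confounder of X and Y
z = rng.normal(size=n)   # instrument: affects X, has no other path to Y
x = 1.0 * z + 1.0 * u + rng.normal(size=n)
y = 2.0 * x + 1.0 * u + rng.normal(size=n)   # true effect of X is 2.0

# Naive regression slope of Y on X is biased by u.
naive = np.cov(x, y)[0, 1] / np.var(x)

# IV (Wald) estimator: cov(Z, Y) / cov(Z, X).
iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]   # ≈ 2.0
```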
Mediation analysis
Mediation analysis decomposes the total causal effect of X on Y into:
- Direct effect: the effect of X on Y not passing through a mediator M
- Indirect effect: the effect of X on Y that operates through M
For example, a drug (X) might lower blood pressure (Y) both directly and indirectly by reducing heart rate (M). SCMs let you define and estimate these effects precisely, though mediation analysis requires assumptions about the absence of unmeasured confounders for both the X-M and M-Y relationships.
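In a linear SCM the decomposition is especially transparent: the indirect effect is the product of the path coefficients through the mediator, and the total effect is direct plus indirect. A sketch with made-up coefficients:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Linear SCM: X -> M -> Y plus a direct edge X -> Y.
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)              # X -> M path: 0.5
y = 1.0 * x + 2.0 * m + rng.normal(size=n)    # direct 1.0, M -> Y: 2.0

# Estimate the paths by regressing each variable on its parents.
a = np.cov(x, m)[0, 1] / np.var(x)            # X -> M coefficient
design = np.column_stack([x, m])
b_direct, b_m = np.linalg.lstsq(design, y, rcond=None)[0]

indirect = a * b_m            # ≈ 0.5 * 2.0 = 1.0
total = b_direct + indirect   # ≈ 1.0 + 1.0 = 2.0
```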
Counterfactuals in SCMs
Counterfactuals ask "what would have happened if things had been different?" SCMs handle counterfactuals by using the structural equations to simulate alternative scenarios while holding the exogenous variables (the background context) fixed.
Potential outcomes framework
The potential outcomes framework (also called the Rubin Causal Model) defines causal effects in terms of hypothetical outcomes. For a binary treatment:
- Y(1): the outcome if the unit receives treatment
- Y(0): the outcome if the unit does not receive treatment
The individual causal effect is Y(1) - Y(0). The fundamental problem is that you only observe one of these for each unit. SCMs and potential outcomes are complementary frameworks: SCMs provide the structural machinery, while potential outcomes provide a clean notation for defining effects.
Counterfactual queries
Counterfactual queries ask about outcomes under hypothetical interventions. In an SCM, you answer them by:
- Abduction: Use the observed data to infer the values of the exogenous variables (error terms)
- Action: Modify the structural equations to reflect the hypothetical intervention
- Prediction: Use the modified equations with the inferred exogenous values to compute the counterfactual outcome
For example, "What would this patient's blood pressure have been without the drug?" You first use the patient's actual data to pin down their individual error terms, then re-run the model with the drug variable set to 0 (no drug) to predict the counterfactual blood pressure.
Note that this is different from the interventional query P(Y | do(X = x)), which asks about a population-level effect. The counterfactual is specific to a particular individual with known characteristics.
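The three steps can be walked through by hand in a tiny, hypothetical linear SCM (the equations and numbers below are invented for illustration, not taken from the text):

```python
# One-patient linear SCM (coefficients illustrative):
#   drug D := U_D
#   blood pressure B := 120 - 10*D + U_B
# Observed: the patient took the drug (D = 1) and had B = 112.
d_obs, b_obs = 1.0, 112.0

# Abduction: infer the patient's noise term from the observation.
u_b = b_obs - (120.0 - 10.0 * d_obs)   # u_b = 112 - 110 = 2

# Action: replace D's equation with the constant D := 0 (no drug).
d_cf = 0.0

# Prediction: re-run the modified model with the same noise term.
b_cf = 120.0 - 10.0 * d_cf + u_b       # = 122
```

Holding u_b fixed is what makes this an individual-level counterfactual rather than a population-level do-query: the patient's background state is carried over into the hypothetical world.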
Twin networks
Twin networks are a graphical tool for computing counterfactual quantities. They work by creating two copies of the SCM:
- The factual network represents what actually happened
- The counterfactual network represents the hypothetical scenario
The two networks share the same exogenous variables, which ties the individual's background characteristics together across both worlds. You can then read off counterfactual quantities by comparing outcomes between the two networks.
For example, in a twin network for a drug study, the factual side shows the patient taking the drug and the observed outcome, while the counterfactual side shows the same patient not taking the drug, with the same exogenous factors.
Learning SCMs from data
Learning an SCM from data involves two tasks: discovering the causal structure (the DAG) and estimating the parameters of the structural equations. Both are challenging due to limited data, latent variables, and the fact that multiple DAGs can produce the same observed distribution.
Causal structure learning
Causal structure learning aims to infer the DAG from observational data. Methods fall into three categories:
- Constraint-based methods (e.g., PC algorithm): use conditional independence tests to determine which edges belong in the DAG
- Score-based methods (e.g., GES algorithm): search over possible DAGs to find the one that best fits the data according to a scoring function
- Hybrid methods: combine both approaches
A key limitation: observational data alone can typically only identify the DAG up to its Markov equivalence class (a set of DAGs that encode the same conditional independencies). You may need additional assumptions or experimental data to distinguish between equivalent structures.
Constraint-based methods
Constraint-based methods rely on the causal Markov condition and faithfulness to infer the DAG. The PC algorithm is the most well-known:
- Start with a fully connected undirected graph
- For each pair of adjacent nodes, test whether they are conditionally independent given some subset of other variables
- Remove edges where conditional independence is found
- Orient edges using rules based on v-structures and acyclicity constraints
These methods are computationally efficient but sensitive to errors in independence tests, especially with small samples or violations of faithfulness.
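The skeleton phase of a PC-style procedure can be sketched for a three-variable linear-Gaussian system, using partial correlation as a stand-in for a proper conditional independence test (the data-generating chain and the threshold are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 200_000

# Ground-truth structure: chain X -> Y -> Z.
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)
z = 1.0 * y + rng.normal(size=n)
data = {"X": x, "Y": y, "Z": z}

def partial_corr(a, b, cond):
    # Correlate a and b after regressing out the conditioning set.
    if cond:
        C = np.column_stack([data[c] for c in cond])
        a = a - C @ np.linalg.lstsq(C, a, rcond=None)[0]
        b = b - C @ np.linalg.lstsq(C, b, rcond=None)[0]
    return np.corrcoef(a, b)[0, 1]

# Skeleton phase: start fully connected; drop an edge whenever some
# conditioning set renders the pair (near-)independent.
edges = {("X", "Y"), ("X", "Z"), ("Y", "Z")}
threshold = 0.01
for pair in list(edges):
    a, b = pair
    others = [v for v in data if v not in pair]
    for cond in ([], others):
        if abs(partial_corr(data[a], data[b], cond)) < threshold:
            edges.discard(pair)   # e.g. X-Z is dropped given Y
            break
```

After the loop, only the X-Y and Y-Z edges survive, because X and Z are independent given Y; a full PC implementation would then orient edges using v-structures and acyclicity.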
Score-based methods
Score-based methods search for the DAG that optimizes a scoring function balancing data fit and model complexity. Common scores include the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC).
The Greedy Equivalence Search (GES) algorithm:
- Start with an empty graph (no edges)
- Forward phase: iteratively add edges that most improve the score
- Backward phase: iteratively remove edges that improve the score
- Return the highest-scoring DAG
Score-based methods are less sensitive to individual test errors than constraint-based methods, but the search space grows super-exponentially with the number of variables, making them computationally expensive for large systems.
Hybrid methods
Hybrid methods combine constraint-based and score-based approaches to get the best of both worlds. A typical strategy:
- Use constraint-based methods to prune the search space (eliminate edges that are clearly absent)
- Use score-based methods to search over the remaining candidate structures
The Max-Min Hill-Climbing (MMHC) algorithm is a prominent example. It uses the MMPC algorithm to identify candidate parent-child relationships, then applies hill-climbing search to find the best-scoring DAG within that restricted space. This achieves a balance between computational efficiency and robustness.
Parameter estimation
Once you have the DAG, you need to estimate the parameters of the structural equations. The two main approaches:
- Maximum likelihood estimation (MLE): find parameter values that maximize the probability of observing the data, given the DAG structure
- Bayesian methods: specify prior distributions over parameters and update them with the data to get posterior distributions
Both approaches require assumptions about the functional form of the structural equations (e.g., linear vs. nonlinear) and the distribution of error terms (e.g., Gaussian vs. non-Gaussian). Getting these assumptions wrong can lead to poor estimates.
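For the common linear-Gaussian case, maximum likelihood factorizes over the DAG, so estimation reduces to a least-squares fit of each variable on its parents. A sketch with an assumed known DAG and made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000

# Known DAG: X -> Y <- Z. Generate data from "true" parameters.
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.5 * x - 0.7 * z + rng.normal(size=n)

# MLE for a linear-Gaussian SCM: least squares of each variable
# on its parents; the mean squared residual estimates the noise
# variance of that variable's structural equation.
parents = np.column_stack([x, z])
coef, *_ = np.linalg.lstsq(parents, y, rcond=None)
residual_var = np.mean((y - parents @ coef) ** 2)   # ≈ 1.0
```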
Applications of SCMs
SCMs are used across many fields, including epidemiology, economics, social sciences, and AI. They provide a principled framework for several practical tasks.
Causal effect estimation
SCMs let you estimate causal effects from observational data by leveraging the assumed causal structure. For example, using electronic health records, you could estimate the causal effect of a medication on patient outcomes by identifying the right adjustment set from the DAG and applying the back-door adjustment formula.
Policy evaluation
By simulating interventions on an SCM, you can predict the effects of policies before implementing them. For instance, you could model the causal relationships between taxation, income inequality, and economic growth, then simulate different tax policies to compare their predicted outcomes.
Transportability
Transportability addresses whether causal effects estimated in one population generalize to another. SCMs provide formal tools for assessing this. For example, if a clinical trial estimates a drug's effect in one demographic group, transportability analysis can determine under what conditions that estimate applies to a different population with different demographics and comorbidities.
Causal discovery
SCMs also serve as the target for causal discovery: inferring the causal structure itself from observational data. This is valuable for generating hypotheses. For example, applying causal discovery algorithms to cohort study data might reveal previously unknown causal factors for a disease, guiding future experimental research.
Limitations and extensions of SCMs
SCMs are powerful but come with assumptions that don't always hold. Understanding these limitations helps you apply SCMs responsibly.
Latent confounding
Standard SCMs assume causal sufficiency (all common causes are measured). In practice, unmeasured confounders are common and can bias causal effect estimates. For example, in studying smoking and lung cancer, unmeasured genetic factors might influence both.
Extensions to handle this include:
- Latent variable models that explicitly represent unmeasured confounders
- Sensitivity analysis that quantifies how robust your conclusions are to potential unmeasured confounding
- Bounds on causal effects when point identification isn't possible
Cyclic causal models
Standard SCMs require acyclicity, but many real-world systems involve feedback loops. Job satisfaction affects job performance, which in turn affects satisfaction. Cyclic causal models extend the SCM framework to handle such cases, though they require different assumptions and estimation techniques (e.g., equilibrium conditions or dynamic models).
Time-varying treatments
Standard SCMs typically model treatments as fixed at a single point in time. Many real-world treatments change over time (e.g., medication dosages adjusted based on patient response). Extensions like marginal structural models and structural nested models handle time-varying treatments by modeling the causal effects of treatment sequences rather than single treatment assignments. These methods use techniques like inverse probability weighting to account for time-varying confounding.