Evaluating causal evidence
Measuring causal effects is only the first step. Before using an estimate to guide a business decision, we need to ask: how much should we trust this number? Two questions structure this lecture. First, what general principles separate trustworthy evidence from unreliable evidence? Second, what specific diagnostic checks test each method’s identifying assumptions? Section 1 develops the conceptual toolkit: validity, the hierarchy of evidence designs, and the two-stage progression from a raw estimate to actionable evidence (is the number reliable, and does it actually matter?). Section 2 turns to the method-specific diagnostics for experiments, matching, and synthetic control.
1. Evaluating evidence
Internal and external validity
Every causal study faces two distinct questions about the quality of its evidence. Internal validity asks whether the causal estimate is correct for the study population. Threats to internal validity include selection bias, confounding, and measurement error. When any of these are present, the number itself may be wrong. External validity asks whether the estimate generalizes to other populations or settings. An internally valid estimate from a narrow population may still fail to predict outcomes in a different market, time period, or customer segment.
| Validity Type | Question | Threats |
|---|---|---|
| Internal validity | Is the causal estimate correct for this study population? | Selection bias, confounding, measurement error |
| External validity | Does the estimate generalize to other populations or settings? | Sample selection, context dependence, time effects |
Internal validity is a prerequisite. An estimate that is wrong for the study population tells us nothing about any other setting. But even a perfectly valid estimate may not travel well, so both questions matter for decision-making.
Hierarchy of evidence
Not all research designs provide equally credible evidence. The strength of a causal claim depends on how effectively the design rules out alternative explanations for the observed association.
At the top sit experiments, where random assignment of units to treatment and control eliminates selection bias by construction. The middle tier contains observational causal studies — methods such as matching, difference-in-differences, instrumental variables, regression discontinuity, and synthetic control — that attempt to mimic random assignment by exploiting natural variation or conditioning on observables. These designs produce credible estimates when their identifying assumptions hold, but those assumptions cannot be verified from the data alone. The bottom tier contains time-series approaches that track a metric before and after an intervention without constructing an explicit counterfactual. These designs are the easiest to implement but the most vulnerable to confounding from concurrent events.
A deeper distinction cuts across these tiers. Design-based methods derive their credibility from features of the research design itself — randomization, a natural experiment, a sharp eligibility threshold — rather than from modeling assumptions about functional form. Model-based methods, by contrast, rely on getting the statistical model right (the correct covariates, the correct functional relationship) to identify the causal effect. Design-based approaches are more robust because their validity does not hinge on whether the researcher chose the right specification. Experiments are design-based by construction. Among observational methods, regression discontinuity and instrumental variables lean design-based, while standard regression and matching lean model-based (though matching with strong overlap can approach design-based credibility).
The hierarchy is a useful prior about design potential, but execution quality can invert it. A badly run experiment — one with broken randomization, high differential attrition, or widespread non-compliance — can produce worse evidence than a carefully executed observational study with strong diagnostics. The tier tells you the ceiling of what a design can deliver. How well it is implemented determines where you actually land. This is why Section 2 devotes so much attention to method-specific diagnostic checks. They are what separate a design’s theoretical promise from its realized credibility.
Higher tiers are not always feasible. The goal is to use the strongest design available and to be transparent about the assumptions required by weaker designs.
From estimate to evidence
A single number from a single study — even a well-designed one — is still just one number. Turning that number into evidence requires answering two questions in sequence. Is this number reliable? And does it actually matter? The first question earns trust in the estimate. The second asks whether a trusted estimate warrants action.
Stage 1: is this number reliable?
Three complementary strategies probe whether a causal estimate deserves confidence, each attacking it from a different angle.
Robustness checks vary the analytical choices that the researcher controls and ask whether the estimate holds. Every study requires decisions about model specification, sample boundaries, and which covariates to include. Reasonable analysts could make different choices at each step. If the central result survives a range of these alternatives, you have a cloud of estimates pointing in the same direction rather than a lone point. If the result is sensitive to seemingly innocuous choices, that fragility is itself informative.
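As a sketch of this idea, the snippet below works on simulated data with a true effect of 2.0 built in (all numbers and variable names are invented for illustration). It re-estimates the treatment effect under several covariate specifications and measures the spread of the resulting estimates; a small spread is the "cloud of estimates pointing in the same direction" described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

# Simulated data: a randomized treatment with a true effect of 2.0,
# two real covariates, and one pure-noise column.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
noise_col = rng.normal(size=n)
treated = rng.binomial(1, 0.5, size=n)
y = 2.0 * treated + 0.5 * x1 - 0.3 * x2 + rng.normal(size=n)

def ols_effect(covariates):
    """OLS estimate of the treatment coefficient under a given covariate set."""
    X = np.column_stack([np.ones(n), treated] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Vary the specification: a robust result should be stable across
# reasonable analytical choices.
estimates = {
    "no covariates": ols_effect([]),
    "x1 only": ols_effect([x1]),
    "x1 + x2": ols_effect([x1, x2]),
    "x1 + x2 + noise": ols_effect([x1, x2, noise_col]),
}
spread = max(estimates.values()) - min(estimates.values())
```

Because treatment is randomized here, every specification recovers roughly 2.0; in observational data, sensitivity of the estimate to covariate choices would itself be a warning sign.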
Sensitivity analysis asks how severely the method’s key assumptions would need to be violated to overturn the result. Rather than testing whether the assumptions hold — which is typically impossible with observational data — it quantifies the margin of safety. An estimate that survives large hypothetical violations is more credible than one that collapses under modest departures from the identifying assumptions. Each causal method has its own sensitivity tools. Rosenbaum bounds apply to matching, Oster bounds to regression, and pre-treatment fit sensitivity to synthetic control. Section 2 covers these method-specific implementations in detail.
Placebo tests apply the causal method in settings where the true effect is known to be zero. If the method detects a significant effect where none should exist, something is wrong with the design, the data, or the implementation. Three variants target different dimensions of the analysis: outcomes that should be unaffected, time periods before the intervention occurred, and units that were never treated.
| Type | Approach | Example |
|---|---|---|
| Placebo outcomes | Apply the method to outcomes that should not be affected by treatment | Treatment is a marketing campaign. The placebo outcome is warehouse temperature |
| Placebo treatments | Assign treatment at a time or threshold where no real treatment occurred | In a DiD design, test for a “treatment effect” two periods before the actual intervention |
| Placebo units | Apply the method to units that were not actually treated | In synthetic control, run the method on a control unit and check if a gap appears |
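A minimal illustration of a placebo-treatment test, on a simulated before/after series with a made-up intervention at period 24: the same naive estimator is applied at a placebo date that lies entirely inside the pre-period, where the true effect is zero by construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated monthly metric: flat at 100 until a real intervention at t = 24
# adds +5 (all values here are invented for illustration).
t = np.arange(36)
real_start = 24
y = 100 + 5 * (t >= real_start) + rng.normal(0, 1, size=36)

def before_after_effect(series, start):
    """Naive before/after estimate: post-window mean minus pre-window mean."""
    return series[start:].mean() - series[:start].mean()

real_effect = before_after_effect(y, real_start)

# Placebo treatment: apply the same estimator at t = 12, inside the
# pre-intervention window, where the true effect is zero by construction.
placebo_effect = before_after_effect(y[:real_start], 12)
```

A placebo "effect" comparable in size to the real one would indicate that the estimator is picking up trends or noise rather than the intervention.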
An estimate that survives all three stress tests has earned a degree of trust. The next question is whether that trusted number is worth acting on.
Stage 2: does it actually matter?
A result can be statistically significant without being practically meaningful, and the reverse is equally true. Statistical significance asks whether the observed effect could have arisen by chance alone, typically assessed through p-values and confidence intervals. Practical significance asks whether the effect is large enough to matter for the decision at hand, assessed through effect sizes and cost-benefit analysis.
Large samples can make even trivially small effects statistically significant, while small samples may fail to detect genuinely important effects. Decision-makers need evidence on both fronts. The effect must be real, and it must be large enough to justify action.
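The snippet below illustrates the gap between the two notions on simulated data: with a million observations per arm, a lift of 0.05 standard deviations (trivially small by conventional effect-size standards) is nonetheless strongly statistically significant.

```python
import numpy as np

rng = np.random.default_rng(2)

# A tiny true lift (0.05 sd) becomes statistically significant at huge n.
n = 1_000_000
control = rng.normal(0.0, 1.0, size=n)
treated = rng.normal(0.05, 1.0, size=n)

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / n + control.var(ddof=1) / n)
z = diff / se  # far beyond the 1.96 threshold: "statistically significant"

# Practical significance: a standardized effect size (Cohen's d) puts the
# same difference on a scale a decision-maker can interpret.
cohens_d = diff / np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
```

The z-statistic lands far above 1.96 while Cohen's d stays around 0.05, well below even the conventional "small effect" benchmark of 0.2: real, but possibly not worth acting on.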
Replication
Stress tests probe a single study from within. Replication tests the conclusion from outside: different researchers, working with different data and different methods, ask the same causal question and check whether they reach similar conclusions. When multiple independent studies converge on the same answer, the evidence is far more credible than any single estimate — however thoroughly stress-tested — can be on its own.
2. Method-specific diagnostics
Each causal method rests on its own set of identifying assumptions, and each assumption has diagnostic checks designed to detect violations. The diagnostics below cover the three methods developed in this course — experiments, matching, and synthetic control. Other designs (difference-in-differences, instrumental variables, regression discontinuity) have analogous checks.
Experiments (RCTs)
The credibility of an experiment rests on whether randomization actually delivered comparable groups and whether that comparability survived through the end of the study.
Randomization integrity verifies that treatment and control groups are balanced on observable characteristics at baseline. If randomization worked, the two groups should look statistically indistinguishable before the intervention begins. The standard diagnostic compares covariate means across groups using standardized mean differences or joint balance tests. Systematic differences in baseline characteristics signal that random assignment may have been compromised — whether through a flawed randomization procedure, post-assignment sorting, or selective enrollment.
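A minimal balance check on simulated baseline data (the covariate names and sample sizes are made up): compute the standardized mean difference for each covariate and compare it against the common 0.1 flag.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Simulated baseline covariates under true randomization.
age = rng.normal(40, 10, size=n)
spend = rng.exponential(50, size=n)
treated = rng.binomial(1, 0.5, size=n).astype(bool)

def smd(x, t):
    """Standardized mean difference between treated and control."""
    m1, m0 = x[t].mean(), x[~t].mean()
    pooled_sd = np.sqrt((x[t].var(ddof=1) + x[~t].var(ddof=1)) / 2)
    return (m1 - m0) / pooled_sd

balance = {"age": smd(age, treated), "spend": smd(spend, treated)}
# Under clean randomization, |SMD| should be small; a common flag is > 0.1.
```

In a real study this check runs over every baseline covariate, often alongside a joint test that all covariates together fail to predict assignment.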
Attrition tracks whether units drop out of the study at different rates depending on their treatment status. Even if randomization was flawless at the start, differential attrition can destroy the balance it created. When treated units who experience negative outcomes are more likely to leave the sample, the remaining treated group looks artificially good — and the estimated effect is biased upward. Attrition diagnostics compare dropout rates by treatment arm and test whether dropouts differ systematically from completers on baseline characteristics.
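A sketch of the first attrition diagnostic, using hypothetical dropout counts: a two-proportion z-test for whether dropout rates differ by treatment arm.

```python
import numpy as np

# Hypothetical attrition counts: dropouts out of those enrolled, by arm.
n_treat, drop_treat = 1_000, 180
n_ctrl, drop_ctrl = 1_000, 120

p_t, p_c = drop_treat / n_treat, drop_ctrl / n_ctrl
p_pool = (drop_treat + drop_ctrl) / (n_treat + n_ctrl)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_treat + 1 / n_ctrl))
z = (p_t - p_c) / se  # two-proportion z-test for differential attrition
differential = abs(z) > 1.96
```

Here an 18% vs. 12% dropout split yields z near 3.8, flagging differential attrition; the second half of the diagnostic would then compare dropouts to completers on baseline characteristics within each arm.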
These two diagnostics, along with non-compliance and spillover, are summarized in the table below.
| Diagnostic | What It Checks | Red Flag |
|---|---|---|
| Randomization integrity | Covariate balance between treatment and control | Systematic differences in baseline characteristics |
| Attrition | Whether dropout rates differ by treatment status | Differential attrition compromises random assignment |
| Non-compliance | Whether all units received their assigned treatment | High non-compliance dilutes the estimated effect |
| Spillover | Whether treatment of one unit affects others | Interference between units biases the estimate |
Matching and subclassification
Matching methods condition on observed covariates to make treated and control groups comparable. Their diagnostics focus on verifying that comparability, detecting gaps in overlap, and bounding the influence of unobserved confounders.
Covariate balance measures how similar the groups are on pre-treatment characteristics after adjustment. The standard metric is the standardized mean difference, which expresses the gap between treated and control group means in units of pooled standard deviations. A common threshold is an absolute standardized mean difference below 0.1, though this is a guideline rather than a rule. Love plots display standardized mean differences for all covariates before and after adjustment, making it straightforward to assess whether matching has improved balance.
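The following sketch illustrates balance improvement on simulated data, using simple 1-nearest-neighbor matching with replacement on a single covariate. (Real applications typically match on a propensity score over many covariates; this stripped-down version just shows the before/after SMD comparison a Love plot would display.)

```python
import numpy as np

rng = np.random.default_rng(4)

# Confounded assignment: treated units skew older than the control pool.
age_t = rng.normal(50, 8, size=300)    # treated
age_c = rng.normal(40, 8, size=1_500)  # control pool

def smd(a, b):
    """Standardized mean difference between two groups."""
    return (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)

smd_before = smd(age_t, age_c)

# 1-NN matching with replacement: for each treated unit, pick the closest
# control on age.
idx = np.abs(age_c[None, :] - age_t[:, None]).argmin(axis=1)
matched_c = age_c[idx]

smd_after = smd(age_t, matched_c)
```

Before matching the SMD is far above the 0.1 guideline; after matching it shrinks sharply, which is exactly the improvement a Love plot makes visible across all covariates at once.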
Common support — or overlap — requires that for any combination of covariate values, there is a positive probability of being in either the treatment or the control group. Violations mean the method is extrapolating rather than comparing like with like. Propensity score histograms and trimming rules are the standard diagnostics.
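A minimal overlap check on simulated data. For illustration it trims on a single raw covariate's observed range rather than on estimated propensity scores, but the logic is the same: treated units with no comparable controls should be flagged, not extrapolated over.

```python
import numpy as np

rng = np.random.default_rng(5)

# Treated units skew high on x; the upper tail has no comparable controls.
x_c = rng.uniform(0, 10, size=1_000)  # controls cover roughly [0, 10]
x_t = rng.uniform(6, 14, size=300)    # treated cover roughly [6, 14]

# Common-support check: trim treated units that fall outside the control
# group's observed range.
lo, hi = x_c.min(), x_c.max()
on_support = (x_t >= lo) & (x_t <= hi)
share_off_support = 1 - on_support.mean()
```

By construction about half the treated units here lie above every control; an effect estimated on the full treated sample would rest on pure extrapolation for that half.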
| Diagnostic | What It Checks | Red Flag |
|---|---|---|
| Balance improvement | Whether matching reduced covariate imbalance (Love plots, SMD) | SMDs remain large after matching |
| Common support | Whether treated and control overlap in covariate space | Many treated units have no comparable controls |
| Hidden bias sensitivity | How large an unobserved confounder would need to be (Rosenbaum \(\Gamma\)) | Effect disappears at low values of \(\Gamma\) |
Synthetic control
Synthetic control methods construct a counterfactual by weighting untreated units to match the treated unit’s pre-treatment trajectory. Their diagnostics test the quality of that match and the statistical significance of the post-treatment gap.
Pre-treatment fit is the foundation. If the synthetic control does not track the treated unit closely before the intervention, the post-treatment gap is not credible. Pre-treatment root mean squared prediction error (RMSPE) quantifies the fit. When pre-treatment trends between the synthetic and treated unit diverge, the estimated effect may reflect pre-existing differences rather than the intervention. A poor pre-treatment fit does not necessarily mean the method failed — it may mean that the available donor pool cannot reproduce the treated unit’s trajectory, which is itself a finding about the limits of the comparison.
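A sketch of the RMSPE computation on simulated paths. The "synthetic control" here is simply the treated path plus noise, with a built-in post-period gap of 4, so the numbers are illustrative only; in practice the synthetic path comes from the fitted donor weights.

```python
import numpy as np

rng = np.random.default_rng(6)

T0 = 20  # number of pre-treatment periods
# Simulated treated trajectory with an upward trend.
treated_path = np.cumsum(rng.normal(1, 0.5, size=30)) + 100
# Stand-in synthetic control: tracks the treated unit closely pre-treatment,
# then diverges by 4 after the intervention.
synthetic_path = treated_path + rng.normal(0, 0.3, size=30)
synthetic_path[T0:] -= 4.0

# Pre-treatment fit: root mean squared prediction error before treatment.
pre_rmspe = np.sqrt(np.mean((treated_path[:T0] - synthetic_path[:T0]) ** 2))
# Estimated effect: average post-treatment gap, treated minus synthetic.
post_gap = (treated_path[T0:] - synthetic_path[T0:]).mean()
```

A post-treatment gap is only interpretable when the pre-treatment RMSPE is small relative to the outcome's scale, which is what this diagnostic quantifies.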
Placebo gaps test whether the post-treatment effect is unusual relative to what the method finds for untreated units. By running the synthetic control method on each unit in the donor pool — none of which received the actual treatment — you generate a distribution of placebo effects. If many placebo units show gaps as large as the treated unit, the estimated effect is not distinguishable from noise. The standard visualization overlays placebo gaps on the treated unit’s gap; an effect that stands out from the placebo distribution is more credible. RMSPE ratios (post-treatment RMSPE divided by pre-treatment RMSPE) formalize this comparison by accounting for differences in pre-treatment fit quality across units.
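The ratio-based comparison can be sketched as follows, on simulated gap series where the treated unit has a genuine post-period effect and the 24 placebo (donor) units are pure noise. The permutation-style p-value at the end is the fraction of units with a ratio at least as extreme as the treated unit's.

```python
import numpy as np

rng = np.random.default_rng(7)
T, T0 = 30, 20  # total periods, pre-treatment periods

def rmspe(gap):
    """Root mean squared prediction error of a gap series."""
    return np.sqrt(np.mean(gap ** 2))

def rmspe_ratio(gap):
    """Post-treatment RMSPE divided by pre-treatment RMSPE."""
    return rmspe(gap[T0:]) / rmspe(gap[:T0])

# Gap series (unit minus its synthetic control).  The treated unit has a
# real post-treatment effect of about 4; the placebo units are pure noise.
treated_gap = np.r_[rng.normal(0, 0.5, T0), rng.normal(4, 0.5, T - T0)]
placebo_gaps = rng.normal(0, 0.5, size=(24, T))

ratios = np.array([rmspe_ratio(g) for g in placebo_gaps])
treated_ratio = rmspe_ratio(treated_gap)

# Permutation-style p-value: share of units (treated included) whose ratio
# is at least as extreme as the treated unit's.
p_value = (np.sum(ratios >= treated_ratio) + 1) / (len(ratios) + 1)
```

Dividing by the pre-treatment RMSPE is what makes units with different fit quality comparable: a large post-period gap means little for a unit whose synthetic control fit poorly even before treatment.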
| Diagnostic | What It Checks | Red Flag |
|---|---|---|
| Pre-treatment fit (RMSPE) | How well the synthetic control tracks the treated unit before treatment | Large pre-treatment gaps undermine post-treatment comparisons |
| Placebo gaps | Whether placebo units show similar post-treatment gaps (see Section 1) | Many placebos have gaps as large as the treated unit |
| RMSPE ratios | Post/pre RMSPE ratio relative to the placebo distribution | Treated unit’s ratio is not extreme relative to placebos |
| Donor composition | Whether the synthetic control relies on sensible weights | Weights concentrated on dissimilar units |
Additional resources
Angrist, J. D. & Pischke, J.‑S. (2010). The credibility revolution in empirical economics: How better research design is taking the con out of econometrics. Journal of Economic Perspectives, 24(2), 3–30.
Athey, S. & Imbens, G. W. (2017). The econometrics of randomized experiments. In Handbook of Economic Field Experiments (Vol. 1, pp. 73–140). Elsevier.
Imbens, G. W. (2020). Potential outcome and directed acyclic graph approaches to causality: Relevance for empirical practice in economics. Journal of Economic Literature, 58(4), 1129–1179.
Oster, E. (2019). Unobservable selection and coefficient stability: Theory and evidence. Journal of Business & Economic Statistics, 37(2), 187–204.
Rosenbaum, P. R. (2002). Observational Studies (2nd ed.). Springer.