Evaluate Evidence

With causal estimates produced in the Measure Impact stage, the next step is to assess how much to trust them before they can inform the Allocate Resources stage. Most decision pipelines skip this step entirely, passing point estimates downstream without any structured assessment of epistemic quality. The consequences compound: when we receive ten initiative-level return estimates, we have no principled way to distinguish a well-powered randomized experiment from a time-series model fit on sparse, noisy data. Resources flow indiscriminately — too much to poorly measured initiatives, too little to well-measured ones.

The Evaluate Evidence stage in the Learn, Decide, Repeat framework

Framework — Evaluate Evidence

Every causal estimate carries two kinds of uncertainty. Statistical uncertainty — captured in confidence intervals and standard errors — reflects sampling variability and is already part of the measurement output. Epistemic uncertainty — whether the design is credible, whether its assumptions hold, whether its diagnostics pass — is not. Structured evaluation of this second kind of uncertainty is a much younger discipline than causal inference itself, with few textbook treatments and many open questions. These lectures develop the concepts, systems, and workflows for closing that gap.

Each application lecture uses one tool. The Impact Engine — Evaluate provides two evaluation strategies: a deterministic scorer based on the hierarchy of evidence designs, and an LLM-powered reviewer that reads the actual measurement artifacts. Both produce a confidence score that penalizes projected returns downstream, so better evidence enables better allocation decisions.
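Concretely, the deterministic strategy can be pictured as a lookup from design type to a base confidence that discounts projected returns. The sketch below is illustrative only: the design names and weights are invented assumptions, not the Impact Engine — Evaluate's actual scoring.

```python
# Illustrative sketch of a deterministic confidence scorer based on the
# hierarchy of evidence designs. Design names and weights are invented,
# not the Impact Engine's actual API or values.

# Base confidence by design, ordered by the hierarchy of evidence.
DESIGN_CONFIDENCE = {
    "randomized_experiment": 0.9,
    "matching": 0.6,
    "synthetic_control": 0.5,
    "time_series": 0.3,
}

def confidence_weighted_return(design: str, projected_return: float) -> float:
    """Discount a projected return by the confidence its design earns."""
    confidence = DESIGN_CONFIDENCE.get(design, 0.1)  # unknown designs score lowest
    return confidence * projected_return

# A well-powered experiment keeps most of its projected return ...
print(confidence_weighted_return("randomized_experiment", 100.0))  # 90.0
# ... while a sparse time-series estimate is heavily discounted.
print(confidence_weighted_return("time_series", 100.0))  # 30.0
```

The point of the sketch is the downstream effect: two initiatives with identical projected returns receive very different confidence-weighted returns once evidence quality is priced in.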

Tool                      | Role
------------------------- | ---------------------------------------
Impact Engine — Evaluate  | Confidence scoring for causal estimates

The material is organized in three sections. Evidence quality develops the conceptual toolkit for judging causal evidence — validity, diagnostic checks, and the hierarchy of designs. Automated assessment develops the principles and design patterns for building agentic evaluation systems that produce defensible confidence scores automatically. Evaluation pipeline runs the full pipeline end-to-end, demonstrating how automated assessment translates measurement output into the confidence-weighted returns that drive resource allocation.

Evidence quality

This section develops the diagnostic framework for judging causal evidence — the vocabulary and checks that distinguish trustworthy estimates from unreliable ones, whether applied manually or by an automated system.

Causal diagnostics

We introduce internal and external validity, statistical versus practical significance, and the hierarchy of evidence designs. We then examine the diagnostic checks shared across all causal methods and the method-specific tests for experiments, matching, and synthetic control.
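As a preview of that distinction, diagnostics split into checks shared across all methods and checks specific to one design. The sketch below illustrates the split with two invented checks; the function names and thresholds are assumptions, not the lecture's actual tests.

```python
# Illustrative sketch of shared vs method-specific diagnostics.
# Check names and thresholds are assumptions, not the course's actual tests.

def check_precision(point_estimate: float, std_error: float) -> bool:
    """Shared check: is the estimate distinguishable from sampling noise?"""
    if std_error <= 0:
        return False
    return abs(point_estimate) / std_error > 1.96  # ~95% two-sided

def check_balance(treated_mean: float, control_mean: float,
                  pooled_sd: float, threshold: float = 0.1) -> bool:
    """Matching-specific check: standardized mean difference on a covariate."""
    smd = abs(treated_mean - control_mean) / pooled_sd
    return smd < threshold

print(check_precision(2.0, 0.5))       # True: t-statistic of 4
print(check_balance(10.2, 10.1, 2.0))  # True: SMD of about 0.05
```

A shared check like precision applies to every estimate; a balance check only makes sense once a matching design has constructed comparable groups.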

Automated assessment

This section shifts from what to evaluate to how to build systems that evaluate quality at scale. It develops the principles that make automated confidence scoring defensible — the evaluation task, the four pillars of defensible confidence, evaluation architectures, and the evaluation harness — and then examines the software patterns that enforce those principles in the Impact Engine — Evaluate.

Agentic evaluation

We develop the four pillars of defensible confidence, evaluation architectures as compositional building blocks (Judge, Jury, Reviewer, Debate), and the evaluation harness that validates the system through internal and external validity tests. We then examine how the Impact Engine — Evaluate enforces these principles through registry dispatch, prompt engineering as software, layered specialization, and structured output.
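The registry-dispatch and structured-output patterns can be sketched briefly. Everything below is hypothetical: the Review dataclass, the register decorator, and the sample-ratio check are stand-ins for the real reviewers, not the Impact Engine — Evaluate's code.

```python
# Hypothetical sketch of registry dispatch with structured output.
# Class names, fields, and the SRM check are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class Review:
    """Structured output: a score plus the reasons behind it."""
    confidence: float              # in [0, 1]
    findings: list = field(default_factory=list)

REVIEWER_REGISTRY = {}

def register(method: str):
    """Register a method-specific reviewer under its design name."""
    def decorator(fn):
        REVIEWER_REGISTRY[method] = fn
        return fn
    return decorator

@register("experiment")
def review_experiment(artifact: dict) -> Review:
    """Invented experiment reviewer: flag a sample ratio mismatch."""
    findings = []
    if artifact.get("srm_pass") is False:
        findings.append("sample ratio mismatch")
    score = 0.9 if not findings else 0.4
    return Review(confidence=score, findings=findings)

def dispatch(artifact: dict) -> Review:
    """Route an artifact to the reviewer registered for its design."""
    return REVIEWER_REGISTRY[artifact["design"]](artifact)

print(dispatch({"design": "experiment", "srm_pass": True}).confidence)  # 0.9
```

Dispatching on the design name keeps method-specific review logic isolated, while the structured Review object gives downstream stages a score they can act on and findings a human can audit.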

Evaluation pipeline

This section applies the conceptual and engineering foundations from the preceding sections. It runs the full evaluation pipeline on mock measurement output and validates the system’s ability to discriminate between strong and weak evidence.

Automated review

We run the Impact Engine — Evaluate end-to-end, applying both evaluation strategies — deterministic scoring and agentic review — to mock measurement output. The lecture then examines the Correctness pillar directly: by running known-clean and known-flaw artifacts through the reviewer, we demonstrate how an automated assessment system can be validated in practice.
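The known-clean versus known-flaw validation idea can be sketched in a few lines. The review function below is a stand-in for the agentic reviewer, with an invented per-flaw penalty; the substance is the ranking check at the end.

```python
# Minimal sketch of validating a reviewer with known-clean and known-flaw
# artifacts. The scoring logic is a stand-in for the real agentic reviewer.

def review(artifact: dict) -> float:
    """Stand-in reviewer: penalize each flagged flaw in the artifact."""
    score = 1.0
    for _flaw in artifact.get("flaws", []):
        score -= 0.25  # invented per-flaw penalty
    return max(score, 0.0)

known_clean = {"flaws": []}
known_flaw = {"flaws": ["failed placebo test", "pre-trend violation"]}

# Correctness check: the reviewer must rank clean evidence above flawed.
assert review(known_clean) > review(known_flaw)
print(review(known_clean), review(known_flaw))  # 1.0 0.5
```

Seeding the harness with artifacts whose verdict is known in advance turns "does the reviewer work?" into a testable claim rather than an article of faith.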