Building an agentic evaluation system
The Impact Engine automates evidence assessment as the bridge between the Measure Impact and Allocate Resources stages of the decision pipeline. This lecture develops both the principles and the engineering patterns that make automated evidence assessment defensible. The principles — defensible confidence, evaluation architectures, and the evaluation harness — apply to any agentic evaluation system. The design patterns — registry dispatch, prompt engineering as software, layered specialization, and structured output — show how these principles are instantiated in production code.
In Part I we develop the principles for defensible automated assessment and then examine the design patterns that implement them. In Part II we read the source code of the Impact Engine to see each pattern in practice.
Part I: Principles and design patterns
Agentic systems use LLMs as reasoning components within a structured software pipeline — not open-ended chat, but constrained evaluation with typed inputs and outputs. We develop the material in two layers. The first establishes the principles that make automated assessment defensible — what the system produces, what guarantees it must satisfy, how evaluation architectures compose, and how the evaluation harness validates the whole. The second examines the software design patterns that implement those principles in the Impact Engine.
1. The evaluation task
The Measure Impact stage produces a collection of artifacts for each initiative: point estimates of causal effects, standard errors, diagnostic test results, and metadata about the estimation method. The Evaluate Evidence stage consumes these artifacts and produces a confidence assessment — a structured judgment of how much to trust each estimate before it informs resource allocation.
That confidence assessment can take several forms. It might be a continuous score on a [0, 1] scale, a categorical label (high / medium / low), or a multi-dimensional profile that separates statistical precision from design credibility. The representation matters less than the underlying requirement: the assessment must be grounded in the measurement artifacts, not in intuition or convention, and it must be reproducible across evaluations.
Lecture 1 developed the diagnostic framework for making these judgments manually — internal and external validity, statistical versus practical significance, method-specific checks. The question this lecture addresses is how to automate that judgment. The evaluator is no longer a human analyst reading a study but an LLM operating within a structured pipeline, processing measurement artifacts according to explicit rubrics. The principles and patterns that follow define what such a system must guarantee and how it is built.
2. Defensible confidence
A confidence score that cannot be defended is worse than no confidence score at all. This section develops the two ideas that make automated assessment defensible. The first is the precise role the LLM plays in the evaluation pipeline. The second is the four guarantees the system must satisfy.
The LLM as aggregator
The LLM’s role is to aggregate per-dimension diagnostics that the measurement engine already produces. It does not generate confidence from its own internal probabilities or invent evidence that the artifacts do not contain. Each dimension — randomization integrity, statistical inference, threats to validity, and so on — receives a score grounded in specific artifact values. The LLM synthesizes these dimension-level assessments into an overall confidence judgment, interpreting the diagnostics within a constrained evidence set.
This division of labor is what makes the output auditable. The LLM is bounded to interpretation of evidence that exists, not generation of evidence that does not. Whether that interpretation is accurate becomes an empirical question that the evaluation harness in Section 4 addresses.
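The aggregator role can be made concrete with a small sketch. This is illustrative only: the dataclass fields and the plain mean stand in for the LLM's synthesis step, and names like `DimensionScore` are hypothetical, not the Impact Engine's actual identifiers. The point is that the overall score is a function of grounded, dimension-level assessments and nothing else.

```python
from dataclasses import dataclass

@dataclass
class DimensionScore:
    """One dimension's assessment, grounded in a named artifact value."""
    dimension: str   # e.g. "randomization_integrity"
    score: float     # in [0, 1]
    evidence: str    # the artifact field this score references

def aggregate(dimension_scores: list[DimensionScore]) -> float:
    """Synthesize dimension-level assessments into an overall confidence.

    A plain mean stands in for the LLM's synthesis; what matters is that
    the overall score derives only from grounded dimension scores.
    """
    if not dimension_scores:
        raise ValueError("no dimension scores to aggregate")
    return sum(d.score for d in dimension_scores) / len(dimension_scores)

scores = [
    DimensionScore("randomization_integrity", 0.9, "balance_test.p_value=0.64"),
    DimensionScore("statistical_inference", 0.7, "std_error=0.02"),
    DimensionScore("threats_to_validity", 0.5, "attrition_rate=0.18"),
]
overall = aggregate(scores)
```

Because each score carries an `evidence` reference, an auditor can trace the overall confidence back to specific artifact values, which is exactly the auditability the division of labor buys.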
The four pillars
Four pillars define the guarantees that a defensible evaluation system must satisfy. They form a dependency chain — each pillar builds on the one before it.
Groundedness is the foundation. Every score the system produces must reference observable artifacts — specific diagnostic values, test statistics, or metadata fields from the measurement output. A system that assigns confidence without pointing to concrete evidence has no basis for its claims. Without groundedness, there is nothing for the remaining pillars to build on.
Correctness requires that the system read the evidence accurately. A grounded system might still misinterpret a diagnostic — treating a failed balance test as passing, or inventing a concern that the artifacts do not support. Correctness means the system’s interpretation of each artifact aligns with what the artifact actually shows. This is the pillar that external validity tests (known-flaw detection, known-clean scoring) directly assess.
Traceability makes correctness inspectable. When the system produces a confidence score, the reasoning path from artifact values through dimension-level assessments to the final judgment must be fully visible. When an interpretation is wrong, the audit trail reveals exactly where the error entered — which dimension, which artifact value, which step in the reasoning. Without traceability, errors are opaque and debugging becomes guesswork.
Reproducibility ensures that the first three pillars hold consistently, not just on a single evaluation run. The same artifacts evaluated under the same configuration must produce the same scores. If the system gives different answers each time it runs, groundedness, correctness, and traceability lose their meaning — a correct answer on one run and an incorrect answer on the next provides no reliable basis for decisions.
3. Evaluation architectures
An evaluation architecture defines how many LLM passes the system uses and how they relate to each other. Four building blocks capture the space of possibilities.
Building blocks
A single pass sends the measurement artifacts through one LLM call and returns a confidence assessment. This is the simplest architecture and the cheapest to run.
A parallel architecture sends the same artifacts through multiple independent LLM calls — possibly different models, possibly different prompt variants — and aggregates the results. Independence across calls means that systematic biases in one model do not contaminate the others. Disagreement across parallel calls is itself a diagnostic signal, often indicating that the evaluation rubric is under-specified.
A sequential architecture chains two or more passes, where each pass can see the output of the previous one. The second pass acts as a critic, checking whether the first pass’s reasoning adequately addresses the evidence and challenging gaps or unsupported claims. Sequential depth comes at the cost of latency and token usage.
An adversarial architecture places two LLM calls in opposition — one argues for high confidence, the other argues against — and a structured resolution produces the final assessment. This forces explicit engagement with the strongest counterargument, surfacing weaknesses that a single perspective might overlook.
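The building blocks above can be sketched as higher-order functions, a minimal sketch under the assumption that an evaluator is any callable from artifacts to a confidence score. The combinator names and the median aggregation are illustrative choices, not prescribed by the lecture.

```python
import statistics
from typing import Callable

# An evaluator maps measurement artifacts to a confidence score in [0, 1].
Evaluator = Callable[[dict], float]

def parallel(backends: list[Evaluator]) -> Evaluator:
    """Independent calls on the same artifacts, aggregated by median."""
    def evaluate(artifacts: dict) -> float:
        return statistics.median(b(artifacts) for b in backends)
    return evaluate

def sequential(first: Evaluator, critic: Callable[[dict, float], float]) -> Evaluator:
    """The second pass sees the first pass's output and may revise it."""
    def evaluate(artifacts: dict) -> float:
        return critic(artifacts, first(artifacts))
    return evaluate

def adversarial(pro: Evaluator, con: Evaluator,
                resolve: Callable[[float, float], float]) -> Evaluator:
    """Opposing calls reconciled by a structured resolution step."""
    def evaluate(artifacts: dict) -> float:
        return resolve(pro(artifacts), con(artifacts))
    return evaluate
```

Because each combinator returns another `Evaluator`, the blocks nest arbitrarily, which is the compositional freedom discussed next.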
Named instances
These building blocks map to named evaluation patterns in the literature:
| Pattern | Building Block | Structure |
|---|---|---|
| Judge | Single pass | One LLM, one call |
| Jury | Parallel | Multiple LLMs on the same artifact with the same prompt |
| Reviewer | Sequential | First pass produces assessment, second pass critiques it |
| Debate | Adversarial | Two LLMs argue opposing positions, structured resolution decides |
Compositional freedom
The building blocks combine freely. A system might run a Jury of three models, feed the aggregated result into a sequential Reviewer pass, and reserve adversarial Debate for a subset of high-stakes initiatives. Nothing about the building blocks prescribes a fixed progression or a preferred default.
Which composition works for a given evaluation task is an empirical question, not a theoretical one. Running two LLMs in parallel and selecting one assessment at random might outperform a carefully designed Reviewer chain — or it might not. The only way to know is to measure. The evaluation harness developed in Section 4 provides the infrastructure for making these comparisons systematic.
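A composed pipeline might look like the sketch below, where a Jury's median feeds a Reviewer-style critic. The stub backends and the `balance_test_passed` field are hypothetical stand-ins for real LLM calls and real artifacts; only the composition structure is the point.

```python
import statistics

# Stub backends: each maps artifacts to a confidence score in [0, 1].
def model_a(artifacts): return 0.8
def model_b(artifacts): return 0.6
def model_c(artifacts): return 0.7

def jury(artifacts):
    """Parallel: three independent backends, aggregated by median."""
    return statistics.median(m(artifacts) for m in (model_a, model_b, model_c))

def reviewer(artifacts, prior_score):
    """Sequential critic: cap confidence when a key diagnostic failed."""
    if artifacts.get("balance_test_passed") is False:
        return min(prior_score, 0.4)
    return prior_score

def evaluate(artifacts):
    """Jury feeds Reviewer: the building blocks compose freely."""
    return reviewer(artifacts, jury(artifacts))
```

Whether this particular composition beats a single Judge is exactly the kind of empirical question the harness in Section 4 exists to answer.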
4. The evaluation harness
The evaluation harness is the infrastructure for systematically validating an agentic evaluation system — analogous to a testing harness in software engineering. It provides the mechanism for answering empirical questions about evaluation architectures. Does adding a second LLM pass actually improve scores? Does switching from a Judge to a Jury reduce variance? Without a harness, these questions remain matters of intuition rather than evidence.
Assess and improve
The harness supports two modes of use that must be kept strictly separate.
In Assess mode, the harness measures the system’s current performance. It runs the evaluation pipeline across a bank of test artifacts, records scores and reasoning traces, and produces a scorecard. The harness does not change the system during assessment — Assess mode is read-only.
In Improve mode, the team acts on what Assess mode revealed. This might mean modifying prompts, refining rubrics, tightening output schemas, or switching to a different evaluation architecture. The critical constraint is that every change is validated on held-out artifacts — artifacts that were not part of the diagnosis. Fixing the system against the same artifacts that revealed a problem overfits the fix to specific cases, the same mistake made when tuning a model against its test set.
Assess and Improve are not concepts separate from the harness; they are the discipline for using it. Assess mode is running the harness; Improve mode is acting on what the harness reveals.
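An Assess-mode run can be sketched as a loop over a test bank that records scores and traces without mutating anything. The `run_assess` function and the stub pipeline below are hypothetical illustrations, not the harness's actual API.

```python
def run_assess(pipeline, test_bank):
    """Assess mode: run the pipeline over a bank of test artifacts and
    record scores and reasoning traces. Read-only: nothing is modified."""
    scorecard = []
    for case in test_bank:
        score, trace = pipeline(case["artifacts"])
        scorecard.append({
            "case_id": case["id"],
            "score": score,
            "trace": trace,
            "expected": case.get("expected"),  # set for external-validity cases
        })
    return scorecard

# A stub pipeline standing in for the full evaluation engine.
def stub_pipeline(artifacts):
    score = 0.9 if artifacts.get("clean") else 0.3
    return score, f"scored on clean={artifacts.get('clean')}"

bank = [
    {"id": "known-clean", "artifacts": {"clean": True}, "expected": "high"},
    {"id": "known-flaw", "artifacts": {"clean": False}, "expected": "low"},
]
card = run_assess(stub_pipeline, bank)
```

Improve mode would then change prompts or architecture and re-run `run_assess` on held-out cases that were not in the diagnostic bank.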
Internal validity tests
Internal validity tests check whether the system behaves coherently under variation in its own components, without requiring ground truth.
Run-to-run stability evaluates whether the same artifact under the same configuration produces the same score across repeated evaluations. Instability here undermines the Reproducibility pillar directly — a system that gives different answers each time cannot be meaningfully tested against any benchmark.
Prompt sensitivity checks whether semantically equivalent prompts produce consistent scores. If rephrasing a rubric question changes the confidence assessment, the evaluation task is not well-defined enough for the LLM to interpret reliably.
Backend sensitivity compares scores for the same artifact across different LLM backends. High divergence indicates that the evaluation result depends more on which model is running than on what the evidence shows — a signal that the rubric needs tightening or that a parallel (Jury) architecture may be warranted.
Score distribution examines whether the system uses the full scoring range or clusters around a narrow band. A system that assigns 0.7 to everything is not discriminating between strong and weak evidence.
Internal validity is the precondition for external validity. A system that produces unstable scores cannot be meaningfully tested against ground truth.
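A run-to-run stability check is simple to state in code. The sketch below assumes an evaluator is any artifacts-to-score callable and uses an illustrative spread tolerance; both the function name and the threshold are choices, not fixed by the lecture.

```python
def run_to_run_stability(evaluate, artifacts, runs=5, tolerance=0.05):
    """Score the same artifact repeatedly under the same configuration.

    A spread (max minus min) above `tolerance` is a direct failure of
    the Reproducibility pillar.
    """
    scores = [evaluate(artifacts) for _ in range(runs)]
    spread = max(scores) - min(scores)
    return {"scores": scores, "spread": spread, "stable": spread <= tolerance}

# A deterministic stub passes trivially; a real LLM backend may not.
report = run_to_run_stability(lambda a: 0.72, {"method": "experiment"})
```

The same skeleton covers prompt sensitivity (vary the prompt, hold the artifact fixed) and backend sensitivity (vary the backend) by changing what is held constant across runs.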
External validity tests
External validity tests check whether the system gets the right answer, using synthetic artifacts with known properties.
Known-flaw detection presents the system with artifacts that contain deliberate defects — poor covariate balance, high attrition, violated parallel trends. The system must flag these defects and lower its confidence assessment accordingly. Failure to detect known flaws indicates a gap in the rubric or a limitation of the LLM’s interpretive capability.
Known-clean scoring presents the system with a well-powered design that has clean diagnostics across all dimensions. The system must score it highly without inventing concerns that the artifacts do not support. Over-penalization of clean evidence is as problematic as under-penalization of flawed evidence.
Severity calibration tests whether scores decrease monotonically as flaw severity increases. An artifact with moderate attrition should score lower than one with no attrition, and higher than one with severe attrition. Monotonicity failures indicate that the system’s scoring is not properly anchored to the underlying evidence quality.
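Severity calibration reduces to a monotonicity check over a graded bank of artifacts. In the sketch below, the `attrition` field and the stub evaluator are hypothetical; the check itself requires each score to be strictly lower than the one for the next-less-severe artifact.

```python
def check_severity_calibration(evaluate, artifacts_by_severity):
    """Check that scores fall as flaw severity rises.

    `artifacts_by_severity` is ordered least to most severe; every
    score must be strictly lower than the one before it.
    """
    scores = [evaluate(a) for a in artifacts_by_severity]
    monotone = all(a > b for a, b in zip(scores, scores[1:]))
    return scores, monotone

# Stub evaluator keyed on a hypothetical attrition field.
def stub(a):
    return {0.0: 0.9, 0.1: 0.6, 0.3: 0.3}[a["attrition"]]

bank = [{"attrition": 0.0}, {"attrition": 0.1}, {"attrition": 0.3}]
scores, ok = check_severity_calibration(stub, bank)
```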
5. Design patterns
The principles above define what the evaluation system must guarantee. This section examines how those guarantees are implemented. We first trace the data flow through the pipeline — what enters, what transforms, what exits — and then map each stage to the design pattern that implements it.
Data flow
The architecture diagram below shows the full evaluation pipeline:
Reading top to bottom, Measurement Results from the Measure Impact stage enter the system. Each result includes a job manifest identifying the causal method used (experiment, difference-in-differences, synthetic control) along with the corresponding diagnostic artifacts.
The Evaluation Router reads the method type from the manifest and dispatches the artifacts to the correct method-specific reviewer. An experiment and a synthetic control study require different diagnostic checks, so the routing decision determines which expertise the LLM brings to the evaluation.
Inside the Evaluation Engine, the Prompt Builder assembles the evaluation prompt. It renders a versioned template with the artifact data injected into designated slots and appends domain knowledge documents that encode design principles, common pitfalls, and diagnostic standards for the relevant method.
The rendered prompt is sent to one or more LLM Backends through the method-specific agents. Each agent (Experiment, Diff-in-Diff, Synth. Control) carries its own prompt templates and knowledge files but shares the same orchestration logic. The LLM produces a structured response with per-dimension scores and justifications.
The Results Builder parses the LLM’s output into typed objects — per-dimension scores linked to justifications, an overall confidence score, and the raw response preserved for audit. The final Evaluation Result exits the pipeline as a machine-readable assessment ready for downstream consumption.
From flow to implementation
Each stage in the data flow maps to a design pattern that enforces a specific pillar:
| Pipeline Stage | Design Pattern | Pillar Enforced |
|---|---|---|
| Evaluation Router | Registry + Dispatch | Correctness — each method gets its own specialized reviewer |
| Prompt Builder | Prompt Engineering as Software | Reproducibility — fixed templates, injected knowledge |
| Method-specific agents | Layered Specialization | Correctness — method-specific expertise with uniform interface |
| Results Builder | Structured Output | Groundedness — every score is linked to a named dimension |
The following subsections examine each pattern in detail.
Registry + dispatch
A system that supports multiple causal methods — experiments, matching, synthetic control — needs to route each evaluation to the handler that understands that method’s diagnostics. Hardcoding this routing creates fragile code that must be modified every time a new method is added.
The registry pattern solves this by separating what handlers exist from how they are selected. Each method-specific reviewer registers itself with a central registry, keyed by the method name. At evaluation time, the system reads the method identifier from the measurement output and dispatches to the correct reviewer automatically. Adding support for a new methodology means implementing a new reviewer and registering it — the dispatch logic remains unchanged.
Each method gets a reviewer trained on its own diagnostics rather than a generic handler that might misinterpret method-specific artifacts. The same dispatch mechanism extends to other variation points in the system, such as routing to different LLM backends based on configuration.
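A minimal registry-and-dispatch sketch looks like the following. The names (`REVIEWERS`, `register`, `dispatch`) and the reviewer classes are illustrative, not the Impact Engine's actual identifiers; the pattern is the decorator-based registration and the manifest-driven lookup.

```python
# Central registry mapping method names to reviewer classes.
REVIEWERS: dict[str, type] = {}

def register(method: str):
    """Class decorator: file a reviewer under its method name."""
    def wrap(cls):
        REVIEWERS[method] = cls
        return cls
    return wrap

@register("experiment")
class ExperimentReviewer:
    def evaluate(self, artifacts: dict) -> str:
        return f"experiment review of {artifacts['job_id']}"

@register("diff_in_diff")
class DiffInDiffReviewer:
    def evaluate(self, artifacts: dict) -> str:
        return f"diff-in-diff review of {artifacts['job_id']}"

def dispatch(manifest: dict, artifacts: dict) -> str:
    """Read the method from the manifest and route to its reviewer."""
    try:
        reviewer_cls = REVIEWERS[manifest["method"]]
    except KeyError:
        raise ValueError(f"no reviewer registered for {manifest['method']!r}")
    return reviewer_cls().evaluate(artifacts)
```

Adding synthetic control support would mean writing one new class with `@register("synthetic_control")`; `dispatch` never changes.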
Prompt engineering as software
Prompts written inline as strings become unmaintainable — they mix concerns, lack versioning, and cannot be reviewed like code. Treating prompts as versioned artifacts with explicit metadata solves all three problems.
Each prompt template declares the dimensions it scores and the structure of the expected response. A template engine injects artifact data at render time, separating what to evaluate from how to evaluate. The same template version, given the same artifact, produces the same prompt — regardless of when or where the evaluation runs.
Knowledge injection complements the template. Domain expertise files — documents encoding design principles, common pitfalls, and diagnostic standards for each causal method — are loaded from disk and inserted into the prompt alongside the artifact data. This grounds the LLM’s assessment in documented domain knowledge rather than relying solely on its training data. Updating the knowledge base does not require changing the prompt template, and updating the template does not require changing the knowledge base. The two evolve independently.
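A versioned template with knowledge injection can be sketched with the standard library's `string.Template`. The template text, slot names, and knowledge content below are hypothetical; the pattern is that the same template version plus the same artifact yields the same prompt, byte for byte.

```python
import string

# Hypothetical versioned template; slot names are illustrative.
TEMPLATE_V2 = string.Template(
    "## Evaluation rubric v2\n"
    "Score each dimension in [0, 1] with a justification.\n\n"
    "### Domain knowledge\n$knowledge\n\n"
    "### Measurement artifacts\n$artifacts\n"
)

def render_prompt(template: string.Template, artifacts: dict, knowledge: str) -> str:
    """Inject artifact data and knowledge into fixed slots.

    Sorting the artifact keys keeps the rendered prompt deterministic.
    """
    return template.substitute(
        knowledge=knowledge,
        artifacts="\n".join(f"- {k}: {v}" for k, v in sorted(artifacts.items())),
    )

prompt = render_prompt(
    TEMPLATE_V2,
    {"balance_p_value": 0.64, "attrition": 0.02},
    "RCTs: check covariate balance and differential attrition.",
)
```

Swapping the knowledge string for a file loaded from disk separates the two concerns: the template can be versioned in source control while the knowledge base evolves independently.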
Layered specialization
Multiple method reviewers share common orchestration logic — load the artifact, render the prompt, call the LLM, parse the response — but differ in method-specific details. Each reviewer needs its own prompt template, its own knowledge files, and its own confidence range. Duplicating the orchestration code across reviewers creates maintenance burden and inconsistency.
An abstract base class defines the interface that all reviewers must satisfy. Concrete subclasses supply only the method-specific details — where to find the prompt template, which knowledge files to inject, what confidence range to assign. The orchestration layer operates against the interface, unaware of which concrete class it is using.
Every reviewer, regardless of method, produces the same structured output with per-dimension scores and justifications. Adding a new methodology means implementing one subclass with the method-specific details — the orchestration logic remains unchanged.
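The base-class structure can be sketched as follows. The class names, attribute names, and file paths are hypothetical, and a real implementation would render an actual template and call a real backend; the pattern is that orchestration lives once in the base class while subclasses supply only method-specific details.

```python
from abc import ABC, abstractmethod

class BaseReviewer(ABC):
    """Shared orchestration; subclasses supply method-specific details."""

    @property
    @abstractmethod
    def template_path(self) -> str: ...

    @property
    @abstractmethod
    def knowledge_files(self) -> list[str]: ...

    def review(self, artifacts: dict) -> dict:
        # Orchestration is identical for every method:
        # render the prompt, call the LLM, parse the response.
        prompt = f"[{self.template_path}] artifacts={sorted(artifacts)}"
        raw = self._call_llm(prompt)
        return {"raw": raw, "knowledge": self.knowledge_files}

    def _call_llm(self, prompt: str) -> str:
        return f"assessment for: {prompt}"  # stand-in for a real backend

class ExperimentReviewer(BaseReviewer):
    template_path = "prompts/experiment_v3.md"
    knowledge_files = ["knowledge/rct_pitfalls.md"]

class DiffInDiffReviewer(BaseReviewer):
    template_path = "prompts/did_v2.md"
    knowledge_files = ["knowledge/parallel_trends.md"]
```

The orchestration layer calls `review()` against the `BaseReviewer` interface without knowing which concrete class it holds.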
Structured output
LLMs produce free-form text, but downstream systems need typed, machine-readable data. Constraining the LLM’s output to a structured schema and parsing the response into typed objects bridges this gap.
The output flows through a constrain → parse → validate cycle. The prompt specifies the exact output format — per-dimension blocks with score and justification fields. The LLM backend enforces a response schema so the output is machine-parseable. The parsed response is assembled into a typed result object with per-dimension scores, justifications, an overall confidence score, and the raw LLM response preserved for audit.
Every confidence score is linked to a named dimension with a justification that references specific artifact values. There is no overall score without the dimension-level evidence that supports it. The system preserves the raw LLM response alongside the parsed result, providing a complete audit trail from the final score back to the exact text the LLM produced.
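The constrain → parse → validate cycle can be sketched with plain dataclasses over a JSON response. The schema fields below are illustrative, assuming the backend has been constrained to emit JSON; the pattern is typed parsing, range validation, and preservation of the raw response for audit.

```python
import json
from dataclasses import dataclass

@dataclass
class DimensionScore:
    dimension: str
    score: float
    justification: str

@dataclass
class EvaluationResult:
    dimensions: list[DimensionScore]
    overall: float
    raw_response: str  # preserved verbatim for the audit trail

def parse_response(raw: str) -> EvaluationResult:
    """Parse the schema-constrained LLM output and validate it."""
    payload = json.loads(raw)
    dims = [DimensionScore(**d) for d in payload["dimensions"]]
    for d in dims:
        if not 0.0 <= d.score <= 1.0:
            raise ValueError(f"{d.dimension}: score {d.score} out of range")
        if not d.justification.strip():
            raise ValueError(f"{d.dimension}: missing justification")
    return EvaluationResult(dims, payload["overall"], raw_response=raw)

raw = json.dumps({
    "dimensions": [
        {"dimension": "randomization_integrity", "score": 0.9,
         "justification": "balance test p=0.64, no imbalance flagged"},
    ],
    "overall": 0.85,
})
result = parse_response(raw)
```

Validation rejects scores outside [0, 1] and empty justifications, so every parsed result that survives carries dimension-level evidence plus the exact text the LLM produced.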
Additional resources
Anthropic (2024). Building Effective Agents. Anthropic Blog.
Gamma, E., Helm, R., Johnson, R. & Vlissides, J. (1994). Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley.
Grace, J. & Kwatra, S. (2025). Evals API Use-case: Tools Evaluation. OpenAI Cookbook.
Kwatra, S., Wimberly, H., Marker, J. & Siegel, E. (2025). Eval-Driven System Design: From Prototype to Production. OpenAI Cookbook.
Shankar, S., Zamfirescu-Pereira, J.D., Hartmann, B. & Parameswaran, A.G. (2024). Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (UIST), 1–20.
Zheng, L. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 46595–46623.