# Design

## Motivation
Upstream pipeline stages produce structured artifacts — point estimates, confidence intervals, model diagnostics — that require expert judgement to interpret. Is the effect estimate plausible? Is the model type appropriate for the data? Are the diagnostics healthy?
The evaluate package provides a general-purpose agentic review layer that accepts any job directory conforming to the manifest convention, producing structured, auditable review judgements. A lightweight deterministic scorer is included for debugging, testing, and illustration — it assigns a confidence band based on methodology type alone without examining the content of the results.
```
Artifacts ──► Review strategy ──► per-dimension scores + justifications
    └─────► Score strategy  ──► confidence score (0–1)  [debug/test]
```
## Architecture
```
┌─────────────────────────────────────────────────────────┐
│                      ReviewEngine                       │
│                                                         │
│  ┌────────────┐  ┌────────────────┐  ┌───────────────┐  │
│  │  Backend   │  │ PromptRegistry │  │ KnowledgeBase │  │
│  │  Registry  │  │   + Renderer   │  │  (optional)   │  │
│  └─────┬──────┘  └───────┬────────┘  └───────┬───────┘  │
│        │                 │                   │          │
│        ▼                 ▼                   ▼          │
│  ┌────────────┐  ┌────────────────┐  ┌───────────────┐  │
│  │ Anthropic  │  │  YAML/Jinja    │  │    Static     │  │
│  │  OpenAI    │  │  Templates     │  │   Markdown    │  │
│  │  LiteLLM   │  └────────────────┘  │    Files      │  │
│  └────────────┘                      └───────────────┘  │
└─────────────────────────────────────────────────────────┘
                          │
                          ▼
                   ReviewResult
                   ├── dimensions[]   (name, score, justification)
                   ├── overall_score
                   └── raw_response   (audit trail)
```
## Components

### Symmetric Evaluate adapter

The Evaluate pipeline component uses symmetric strategy dispatch. Both strategies share the same flow — only the confidence source differs:
```
manifest → reviewer → scorer_event → [confidence source] → EvaluateResult → write → return
```

`evaluate_strategy` (from `manifest.json`) controls how to compute confidence (score vs review). `model_type` selects the `MethodReviewer` (single source of truth for confidence range, prompt templates, knowledge, artifact loading).
Both strategies construct the same `EvaluateResult`, write `evaluate_result.json` to the job directory, and return the same 8-key output dict for downstream allocation. The manifest is treated as read-only.

Each strategy also writes its own strategy-specific result file:

- Score: `score_result.json` (`ScoreResult` — confidence + audit fields)
- Review: `review_result.json` (`ReviewResult` — dimensions + justifications)
`MethodReviewer` provides a default `load_artifact()` implementation (reads all manifest files, extracts `sample_size` from JSON). Subclasses override only when they need method-specific loading.
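A minimal sketch of that default loader, assuming a `load_artifact(job_dir, manifest)` signature; the argument names and return shape here are illustrative, not the package's actual API:

```python
import json
from abc import ABC
from pathlib import Path

class MethodReviewer(ABC):
    """Sketch of the reviewer base class; only the default loader is shown."""

    def load_artifact(self, job_dir: str, manifest: dict) -> dict:
        # Default behaviour per the design: read every file listed in the
        # manifest, and pull sample_size out of any JSON payload carrying it.
        texts: dict[str, str] = {}
        sample_size = None
        for name, entry in manifest["files"].items():
            text = (Path(job_dir) / entry["path"]).read_text()
            texts[name] = text
            if entry.get("format") == "json":
                sample_size = json.loads(text).get("sample_size", sample_size)
        return {"texts": texts, "sample_size": sample_size}
```

A method-specific subclass would override this only when its artifacts need bespoke parsing.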
| File | Role |
|---|---|
|  |  |
|  |  |
|  |  |
### Review subsystem
| File | Role |
|---|---|
|  | Data models: |
|  |  |
|  | Public |
|  |  |
|  |  |
|  | Experiment (RCT) reviewer with prompt templates and knowledge |
|  |  |
### LLM backend
The review engine calls `litellm.completion()` directly with a Pydantic `response_format` (`ReviewResponse`), producing structured JSON that maps directly to dimension scores and an overall score. LiteLLM wraps 100+ providers, so any model supported by LiteLLM can be used by setting the `model` field in config.
### Registry pattern
Method reviewers use decorator-based registration:

```python
@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer): ...
```
This allows extension without modifying package code.
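The registry itself fits in a few lines. A sketch, where the `get()` lookup helper is a hypothetical name for whatever accessor the package actually exposes:

```python
class MethodReviewerRegistry:
    """Decorator-based registry mapping model_type -> reviewer class (sketch)."""

    _reviewers: dict[str, type] = {}

    @classmethod
    def register(cls, model_type: str):
        def decorator(reviewer_cls: type) -> type:
            cls._reviewers[model_type] = reviewer_cls
            return reviewer_cls  # class is returned unchanged
        return decorator

    @classmethod
    def get(cls, model_type: str) -> type:
        # Hypothetical accessor; raises KeyError for unregistered types
        return cls._reviewers[model_type]
```

Because registration happens at import time, adding a new methodology is just a new subpackage that decorates its reviewer class.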
| Dimension | ABC | Registry | What it provides |
|---|---|---|---|
| Method | `MethodReviewer` | `MethodReviewerRegistry` | What to ask + how to read artifacts + domain knowledge |
## Data flow

### Pipeline context
The orchestrator pipeline flows:
```
MEASURE ──► EVALUATE ──► ALLOCATE ──► SCALE
```

The orchestrator passes a job directory reference to `Evaluate.execute()`:
| Field | Type | Description |
|---|---|---|
|  | `str` | Path to the job directory containing |
|  | `float` | Optional override for cost from the orchestrator |
### Scorer event contract

`load_scorer_event()` reads flat top-level keys from `impact_results.json`:

```json
{
  "effect_estimate": 10.0,
  "ci_lower": 5.0,
  "ci_upper": 15.0,
  "cost_to_scale": 100.0,
  "sample_size": 50
}
```
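A sketch of the loader under that contract; the explicit missing-key check is an assumption about how errors might be surfaced, not the package's documented behaviour:

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("effect_estimate", "ci_lower", "ci_upper",
                 "cost_to_scale", "sample_size")

def load_scorer_event(job_dir: str) -> dict:
    """Read the flat top-level keys from impact_results.json (sketch)."""
    data = json.loads((Path(job_dir) / "impact_results.json").read_text())
    missing = [k for k in REQUIRED_KEYS if k not in data]
    if missing:
        raise KeyError(f"impact_results.json missing keys: {missing}")
    # Only the contracted keys are passed on; extra keys are ignored
    return {k: data[k] for k in REQUIRED_KEYS}
```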
### Score output

```python
@dataclass
class ScoreResult:
    initiative_id: str
    confidence: float                      # deterministic draw
    confidence_range: tuple[float, float]  # bounds used
```
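How the deterministic draw might be produced: a sketch that seeds the RNG on the initiative id so repeated runs reproduce the same score. Both the seeding scheme and the `score_initiative` name are assumptions:

```python
import random
from dataclasses import dataclass

@dataclass
class ScoreResult:
    initiative_id: str
    confidence: float                      # deterministic draw
    confidence_range: tuple[float, float]  # bounds used

def score_initiative(initiative_id: str,
                     confidence_range: tuple[float, float]) -> ScoreResult:
    # Seeding on the initiative id makes the draw reproducible across runs
    # (assumption: the real scorer's seeding scheme may differ).
    lo, hi = confidence_range
    rng = random.Random(initiative_id)
    return ScoreResult(initiative_id, lo + (hi - lo) * rng.random(),
                       confidence_range)
```

The range itself comes from the `MethodReviewer` for the manifest's `model_type`, which is why the scorer never inspects the result content.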
### Review input

The `ArtifactPayload` envelope:

```python
@dataclass
class ArtifactPayload:
    initiative_id: str
    artifact_text: str  # serialized upstream results
    model_type: str     # methodology label
    sample_size: int
    metadata: dict      # additional context
```
### Review output

```python
@dataclass
class ReviewResult:
    initiative_id: str
    prompt_name: str                   # which template was used
    prompt_version: str
    backend_name: str                  # which LLM backend
    model: str                         # which model
    dimensions: list[ReviewDimension]  # per-axis scores
    overall_score: float               # aggregated (mean of dimensions)
    raw_response: str                  # full LLM output for audit
    timestamp: str                     # ISO-8601
```
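The aggregation noted on `overall_score` is a plain mean over the per-dimension scores, which can be sketched as:

```python
from dataclasses import dataclass
from statistics import fmean

@dataclass
class ReviewDimension:
    name: str
    score: float
    justification: str

def aggregate(dimensions: list[ReviewDimension]) -> float:
    # overall_score is the mean of the per-dimension scores
    return fmean(d.score for d in dimensions)
```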
### Prompt template contract

Templates are YAML files with Jinja2 content:

```yaml
name: experiment_review
version: "1.0"
description: "Review experimental impact measurement results"
dimensions:
  - randomization_integrity
  - specification_adequacy
  - statistical_inference
  - threats_to_validity
  - effect_size_plausibility
system: |
  You are a methodological reviewer...
  {{ knowledge_context }}
user: |
  {{ artifact }}
  Model type: {{ model_type }}
```
The engine uses LiteLLM's `response_format` with a Pydantic model (`ReviewResponse`) to get structured JSON output directly from the LLM. The response maps to dimension scores and an overall score without any text parsing.
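The rendered `system` and `user` strings become a standard OpenAI-style chat payload. `build_messages` is a hypothetical helper illustrating the shape handed to `litellm.completion()`:

```python
def build_messages(system_prompt: str, user_prompt: str) -> list[dict]:
    """Assemble the OpenAI-style chat payload passed to the backend (sketch)."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```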
## Manifest convention

The `manifest.json` format is a shared convention (not owned by any single package):

```json
{
  "schema_version": "2.0",
  "model_type": "experiment",
  "evaluate_strategy": "review",
  "created_at": "2025-06-01T12:00:00+00:00",
  "files": {
    "impact_results": {"path": "impact_results.json", "format": "json"}
  }
}
```
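A sketch of how a consumer might resolve the manifest's file table to paths without mutating it; `resolve_files` is a hypothetical helper, and the schema-version guard is an assumption:

```python
import json
from pathlib import Path

def resolve_files(job_dir: str) -> dict[str, Path]:
    """Resolve the manifest's file table to paths (read-only sketch)."""
    manifest = json.loads((Path(job_dir) / "manifest.json").read_text())
    if manifest.get("schema_version") != "2.0":
        raise ValueError(f"unsupported schema_version: {manifest.get('schema_version')!r}")
    # The manifest is never written back; we only read the file table
    return {name: Path(job_dir) / entry["path"]
            for name, entry in manifest["files"].items()}
```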
The evaluate stage treats the manifest as read-only. Output files are written to the job directory by convention (fixed filenames), not registered in the manifest:
```
job-impact-engine-XXXX/
├── manifest.json          # read-only (created by the producer)
├── impact_results.json    # upstream output
├── evaluate_result.json   # written by evaluate (both strategies)
├── score_result.json      # written by evaluate (score strategy only)
└── review_result.json     # written by evaluate (review strategy only)
```
## Dependency strategy

| Component | Core dependency |
|---|---|
| Scorer, models |  |
| LLM completions | `litellm` |
| Template rendering | `jinja2` |
| Config / prompt loading | `pyyaml` |
All review dependencies (`litellm`, `jinja2`, `pyyaml`) are core requirements in `pyproject.toml`.
## Method reviewer packages
Each method reviewer is a self-contained subpackage:
```
review/methods/experiment/
├── __init__.py
├── reviewer.py            # @register("experiment") class
├── templates/
│   └── experiment_review.yaml
└── knowledge/
    ├── design.md
    ├── diagnostics.md
    └── pitfalls.md
```
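One plausible way the static markdown becomes the `{{ knowledge_context }}` template variable is simple concatenation; this is a sketch under that assumption, not the package's actual loader:

```python
from pathlib import Path

def load_knowledge(knowledge_dir: str) -> str:
    """Concatenate the static markdown files into one knowledge string (sketch)."""
    # Sorted glob gives a stable order: design.md, diagnostics.md, pitfalls.md
    paths = sorted(Path(knowledge_dir).glob("*.md"))
    return "\n\n".join(p.read_text() for p in paths)
```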
The experiment reviewer evaluates five dimensions:
| Dimension | What it checks |
|---|---|
| `randomization_integrity` | Covariate balance between treatment and control |
| `specification_adequacy` | OLS formula, covariates, functional form |
| `statistical_inference` | CIs, p-values, F-statistic, multiple testing |
| `threats_to_validity` | Attrition, non-compliance, spillover, SUTVA |
| `effect_size_plausibility` | Whether the treatment effect is realistic |