Design

Motivation

Upstream pipeline stages produce structured artifacts — point estimates, confidence intervals, model diagnostics — that require expert judgement to interpret. Is the effect estimate plausible? Is the model type appropriate for the data? Are the diagnostics healthy?

The evaluate package provides a general-purpose agentic review layer that accepts any job directory conforming to the manifest convention and produces structured, auditable review judgements. A lightweight deterministic scorer is included for debugging, testing, and illustration: it draws a confidence score from a band determined by methodology type alone, without examining the content of the results.

Artifacts ──► Review strategy   ──► per-dimension scores + justifications
          └──► Score strategy   ──► confidence score (0–1)  [debug/test]

Architecture

┌─────────────────────────────────────────────────────────┐
│                      ReviewEngine                       │
│                                                         │
│  ┌──────────┐   ┌────────────────┐   ┌───────────────┐  │
│  │ Backend  │   │ PromptRegistry │   │ KnowledgeBase │  │
│  │ Registry │   │  + Renderer    │   │  (optional)   │  │
│  └────┬─────┘   └───────┬────────┘   └───────┬───────┘  │
│       │                 │                    │          │
│       ▼                 ▼                    ▼          │
│  ┌─────────┐     ┌─────────────┐      ┌────────────┐    │
│  │Anthropic│     │  YAML/Jinja │      │   Static   │    │
│  │ OpenAI  │     │  Templates  │      │  Markdown  │    │
│  │ LiteLLM │     └─────────────┘      │   Files    │    │
│  └─────────┘                          └────────────┘    │
└─────────────────────────────────────────────────────────┘
         │
         ▼
   ReviewResult
   ├── dimensions[]  (name, score, justification)
   ├── overall_score
   └── raw_response  (audit trail)

Components

Symmetric Evaluate adapter

The Evaluate pipeline component uses symmetric strategy dispatch. Both strategies share the same flow — only the confidence source differs:

manifest → reviewer → scorer_event → [confidence source] → EvaluateResult → write → return

  1. evaluate_strategy (from manifest.json) controls how to compute confidence (score vs review).

  2. model_type selects the MethodReviewer (single source of truth for confidence range, prompt templates, knowledge, artifact loading).

Both strategies construct the same EvaluateResult, write evaluate_result.json to the job directory, and return the same 8-key output dict for downstream allocation. The manifest is treated as read-only.

Each strategy also writes its own strategy-specific result file:

  • Score: score_result.json (ScoreResult — confidence + audit fields)

  • Review: review_result.json (ReviewResult — dimensions + justifications)
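
Putting those pieces together, a minimal sketch of the adapter's execute() flow might look like the following; run_score, run_review, build_evaluate_result, write_json, as_output_dict, and the registry lookup method are hypothetical names, while the manifest fields, loaders, and result files come from the design above:

def execute(job_dir: str, cost_to_scale: float | None = None) -> dict:
    manifest = load_manifest(job_dir)                          # read-only
    reviewer = MethodReviewerRegistry.get(manifest.model_type) # lookup name is an assumption
    event = load_scorer_event(job_dir)

    if manifest.evaluate_strategy == "score":
        confidence = run_score(reviewer, event, job_dir)       # writes score_result.json
    else:
        confidence = run_review(reviewer, event, job_dir)      # writes review_result.json

    result = build_evaluate_result(event, confidence, cost_to_scale)  # same EvaluateResult either way
    write_json(job_dir, "evaluate_result.json", result)
    return result.as_output_dict()                             # 8-key dict consumed by allocate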

MethodReviewer provides a default load_artifact() implementation (reads all manifest files, extracts sample_size from JSON). Subclasses override only when they need method-specific loading.
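
A rough sketch of that default behaviour, assuming the file-entry layout shown under Manifest convention below and using the job directory name as a stand-in initiative id (both are assumptions; ArtifactPayload is defined under Review input):

import json
from pathlib import Path

def load_artifact(job_dir: str, manifest) -> ArtifactPayload:
    parts, sample_size = [], 0
    for name, entry in manifest.files.items():                 # every artifact listed in the manifest
        text = (Path(job_dir) / entry["path"]).read_text()
        parts.append(f"## {name}\n{text}")
        if entry.get("format") == "json":                      # pull sample_size out of JSON artifacts
            sample_size = json.loads(text).get("sample_size", sample_size)
    return ArtifactPayload(
        initiative_id=Path(job_dir).name,                      # assumption: directory name as id
        artifact_text="\n\n".join(parts),
        model_type=manifest.model_type,
        sample_size=sample_size,
        metadata={},
    )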

File              Role
models.py         EvaluateResult dataclass (shared stage output)
score/scorer.py   ScoreResult dataclass + score_confidence() — seeded by initiative_id
job_reader.py     load_scorer_event() — reads impact_results.json and builds a flat scorer event dict

Review subsystem

File                         Role
review/models.py             Data models: ReviewResult, ReviewDimension, ReviewResponse, ArtifactPayload, PromptSpec
review/engine.py             ReviewEngine — orchestrates a single review: load prompt, render, call litellm.completion() with structured output
review/api.py                Public review() function — end-to-end review of a job directory
review/manifest.py           Manifest dataclass + load_manifest() (read-only)
review/methods/base.py       MethodReviewer base (default load_artifact) + MethodReviewerRegistry
review/methods/experiment/   Experiment (RCT) reviewer with prompt templates and knowledge
config.py                    ReviewConfig — loads from YAML, dict, or env vars

LLM backend

The review engine calls litellm.completion() directly with a Pydantic response_format (ReviewResponse), producing structured JSON that maps directly to dimension scores and an overall score. LiteLLM wraps 100+ providers, so any model supported by LiteLLM can be used by setting the model field in config.
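
For illustration, a call in that style could look like the sketch below; the model id and message contents are placeholders, and the exact ReviewResponse fields are inferred from the ReviewResult shape described later:

import litellm
from pydantic import BaseModel

class ReviewDimension(BaseModel):         # shape taken from ReviewResult; exact schema may differ
    name: str
    score: float
    justification: str

class ReviewResponse(BaseModel):
    dimensions: list[ReviewDimension]
    overall_score: float

response = litellm.completion(
    model="openai/gpt-4o",                # placeholder; any LiteLLM-supported model id
    messages=[
        {"role": "system", "content": "You are a methodological reviewer..."},
        {"role": "user", "content": "Model type: experiment\n<serialized artifacts>"},
    ],
    response_format=ReviewResponse,       # Pydantic model requests structured JSON output
)
review = ReviewResponse.model_validate_json(response.choices[0].message.content)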


Registry pattern

Method reviewers use decorator-based registration:

@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer): ...

This allows extension without modifying package code.

Dimension   ABC              Registry                 What it provides
Method      MethodReviewer   MethodReviewerRegistry   What to ask + how to read artifacts + domain knowledge
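
A minimal version of the registry behind that decorator could look like this sketch (the real class may carry more metadata or validation):

class MethodReviewerRegistry:
    _reviewers: dict[str, type] = {}

    @classmethod
    def register(cls, model_type: str):
        """Class decorator mapping a model_type string to a reviewer class."""
        def decorator(reviewer_cls):
            cls._reviewers[model_type] = reviewer_cls
            return reviewer_cls
        return decorator

    @classmethod
    def get(cls, model_type: str):
        return cls._reviewers[model_type]   # raises KeyError for unknown model types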


Data flow

Pipeline context

The orchestrator pipeline flows:

MEASURE ──► EVALUATE ──► ALLOCATE ──► SCALE

The orchestrator passes a job directory reference to Evaluate.execute():

Field           Type    Description
job_dir         str     Path to the job directory containing manifest.json
cost_to_scale   float   Optional override for cost from the orchestrator
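
An orchestrator-side call could therefore look roughly like this; the instantiation style and the values are illustrative:

result = Evaluate().execute(
    job_dir="jobs/job-impact-engine-0001",   # hypothetical job directory
    cost_to_scale=250.0,                     # optional orchestrator override
)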

Scorer event contract

load_scorer_event() reads flat top-level keys from impact_results.json:

{
  "ci_upper": 15.0,
  "effect_estimate": 10.0,
  "ci_lower": 5.0,
  "cost_to_scale": 100.0,
  "sample_size": 50
}
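
A sketch of the loader under that contract, keeping only the flat keys shown above (error handling omitted):

import json
from pathlib import Path

SCORER_EVENT_KEYS = ("effect_estimate", "ci_lower", "ci_upper", "cost_to_scale", "sample_size")

def load_scorer_event(job_dir: str) -> dict:
    data = json.loads((Path(job_dir) / "impact_results.json").read_text())
    return {key: data[key] for key in SCORER_EVENT_KEYS if key in data}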

Score output

@dataclass
class ScoreResult:
    initiative_id: str
    confidence: float              # deterministic draw
    confidence_range: tuple[float, float]  # bounds used
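
One way to produce such a deterministic draw, as an illustrative sketch (the seeding scheme and the band passed in are assumptions, not the package's exact implementation):

import hashlib
import numpy as np

def score_confidence(initiative_id: str, confidence_range: tuple[float, float]) -> float:
    # Stable 32-bit seed derived from the initiative id, so repeated runs agree.
    seed = int.from_bytes(hashlib.sha256(initiative_id.encode()).digest()[:4], "big")
    rng = np.random.default_rng(seed)
    return float(rng.uniform(*confidence_range))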

Review input

The ArtifactPayload envelope:

@dataclass
class ArtifactPayload:
    initiative_id: str
    artifact_text: str       # serialized upstream results
    model_type: str          # methodology label
    sample_size: int
    metadata: dict           # additional context

Review output

@dataclass
class ReviewResult:
    initiative_id: str
    prompt_name: str         # which template was used
    prompt_version: str
    backend_name: str        # which LLM backend
    model: str               # which model
    dimensions: list[ReviewDimension]  # per-axis scores
    overall_score: float     # aggregated (mean of dimensions)
    raw_response: str        # full LLM output for audit
    timestamp: str           # ISO-8601

Prompt template contract

Templates are YAML files with Jinja2 content:

name: experiment_review
version: "1.0"
description: "Review experimental impact measurement results"
dimensions:
  - randomization_integrity
  - specification_adequacy
  - statistical_inference
  - threats_to_validity
  - effect_size_plausibility

system: |
  You are a methodological reviewer...
  {{ knowledge_context }}

user: |
  {{ artifact }}
  Model type: {{ model_type }}

The engine uses LiteLLM’s response_format with a Pydantic model (ReviewResponse) to get structured JSON output directly from the LLM. The response maps to dimension scores and an overall score without any text parsing.
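
In code, loading and rendering such a template before the completion call might look like this sketch (the path and the substituted strings are placeholders):

from pathlib import Path

import yaml
from jinja2 import Template

spec = yaml.safe_load(Path("templates/experiment_review.yaml").read_text())
system_prompt = Template(spec["system"]).render(
    knowledge_context="<knowledge markdown appended here>",
)
user_prompt = Template(spec["user"]).render(
    artifact="<serialized upstream results>",
    model_type="experiment",
)
# system_prompt / user_prompt then become the messages passed to litellm.completion()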


Manifest convention

The manifest.json format is a shared convention (not owned by any single package):

{
  "schema_version": "2.0",
  "model_type": "experiment",
  "evaluate_strategy": "review",
  "created_at": "2025-06-01T12:00:00+00:00",
  "files": {
    "impact_results": {"path": "impact_results.json", "format": "json"}
  }
}

The evaluate stage treats the manifest as read-only. Output files are written to the job directory by convention (fixed filenames), not registered in the manifest:

job-impact-engine-XXXX/
├── manifest.json          # read-only (created by the producer)
├── impact_results.json    # upstream output
├── evaluate_result.json   # written by evaluate (both strategies)
├── score_result.json      # written by evaluate (score strategy only)
└── review_result.json     # written by evaluate (review strategy only)
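
A sketch of the manifest loader, with the dataclass frozen to reinforce the read-only convention; the field set simply mirrors the example above, and the real Manifest may carry more:

import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class Manifest:
    schema_version: str
    model_type: str
    evaluate_strategy: str
    created_at: str
    files: dict

def load_manifest(job_dir: str) -> Manifest:
    data = json.loads((Path(job_dir) / "manifest.json").read_text())
    return Manifest(**data)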

Dependency strategy

Component                 Core dependency
Scorer, models            numpy
LLM completions           litellm
Template rendering        jinja2
Config / prompt loading   pyyaml

All review dependencies (litellm, jinja2, pyyaml) are core requirements in pyproject.toml.


Method reviewer packages

Each method reviewer is a self-contained subpackage:

review/methods/experiment/
├── __init__.py
├── reviewer.py              # @register("experiment") class
├── templates/
│   └── experiment_review.yaml
└── knowledge/
    ├── design.md
    ├── diagnostics.md
    └── pitfalls.md
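
In outline, the reviewer class may only need to point at those assets; the attribute names and the confidence band below are illustrative, not taken from the package:

@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer):
    prompt_name = "experiment_review"                            # templates/experiment_review.yaml
    knowledge_files = ("design.md", "diagnostics.md", "pitfalls.md")
    confidence_range = (0.6, 0.9)                                # made-up band, for illustration only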

The experiment reviewer evaluates five dimensions:

Dimension                  What it checks
randomization_integrity    Covariate balance between treatment and control
specification_adequacy     OLS formula, covariates, functional form
statistical_inference      CIs, p-values, F-statistic, multiple testing
threats_to_validity        Attrition, non-compliance, spillover, SUTVA
effect_size_plausibility   Whether the treatment effect is realistic