# Design

## Motivation

Upstream pipeline stages produce structured artifacts — point estimates, confidence intervals, model diagnostics — that require expert judgement to interpret. Is the effect estimate plausible? Is the model type appropriate for the data? Are the diagnostics healthy?

The evaluate package provides a general-purpose agentic review layer that accepts any job directory conforming to the manifest convention and produces structured, auditable review judgements. A lightweight deterministic scorer is included for debugging, testing, and illustration — it assigns a confidence band based on methodology type alone, without examining the content of the results.

```
Artifacts ──► Review strategy ──► per-dimension scores + justifications
         └──► Score strategy  ──► confidence score (0–1)   [debug/test]
```

---

## Architecture

```
┌─────────────────────────────────────────────────────┐
│                    ReviewEngine                     │
│                                                     │
│  ┌──────────┐  ┌──────────────┐  ┌─────────────┐    │
│  │ Backend  │  │PromptRegistry│  │KnowledgeBase│    │
│  │ Registry │  │ + Renderer   │  │ (optional)  │    │
│  └────┬─────┘  └──────┬───────┘  └─────┬───────┘    │
│       │               │                │            │
│       ▼               ▼                ▼            │
│  ┌─────────┐   ┌─────────────┐   ┌────────────┐     │
│  │Anthropic│   │ YAML/Jinja  │   │  Static    │     │
│  │ OpenAI  │   │ Templates   │   │  Markdown  │     │
│  │ LiteLLM │   └─────────────┘   │  Files     │     │
│  └─────────┘                     └────────────┘     │
└─────────────────────────────────────────────────────┘
                          │
                          ▼
                   ReviewResult
                   ├── dimensions[]   (name, score, justification)
                   ├── overall_score
                   └── raw_response   (audit trail)
```

---

## Components

### Symmetric `Evaluate` adapter

The `Evaluate` pipeline component uses symmetric strategy dispatch. Both strategies share the **same flow** — only the confidence source differs:

```
manifest → reviewer → scorer_event → [confidence source] → EvaluateResult → write → return
```

1. `evaluate_strategy` (from `manifest.json`) controls *how* to compute confidence (score vs review).
2. `model_type` selects the `MethodReviewer` (single source of truth for confidence range, prompt templates, knowledge, artifact loading).

Both strategies construct the same `EvaluateResult`, write `evaluate_result.json` to the job directory, and return the same 8-key output dict for downstream allocation. The manifest is treated as read-only.

Each strategy also writes its own strategy-specific result file:

- Score: `score_result.json` (`ScoreResult` — confidence + audit fields)
- Review: `review_result.json` (`ReviewResult` — dimensions + justifications)

`MethodReviewer` provides a default `load_artifact()` implementation (reads all manifest files, extracts `sample_size` from JSON). Subclasses override only when they need method-specific loading.

### Score subsystem

| File | Role |
|------|------|
| `models.py` | `EvaluateResult` dataclass (shared stage output) |
| `score/scorer.py` | `ScoreResult` dataclass + `score_confidence()` — seeded by `initiative_id` |
| `job_reader.py` | `load_scorer_event()` — reads `impact_results.json` and builds a flat scorer event dict |

### Review subsystem

| File | Role |
|------|------|
| `review/models.py` | Data models: `ReviewResult`, `ReviewDimension`, `ReviewResponse`, `ArtifactPayload`, `PromptSpec` |
| `review/engine.py` | `ReviewEngine` — orchestrates a single review: load prompt, render, call `litellm.completion()` with structured output |
| `review/api.py` | Public `review()` function — end-to-end review of a job directory |
| `review/manifest.py` | `Manifest` dataclass + `load_manifest()` (read-only) |
| `review/methods/base.py` | `MethodReviewer` base (default `load_artifact`) + `MethodReviewerRegistry` |
| `review/methods/experiment/` | Experiment (RCT) reviewer with prompt templates and knowledge |
| `config.py` | `ReviewConfig` — loads from YAML, dict, or env vars |

### LLM backend

The review engine calls `litellm.completion()` directly with a Pydantic `response_format` (`ReviewResponse`), producing structured JSON that maps directly to dimension scores and an overall score. LiteLLM wraps 100+ providers, so any model supported by LiteLLM can be used by setting the `model` field in config.

---

## Registry pattern

Method reviewers use decorator-based registration:

```python
@MethodReviewerRegistry.register("experiment")
class ExperimentReviewer(MethodReviewer):
    ...
```

This allows extension without modifying package code.

| Dimension | ABC | Registry | What it provides |
|-----------|-----|----------|------------------|
| **Method** | `MethodReviewer` | `MethodReviewerRegistry` | *What* to ask + how to read artifacts + domain knowledge |

---

## Data flow

### Pipeline context

The orchestrator pipeline flows through four stages:

```
MEASURE ──► EVALUATE ──► ALLOCATE ──► SCALE
```

The orchestrator passes a job directory reference to `Evaluate.execute()`:

| Field | Type | Description |
|-------|------|-------------|
| `job_dir` | str | Path to the job directory containing `manifest.json` |
| `cost_to_scale` | float | Optional override for cost from the orchestrator |

### Scorer event contract

`load_scorer_event()` reads flat top-level keys from `impact_results.json`:

```json
{
  "effect_estimate": 10.0,
  "ci_lower": 5.0,
  "ci_upper": 15.0,
  "sample_size": 50,
  "cost_to_scale": 100.0
}
```

### Score output

```python
@dataclass
class ScoreResult:
    initiative_id: str
    confidence: float                       # deterministic draw
    confidence_range: tuple[float, float]   # bounds used
```
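A minimal sketch of the deterministic draw, assuming `score_confidence()` hashes `initiative_id` into a `numpy` seed and samples uniformly within the method's band. The band values, hashing scheme, and function signature here are illustrative; the real confidence range is owned by the `MethodReviewer`:

```python
import hashlib
from dataclasses import dataclass

import numpy as np

# Illustrative bands; the real per-method ranges live on the MethodReviewer.
CONFIDENCE_BANDS = {"experiment": (0.7, 0.9)}


@dataclass
class ScoreResult:
    initiative_id: str
    confidence: float
    confidence_range: tuple[float, float]


def score_confidence(initiative_id: str, model_type: str) -> ScoreResult:
    low, high = CONFIDENCE_BANDS[model_type]
    # Hash the id into a stable integer seed so repeated runs agree.
    seed = int.from_bytes(hashlib.sha256(initiative_id.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return ScoreResult(
        initiative_id=initiative_id,
        confidence=float(rng.uniform(low, high)),
        confidence_range=(low, high),
    )
```

Deriving the seed from the id rather than from global state is what keeps the draw reproducible across runs and machines, which is the property tests rely on.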
### Review input

The `ArtifactPayload` envelope:

```python
@dataclass
class ArtifactPayload:
    initiative_id: str
    artifact_text: str   # serialized upstream results
    model_type: str      # methodology label
    sample_size: int
    metadata: dict       # additional context
```

### Review output

```python
@dataclass
class ReviewResult:
    initiative_id: str
    prompt_name: str                    # which template was used
    prompt_version: str
    backend_name: str                   # which LLM backend
    model: str                          # which model
    dimensions: list[ReviewDimension]   # per-axis scores
    overall_score: float                # aggregated (mean of dimensions)
    raw_response: str                   # full LLM output for audit
    timestamp: str                      # ISO-8601
```

---

## Prompt template contract

Templates are YAML files with Jinja2 content:

```yaml
name: experiment_review
version: "1.0"
description: "Review experimental impact measurement results"

dimensions:
  - randomization_integrity
  - specification_adequacy
  - statistical_inference
  - threats_to_validity
  - effect_size_plausibility

system: |
  You are a methodological reviewer...
  {{ knowledge_context }}

user: |
  {{ artifact }}
  Model type: {{ model_type }}
```

The engine uses LiteLLM's `response_format` with a Pydantic model (`ReviewResponse`) to get structured JSON output directly from the LLM. The response maps to dimension scores and an overall score without any text parsing.
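A condensed sketch of that call path under the template contract above. The field definitions on `ReviewResponse`, the helper shape, and the model id are illustrative stand-ins, not the engine's actual internals:

```python
import litellm
import yaml
from jinja2 import Template
from pydantic import BaseModel


class ReviewDimension(BaseModel):
    name: str
    score: float
    justification: str


class ReviewResponse(BaseModel):
    dimensions: list[ReviewDimension]
    overall_score: float


def run_review(template_path: str, artifact: str, model_type: str, knowledge: str) -> ReviewResponse:
    with open(template_path) as f:
        spec = yaml.safe_load(f)
    # Render the Jinja2 system/user blocks with the template variables.
    system = Template(spec["system"]).render(knowledge_context=knowledge)
    user = Template(spec["user"]).render(artifact=artifact, model_type=model_type)
    resp = litellm.completion(
        model="gpt-4o",  # any LiteLLM-supported model id works here
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        response_format=ReviewResponse,  # Pydantic model → structured JSON output
    )
    # The provider returns JSON conforming to the schema; validate it back
    # into the Pydantic model rather than parsing free text.
    return ReviewResponse.model_validate_json(resp.choices[0].message.content)
```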
---

## Manifest convention

The `manifest.json` format is a shared convention (not owned by any single package):

```json
{
  "schema_version": "2.0",
  "model_type": "experiment",
  "evaluate_strategy": "review",
  "created_at": "2025-06-01T12:00:00+00:00",
  "files": {
    "impact_results": {"path": "impact_results.json", "format": "json"}
  }
}
```

The evaluate stage treats the manifest as **read-only**. Output files are written to the job directory by convention (fixed filenames), not registered in the manifest:

```
job-impact-engine-XXXX/
├── manifest.json           # read-only (created by the producer)
├── impact_results.json     # upstream output
├── evaluate_result.json    # written by evaluate (both strategies)
├── score_result.json       # written by evaluate (score strategy only)
└── review_result.json      # written by evaluate (review strategy only)
```

---

## Dependency strategy

| Component | Core dependency |
|-----------|-----------------|
| Scorer, models | `numpy` |
| LLM completions | `litellm` |
| Template rendering | `jinja2` |
| Config / prompt loading | `pyyaml` |

All review dependencies (`litellm`, `jinja2`, `pyyaml`) are core requirements in `pyproject.toml`.

---

## Method reviewer packages

Each method reviewer is a self-contained subpackage:

```
review/methods/experiment/
├── __init__.py
├── reviewer.py                 # @register("experiment") class
├── templates/
│   └── experiment_review.yaml
└── knowledge/
    ├── design.md
    ├── diagnostics.md
    └── pitfalls.md
```

The experiment reviewer evaluates five dimensions:

| Dimension | What it checks |
|-----------|----------------|
| `randomization_integrity` | Covariate balance between treatment and control |
| `specification_adequacy` | OLS formula, covariates, functional form |
| `statistical_inference` | CIs, p-values, F-statistic, multiple testing |
| `threats_to_validity` | Attrition, non-compliance, spillover, SUTVA |
| `effect_size_plausibility` | Whether the treatment effect is realistic |
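Extending to a new methodology means shipping one more subpackage of this shape. A hypothetical quasi-experiment reviewer, with the import path assumed from the layout above:

```python
# Import path assumed from the package layout shown above.
from review.methods.base import MethodReviewer, MethodReviewerRegistry


@MethodReviewerRegistry.register("quasi_experiment")  # hypothetical method label
class QuasiExperimentReviewer(MethodReviewer):
    # Inherits the base class's default load_artifact(); the subpackage would
    # ship its own templates/*.yaml and knowledge/*.md next to this class,
    # mirroring experiment/.
    pass
```

Registration happens at import time, so the new method becomes selectable via `model_type` in the manifest without touching existing package code.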