# Usage

## Overview

Impact Engine Evaluate scores causal effect estimates for reliability. It reads a job directory conforming to the manifest convention and assigns a confidence score that penalizes downstream return estimates in the allocator.

The package provides two evaluation strategies. **Agentic review** sends the measurement artifacts to an LLM for structured, per-dimension evaluation. **Deterministic scoring** is a lightweight alternative for debugging, testing, and illustration: it draws a reproducible confidence score from a methodology-specific range without calling an LLM. Both strategies return the same 8-key output dict, making them interchangeable from the allocator's perspective.

---

## Deterministic scoring (debug / test)

The deterministic path is useful for debugging, testing, and illustrating the pipeline without an LLM. It requires no external dependencies and assigns confidence based on the measurement methodology alone, without examining the content of the results.

```python
from impact_engine_evaluate import score_initiative

event = {
    "initiative_id": "initiative-abc-123",
    "model_type": "experiment",
    "effect_estimate": 10.0,
    "ci_lower": 5.0,
    "ci_upper": 15.0,
    "cost_to_scale": 100.0,
    "sample_size": 500,
}

result = score_initiative(event, confidence_range=(0.85, 1.0))
```

`score_initiative()` is a pure function. It hashes the `initiative_id` to seed a random number generator, then draws a confidence value uniformly within the given range. The same `initiative_id` always produces the same score.

The `confidence_range` is declared by each registered method reviewer. An experiment (RCT) uses `(0.85, 1.0)` because randomized designs produce the strongest causal evidence. A less rigorous methodology would declare a lower range.

The returned `EvaluateResult` contains five fields:

| Field | Description |
|-------|-------------|
| `initiative_id` | Initiative identifier |
| `confidence` | Confidence score (0.0–1.0) |
| `confidence_range` | `(lower, upper)` bounds from the method reviewer |
| `strategy` | Strategy that produced the result (`"score"` or `"review"`) |
| `report` | Descriptive string summarising the score |

---

## Agentic review

The agentic path sends the actual measurement artifacts to an LLM and parses a structured review with per-dimension scores and justifications. It requires an LLM backend SDK and an API key.

```python
from impact_engine_evaluate import evaluate_confidence

result = evaluate_confidence("review_config.yaml", "path/to/job-impact-engine-XXXX/")
```

`evaluate_confidence()` performs the following steps:

1. **Read manifest.** Loads `manifest.json` from the job directory to determine the `model_type` and locate artifact files.
2. **Select method reviewer.** Dispatches on `model_type` to a registered `MethodReviewer` (e.g. `"experiment"` selects `ExperimentReviewer`).
3. **Load artifact.** The reviewer reads all files listed in the manifest and serializes them into an `ArtifactPayload`.
4. **Load prompt and knowledge.** The reviewer provides its own prompt template (YAML with Jinja2) and domain knowledge files (Markdown).
5. **Run review.** The `ReviewEngine` renders the prompt, calls the backend, and parses the response into per-dimension scores.
6. **Write results.** Saves `review_result.json` to the job directory.

The returned `ReviewResult` contains per-dimension scores, an overall score (the mean), and the raw LLM response for audit. See [Configuration](configuration.md) for backend setup.
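For orientation, here is a minimal sketch of inspecting the returned `ReviewResult`. The attribute names `dimension_scores` and `raw_response` are assumptions for illustration; only `overall_score` is referenced elsewhere in these docs, so check the API reference for the actual field names.

```python
from impact_engine_evaluate import evaluate_confidence

result = evaluate_confidence("review_config.yaml", "path/to/job-impact-engine-XXXX/")

# Per-dimension scores from the structured LLM review
# (attribute name is an assumption, not a documented API).
for dimension, score in result.dimension_scores.items():
    print(f"{dimension}: {score}")

# Overall score is the mean of the per-dimension scores.
print(f"overall: {result.overall_score:.2f}")

# Raw LLM response retained for audit (assumed attribute name).
print(result.raw_response[:200])
```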
---

## Orchestrator integration

Within the full pipeline, the orchestrator calls `Evaluate.execute()` rather than invoking `evaluate_confidence()` directly. The adapter reads the manifest, dispatches on `evaluate_strategy`, and returns the common output dict.

```python
from impact_engine_evaluate import Evaluate

evaluator = Evaluate(config="review_config.yaml")
result = evaluator.execute({
    "job_dir": "path/to/job-impact-engine-XXXX/",
})
```

The `evaluate_strategy` field in `manifest.json` controls the path:

| Strategy | Behavior |
|----------|----------|
| `"review"` | Runs the full LLM review pipeline |
| `"score"` | Lightweight deterministic scorer for debugging and testing |

Both strategies produce the same output dict, so the downstream allocator does not need to know which path was used. When the review path runs, the `confidence` value is the LLM-derived `overall_score` from the review rather than a draw from the confidence range.

---

## Pipeline context

The orchestrator pipeline flows through four stages:

```
MEASURE ──► EVALUATE ──► ALLOCATE ──► SCALE
```

The upstream stage writes a job directory with `manifest.json` and `impact_results.json`. The evaluate stage reads that directory, scores it, and passes the result to the allocator. Low confidence pulls returns toward worst-case scenarios, making the allocator conservative where evidence is weak and aggressive where evidence is strong.
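The blending rule below is an illustrative assumption, not the allocator's documented formula; it shows one way a confidence score could pull a return estimate toward its worst-case bound.

```python
def confidence_weighted_return(effect_estimate: float,
                               ci_lower: float,
                               confidence: float) -> float:
    """Blend a point estimate toward its worst-case (CI lower) bound.

    Illustrative sketch only; the real allocator's penalty may differ.
    """
    return confidence * effect_estimate + (1.0 - confidence) * ci_lower


# With the example event above: full confidence keeps the point estimate,
# lower confidence shrinks it toward the worst case.
print(confidence_weighted_return(10.0, 5.0, confidence=1.0))   # 10.0
print(confidence_weighted_return(10.0, 5.0, confidence=0.85))  # 9.25
```

Under a rule like this, confidence acts as a shrinkage weight: an RCT scored near 1.0 keeps nearly its full estimated effect, while weak evidence is discounted toward the conservative bound.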