# Usage

## Overview
Impact Engine Evaluate scores causal effect estimates for reliability. It reads a job directory conforming to the manifest convention and assigns a confidence score that penalizes downstream return estimates in the allocator.
The package provides two evaluation strategies. Agentic review sends the measurement artifacts to an LLM for structured, per-dimension evaluation. Deterministic scoring is a lightweight alternative for debugging, testing, and illustration — it draws a reproducible confidence score from a methodology-specific range without calling an LLM. Both strategies return the same 8-key output dict, making them interchangeable from the allocator’s perspective.
## Deterministic scoring (debug / test)
The deterministic path is useful for debugging, testing, and illustrating the pipeline without an LLM. It requires no external dependencies and assigns confidence based on the measurement methodology alone, without examining the content of the results.
```python
from impact_engine_evaluate import score_initiative

event = {
    "initiative_id": "initiative-abc-123",
    "model_type": "experiment",
    "ci_upper": 15.0,
    "effect_estimate": 10.0,
    "ci_lower": 5.0,
    "cost_to_scale": 100.0,
    "sample_size": 500,
}

result = score_initiative(event, confidence_range=(0.85, 1.0))
```
`score_initiative()` is a pure function. It hashes the `initiative_id` to seed a random number generator, then draws a confidence value uniformly within the given range. The same `initiative_id` always produces the same score.
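The hash-seed-draw behaviour described above can be sketched as follows. This is a minimal illustration, not the package's actual implementation; the function name `deterministic_confidence` is an assumption.

```python
import hashlib
import random

def deterministic_confidence(initiative_id: str, confidence_range: tuple) -> float:
    """Reproducible confidence draw seeded by the initiative id (sketch)."""
    # Hash the id to a stable integer seed. hashlib, unlike the built-in
    # hash(), gives the same digest across interpreter runs.
    seed = int(hashlib.sha256(initiative_id.encode()).hexdigest(), 16)
    rng = random.Random(seed)
    low, high = confidence_range
    # Uniform draw within the methodology's declared range.
    return rng.uniform(low, high)
```

Because the seed depends only on the id, calling the function twice with the same id yields the same score, which is what makes the deterministic path useful in tests.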
The `confidence_range` is declared by each registered method reviewer. An experiment (RCT) uses `(0.85, 1.0)` because randomized designs produce the strongest causal evidence. A less rigorous methodology would declare a lower range.
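A per-methodology declaration might look like the mapping below. Only the `"experiment"` range comes from this document; the other keys and ranges are illustrative placeholders, not the package's actual registry.

```python
# Hypothetical registry: each registered method reviewer declares the
# confidence range the deterministic scorer draws from.
METHOD_CONFIDENCE_RANGES = {
    "experiment": (0.85, 1.0),       # RCT: strongest causal evidence
    "quasi_experiment": (0.60, 0.85),  # illustrative placeholder
    "observational": (0.30, 0.60),     # illustrative placeholder
}
```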
The returned `EvaluateResult` contains five fields, including:

- the initiative identifier
- the confidence score (0.0–1.0)
- the strategy that produced the result
- a descriptive string summarising the score
## Agentic review
The agentic path sends the actual measurement artifacts to an LLM and parses a structured review with per-dimension scores and justifications. It requires an LLM backend SDK and an API key.
```python
from impact_engine_evaluate import evaluate_confidence

result = evaluate_confidence("review_config.yaml", "path/to/job-impact-engine-XXXX/")
```
`evaluate_confidence()` performs the following steps:

1. **Read manifest.** Loads `manifest.json` from the job directory to determine the `model_type` and locate artifact files.
2. **Select method reviewer.** Dispatches on `model_type` to a registered `MethodReviewer` (e.g. `"experiment"` selects `ExperimentReviewer`).
3. **Load artifact.** The reviewer reads all files listed in the manifest and serializes them into an `ArtifactPayload`.
4. **Load prompt and knowledge.** The reviewer provides its own prompt template (YAML with Jinja2) and domain knowledge files (Markdown).
5. **Run review.** The `ReviewEngine` renders the prompt, calls the backend, and parses the response into per-dimension scores.
6. **Write results.** Saves `review_result.json` to the job directory.
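The flow above can be sketched as a single function. This is an outline under assumptions, not the package's code: the function name, the `reviewers` registry, and the `engine` interface are all hypothetical stand-ins for the dispatch described in the steps.

```python
import json
from pathlib import Path

def evaluate_confidence_sketch(job_dir, reviewers, engine):
    """Illustrative outline of the agentic review flow (names are assumptions)."""
    job = Path(job_dir)
    # 1. Read the manifest to find the methodology and artifact files.
    manifest = json.loads((job / "manifest.json").read_text())
    # 2. Dispatch on model_type to a registered method reviewer.
    reviewer = reviewers[manifest["model_type"]]
    # 3. Serialize the listed artifact files into a payload.
    payload = reviewer.load_artifact(job, manifest)
    # 4-5. Render the reviewer's prompt and run the backend review.
    result = engine.run(reviewer.prompt_template(), payload)
    # 6. Persist the structured review next to the inputs.
    (job / "review_result.json").write_text(json.dumps(result))
    return result
```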
The returned `ReviewResult` contains per-dimension scores, an overall score (the mean), and the raw LLM response for audit. See Configuration for backend setup.
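The overall score is just the mean of the per-dimension scores, as in this sketch (the dimension names used in the example are hypothetical):

```python
from statistics import mean

def overall_score(dimension_scores: dict) -> float:
    """Overall review score: the mean of the per-dimension scores."""
    return mean(dimension_scores.values())
```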
## Orchestrator integration
Within the full pipeline, the orchestrator calls `Evaluate.execute()` rather than invoking `evaluate_confidence()` directly. The adapter reads the manifest, dispatches on `evaluate_strategy`, and returns the common output dict.
```python
from impact_engine_evaluate import Evaluate

evaluator = Evaluate(config="review_config.yaml")
result = evaluator.execute({
    "job_dir": "path/to/job-impact-engine-XXXX/",
})
```
The `evaluate_strategy` field in `manifest.json` controls the path:

| Strategy | Behavior |
|---|---|
| Agentic review | Runs the full LLM review pipeline |
| Deterministic | Lightweight deterministic scorer for debugging and testing |
Both strategies produce the same output dict, so the downstream allocator does not need to know which path was used. When the review path runs, the confidence value is the LLM-derived `overall_score` from the review rather than a draw from the confidence range.
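The adapter's dispatch can be sketched as below. The function name and the `"deterministic"` strategy literal are assumptions for illustration; the actual manifest values may differ.

```python
def execute_sketch(event, manifest, run_review, score_deterministic):
    """Hypothetical adapter dispatch on evaluate_strategy (sketch)."""
    if manifest.get("evaluate_strategy") == "deterministic":
        # Debug/test path: reproducible draw from the methodology's range.
        confidence = score_deterministic(event)
    else:
        # Review path: confidence is the LLM-derived overall score.
        confidence = run_review(event)
    # Either way the allocator sees the same shape of output.
    return {**event, "confidence": confidence}
```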
## Pipeline context
The orchestrator pipeline flows through four stages:
```
MEASURE ──► EVALUATE ──► ALLOCATE ──► SCALE
```
The upstream stage writes a job directory with `manifest.json` and `impact_results.json`. The evaluate stage reads that directory, scores it, and passes the result to the allocator. Low confidence pulls returns toward worst-case scenarios, making the allocator conservative where evidence is weak and aggressive where evidence is strong.
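One simple way to realise "low confidence pulls returns toward worst-case" is a linear blend between the point estimate and the lower confidence bound. This is an illustrative formula only, not the allocator's actual rule:

```python
def risk_adjusted_return(effect_estimate: float, ci_lower: float,
                         confidence: float) -> float:
    """Blend the point estimate toward the worst case as confidence falls.

    Illustrative only: at confidence 1.0 the allocator would see the full
    effect estimate; at 0.0 it would see the lower confidence bound.
    """
    return confidence * effect_estimate + (1.0 - confidence) * ci_lower
```

With the example event above (`effect_estimate=10.0`, `ci_lower=5.0`), a confidence of 0.5 would blend the two to 7.5 under this rule.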