Deterministic Confidence Scoring
This tutorial demonstrates the deterministic evaluation path — a lightweight scorer included for debugging, testing, and illustration. It assigns a reproducible confidence score to a causal estimate based on the measurement methodology, without calling an LLM.
Workflow overview

- Create a mock job directory with manifest.json and impact_results.json
- Score directly with score_confidence()
- Score via the evaluate_confidence() entry point
- Verify reproducibility across calls
- Compare confidence ranges across methods
Initial Setup
[1]:
import json
import tempfile
from pathlib import Path
from notebook_support import print_result_summary
Step 1 — Create a mock job directory
An upstream producer writes a job directory containing manifest.json (metadata and file references) and impact_results.json (the producer’s output). Here we create one manually to illustrate the convention.
[2]:
tmp = tempfile.mkdtemp(prefix="job-impact-engine-")
job_dir = Path(tmp)
manifest = {
"model_type": "experiment",
"evaluate_strategy": "score",
"created_at": "2025-06-01T12:00:00+00:00",
"files": {
"impact_results": {"path": "impact_results.json", "format": "json"},
},
}
impact_results = {
"ci_upper": 15.0,
"effect_estimate": 10.0,
"ci_lower": 5.0,
"cost_to_scale": 100.0,
"sample_size": 500,
}
(job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
(job_dir / "impact_results.json").write_text(json.dumps(impact_results, indent=2))
print(f"Job directory: {job_dir}")
print(f"Files: {[p.name for p in job_dir.iterdir()]}")
Job directory: /tmp/job-impact-engine-s3_33vib
Files: ['manifest.json', 'impact_results.json']
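Before handing a job directory to the scorer, it can be useful to check that every file the manifest references actually exists on disk. The helper below is a hypothetical sketch (check_job_dir is not part of the package) that follows the manifest convention shown above:

```python
import json
import tempfile
from pathlib import Path

def check_job_dir(job_dir: Path) -> list[str]:
    """Hypothetical helper: list manifest file references missing on disk."""
    manifest = json.loads((job_dir / "manifest.json").read_text())
    return [
        key
        for key, ref in manifest.get("files", {}).items()
        if not (job_dir / ref["path"]).exists()
    ]

# Demo: a directory whose manifest references a file that is not yet written.
demo = Path(tempfile.mkdtemp(prefix="job-check-"))
(demo / "manifest.json").write_text(json.dumps({
    "files": {"impact_results": {"path": "impact_results.json", "format": "json"}},
}))
print(check_job_dir(demo))  # ['impact_results'] — referenced file missing
(demo / "impact_results.json").write_text("{}")
print(check_job_dir(demo))  # [] — all references resolve
```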
Step 2 — Score directly with score_confidence()
score_confidence() is a pure function useful for debugging and testing. It takes an initiative_id string and a confidence range, hashes the ID to seed an RNG, and draws a reproducible confidence value. The confidence range comes from the method reviewer (an experiment uses (0.85, 1.0) because RCTs produce the strongest evidence).
[3]:
from impact_engine_evaluate.score import score_confidence
result = score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0))
lo, hi = result.confidence_range
print(f"Initiative: {result.initiative_id}")
print(f"Confidence: {result.confidence:.4f} (range {lo:.2f}–{hi:.2f})")
Initiative: initiative-demo-001
Confidence: 0.9258 (range 0.85–1.00)
The confidence score falls within (0.85, 1.0) because we specified the experiment confidence range. The score is deterministic: running the same initiative_id always produces the same value.
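The mechanism behind this determinism can be sketched in a few lines. This is an illustration of the technique, not the package's actual implementation (the real hashing scheme may differ): a stable digest of the ID seeds an RNG, which then draws uniformly from the range.

```python
import hashlib
import random

def deterministic_confidence(initiative_id: str, bounds=(0.85, 1.0)) -> float:
    # A stable digest (unlike Python's built-in hash(), which is salted
    # per process) makes the seed, and therefore the draw, reproducible
    # across runs and machines.
    seed = int.from_bytes(hashlib.sha256(initiative_id.encode()).digest()[:8], "big")
    return random.Random(seed).uniform(*bounds)

print(deterministic_confidence("initiative-demo-001"))
print(deterministic_confidence("initiative-demo-001"))  # identical value
```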
Step 3 — Score via evaluate_confidence()
In the full pipeline, the orchestrator calls evaluate_confidence() — the package-level entry point. It reads the manifest, looks up the registered method reviewer for model_type, and dispatches on evaluate_strategy. The result is an EvaluateResult dataclass.
[4]:
from dataclasses import asdict
from impact_engine_evaluate import evaluate_confidence
result = asdict(evaluate_confidence(config=None, job_dir=str(job_dir)))
print_result_summary(result)
Initiative: job-impact-engine-s3_33vib
Strategy: score
Confidence: 0.9369 (range 0.85–1.00)
Report: Confidence drawn uniformly between 0.85 and 1.00
evaluate_confidence() returned the full EvaluateResult, converted here to a plain dict with asdict(). It automatically read manifest.json, found evaluate_strategy: "score", and used the experiment reviewer’s confidence range (0.85, 1.0).
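The dispatch logic can be approximated as follows. This is a simplified, hypothetical sketch (the real package resolves the reviewer through its registry, not a plain dict): read the manifest, map model_type to a confidence range, and return it alongside the strategy.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical map mirroring the registered reviewers shown in this tutorial.
CONFIDENCE_MAP = {"experiment": (0.85, 1.0), "quasi_experimental": (0.60, 0.85)}

def dispatch(job_dir: str) -> tuple[str, tuple[float, float]]:
    """Sketch: return (strategy, confidence bounds) for a job's manifest."""
    manifest = json.loads((Path(job_dir) / "manifest.json").read_text())
    return manifest["evaluate_strategy"], CONFIDENCE_MAP[manifest["model_type"]]

# Demo against a minimal manifest.
demo = Path(tempfile.mkdtemp(prefix="job-dispatch-"))
(demo / "manifest.json").write_text(json.dumps({
    "model_type": "experiment",
    "evaluate_strategy": "score",
}))
print(dispatch(str(demo)))  # ('score', (0.85, 1.0))
```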
Step 4 — Verify reproducibility
The deterministic scorer hashes initiative_id to seed a random number generator. The same ID always produces the same confidence score, regardless of when or where the code runs.
[5]:
scores = [
    score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0)).confidence
    for _ in range(5)
]
print(f"Scores across 5 calls: {scores}")
assert len(set(scores)) == 1, "Scores should be identical"
print("All scores are identical — deterministic scoring is reproducible.")
Scores across 5 calls: [0.9258105981321656, 0.9258105981321656, 0.9258105981321656, 0.9258105981321656, 0.9258105981321656]
All scores are identical — deterministic scoring is reproducible.
Step 5 — Compare confidence ranges across methods
Different measurement methodologies get different confidence ranges. An RCT deserves higher confidence than a weaker design. The MethodReviewerRegistry exposes the confidence map for all registered methods.
[6]:
from impact_engine_evaluate.review.methods import MethodReviewerRegistry
print("Registered methods and confidence ranges:")
for name, bounds in sorted(MethodReviewerRegistry.confidence_map().items()):
    print(f"  {name}: ({bounds[0]:.2f}, {bounds[1]:.2f})")
Registered methods and confidence ranges:
experiment: (0.85, 1.00)
quasi_experimental: (0.60, 0.85)
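The registry pattern behind MethodReviewerRegistry can be sketched as a class that accumulates named ranges and exposes them as a map. This is a hypothetical, simplified illustration (ReviewerRegistry and its methods are not the package's API):

```python
class ReviewerRegistry:
    """Sketch of a method-reviewer registry keyed by methodology name."""
    _ranges: dict = {}

    @classmethod
    def register(cls, name: str, lo: float, hi: float) -> None:
        cls._ranges[name] = (lo, hi)

    @classmethod
    def confidence_map(cls) -> dict:
        # Return a copy so callers cannot mutate the registry in place.
        return dict(cls._ranges)

ReviewerRegistry.register("experiment", 0.85, 1.00)
ReviewerRegistry.register("quasi_experimental", 0.60, 0.85)
print(ReviewerRegistry.confidence_map())
```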
[7]:
import shutil
# Clean up
shutil.rmtree(job_dir, ignore_errors=True)