
Deterministic Confidence Scoring

This tutorial demonstrates the deterministic evaluation path — a lightweight scorer included for debugging, testing, and illustration. It assigns a reproducible confidence score to a causal estimate based on the measurement methodology, without calling an LLM.

Workflow overview

  1. Create a mock job directory

  2. Score directly with score_confidence()

  3. Score via evaluate_confidence()

  4. Verify reproducibility

  5. Compare confidence ranges across methods

Initial Setup

[1]:
import json
import tempfile
from pathlib import Path

from notebook_support import print_result_summary

Step 1 — Create a mock job directory

An upstream producer writes a job directory containing manifest.json (metadata and file references) and impact_results.json (the producer’s output). Here we create one manually to illustrate the convention.

[2]:
tmp = tempfile.mkdtemp(prefix="job-impact-engine-")
job_dir = Path(tmp)

manifest = {
    "model_type": "experiment",
    "evaluate_strategy": "score",
    "created_at": "2025-06-01T12:00:00+00:00",
    "files": {
        "impact_results": {"path": "impact_results.json", "format": "json"},
    },
}

impact_results = {
    "effect_estimate": 10.0,
    "ci_lower": 5.0,
    "ci_upper": 15.0,
    "cost_to_scale": 100.0,
    "sample_size": 500,
}

(job_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
(job_dir / "impact_results.json").write_text(json.dumps(impact_results, indent=2))

print(f"Job directory: {job_dir}")
print(f"Files: {[p.name for p in job_dir.iterdir()]}")
Job directory: /tmp/job-impact-engine-s3_33vib
Files: ['manifest.json', 'impact_results.json']

Step 2 — Score directly with score_confidence()

score_confidence() is a pure function useful for debugging and testing. It takes an initiative_id string and a confidence range, hashes the ID to seed an RNG, and draws a reproducible confidence value. The confidence range comes from the method reviewer (an experiment uses (0.85, 1.0) because RCTs produce the strongest evidence).

[3]:
from impact_engine_evaluate.score import score_confidence

result = score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0))
lo, hi = result.confidence_range
print(f"Initiative:  {result.initiative_id}")
print(f"Confidence:  {result.confidence:.4f}  (range {lo:.2f}–{hi:.2f})")
Initiative:  initiative-demo-001
Confidence:  0.9258  (range 0.85–1.00)

The confidence score falls within (0.85, 1.0) because we passed the experiment confidence range. The score is deterministic: scoring the same initiative_id always yields the same value.
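The hash-to-seed mechanism can be sketched in a few lines. This is a simplified illustration, not the library's actual implementation — the real hash function and seed derivation may differ:

```python
import hashlib
import random

def deterministic_confidence(initiative_id: str, lo: float, hi: float) -> float:
    """Hash the ID to a stable 64-bit seed, then draw uniformly in [lo, hi]."""
    digest = hashlib.sha256(initiative_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")  # stable across runs and machines
    return random.Random(seed).uniform(lo, hi)
```

Note the use of a cryptographic hash rather than Python's built-in `hash()`: the latter is salted per process (see `PYTHONHASHSEED`), so it would break reproducibility across runs.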

Step 3 — Score via evaluate_confidence()

In the full pipeline, the orchestrator calls evaluate_confidence() — the package-level entry point. It reads the manifest, looks up the registered method reviewer for model_type, and dispatches on evaluate_strategy. The result is an EvaluateResult dataclass.

[4]:
from dataclasses import asdict

from impact_engine_evaluate import evaluate_confidence

result = asdict(evaluate_confidence(config=None, job_dir=str(job_dir)))

print_result_summary(result)
Initiative:  job-impact-engine-s3_33vib
Strategy:    score
Confidence:  0.9369  (range 0.85–1.00)
Report:      Confidence drawn uniformly between 0.85 and 1.00

evaluate_confidence() returns an EvaluateResult with the same fields, shown here as a five-key dict via asdict(). It automatically read manifest.json, found evaluate_strategy: "score", and applied the experiment reviewer’s confidence range (0.85, 1.0).

Step 4 — Verify reproducibility

The deterministic scorer hashes initiative_id to seed a random number generator. The same ID always produces the same confidence score, regardless of when or where the code runs.

[5]:
scores = [
    score_confidence("initiative-demo-001", confidence_range=(0.85, 1.0)).confidence
    for _ in range(5)
]

print(f"Scores across 5 calls: {scores}")
assert len(set(scores)) == 1, "Scores should be identical"
print("All scores are identical — deterministic scoring is reproducible.")
Scores across 5 calls: [0.9258105981321656, 0.9258105981321656, 0.9258105981321656, 0.9258105981321656, 0.9258105981321656]
All scores are identical — deterministic scoring is reproducible.

Step 5 — Compare confidence ranges across methods

Different measurement methodologies get different confidence ranges. An RCT deserves higher confidence than a weaker design. The MethodReviewerRegistry exposes the confidence map for all registered methods.

[6]:
from impact_engine_evaluate.review.methods import MethodReviewerRegistry

print("Registered methods and confidence ranges:")
for name, bounds in sorted(MethodReviewerRegistry.confidence_map().items()):
    print(f"  {name}: ({bounds[0]:.2f}, {bounds[1]:.2f})")
Registered methods and confidence ranges:
  experiment: (0.85, 1.00)
  quasi_experimental: (0.60, 0.85)
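Looking up the range by method and then drawing the score is the core of the pipeline. A minimal sketch of such a lookup, using a plain dict that mirrors the registry output above (the map and helper here are hypothetical, not the library's API):

```python
# Hypothetical confidence map mirroring MethodReviewerRegistry.confidence_map().
METHOD_CONFIDENCE = {
    "experiment": (0.85, 1.00),
    "quasi_experimental": (0.60, 0.85),
}

def range_for(method: str) -> tuple[float, float]:
    """Return the (lo, hi) confidence bounds for a measurement methodology."""
    try:
        return METHOD_CONFIDENCE[method]
    except KeyError:
        raise ValueError(f"Unregistered method: {method!r}")
```

The ordering of the ranges encodes the evidence hierarchy: an RCT (experiment) is always scored at least as confidently as a quasi-experimental design.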
[7]:
import shutil

# Clean up
shutil.rmtree(job_dir, ignore_errors=True)