{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Deterministic Confidence Scoring\n", "\n", "This tutorial demonstrates the deterministic evaluation path — a lightweight\n", "scorer included for debugging, testing, and illustration. It assigns a\n", "reproducible confidence score to a causal estimate based on the measurement\n", "methodology, without calling an LLM.\n", "\n", "## Workflow overview\n", "\n", "1. Create a mock job directory with `manifest.json` and `impact_results.json`\n", "2. Score directly with `score_initiative()`\n", "3. Score via the `Evaluate` adapter\n", "4. Verify reproducibility across calls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Workflow overview\n", "\n", "1. Create a mock job directory\n", "2. Score directly with `score_confidence()`\n", "3. Score via `evaluate_confidence()`\n", "4. Verify reproducibility\n", "5. Compare confidence ranges across methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import tempfile\n", "from pathlib import Path\n", "\n", "from notebook_support import print_result_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 — Create a mock job directory\n", "\n", "An upstream producer writes a job directory containing `manifest.json` (metadata\n", "and file references) and `impact_results.json` (the producer's output). Here\n", "we create one manually to illustrate the convention." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmp = tempfile.mkdtemp(prefix=\"job-impact-engine-\")\n", "job_dir = Path(tmp)\n", "\n", "manifest = {\n", " \"model_type\": \"experiment\",\n", " \"evaluate_strategy\": \"score\",\n", " \"created_at\": \"2025-06-01T12:00:00+00:00\",\n", " \"files\": {\n", " \"impact_results\": {\"path\": \"impact_results.json\", \"format\": \"json\"},\n", " },\n", "}\n", "\n", "impact_results = {\n", " \"ci_upper\": 15.0,\n", " \"effect_estimate\": 10.0,\n", " \"ci_lower\": 5.0,\n", " \"cost_to_scale\": 100.0,\n", " \"sample_size\": 500,\n", "}\n", "\n", "(job_dir / \"manifest.json\").write_text(json.dumps(manifest, indent=2))\n", "(job_dir / \"impact_results.json\").write_text(json.dumps(impact_results, indent=2))\n", "\n", "print(f\"Job directory: {job_dir}\")\n", "print(f\"Files: {[p.name for p in job_dir.iterdir()]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 — Score directly with `score_confidence()`\n", "\n", "`score_confidence()` is a pure function useful for debugging and testing. It\n", "takes an `initiative_id` string and a confidence range, hashes the ID to seed\n", "an RNG, and draws a reproducible confidence value. The confidence range comes\n", "from the method reviewer (an experiment uses `(0.85, 1.0)` because RCTs\n", "produce the strongest evidence)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from impact_engine_evaluate.score import score_confidence\n", "\n", "result = score_confidence(\"initiative-demo-001\", confidence_range=(0.85, 1.0))\n", "lo, hi = result.confidence_range\n", "print(f\"Initiative: {result.initiative_id}\")\n", "print(f\"Confidence: {result.confidence:.4f} (range {lo:.2f}–{hi:.2f})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confidence score falls within `(0.85, 1.0)` because we specified the\n", "experiment confidence range. 
The score is deterministic: scoring the same\n", "`initiative_id` always produces the same value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 — Score via `evaluate_confidence()`\n", "\n", "In the full pipeline, the orchestrator calls `evaluate_confidence()` — the\n", "package-level entry point. It reads the manifest, looks up the registered\n", "method reviewer for `model_type`, and dispatches on `evaluate_strategy`.\n", "The result is an `EvaluateResult` dataclass." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dataclasses import asdict\n", "\n", "from impact_engine_evaluate import evaluate_confidence\n", "\n", "result = asdict(evaluate_confidence(config=None, job_dir=str(job_dir)))\n", "\n", "print_result_summary(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`evaluate_confidence()` returns the same information as the direct scorer,\n", "here converted to a 5-key dict with `asdict()`. It automatically read\n", "`manifest.json`, found `evaluate_strategy: \"score\"`, and used the\n", "experiment reviewer's confidence range `(0.85, 1.0)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 — Verify reproducibility\n", "\n", "The deterministic scorer hashes `initiative_id` to seed a random number\n", "generator. The same ID always produces the same confidence score, regardless\n", "of when or where the code runs."
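] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the hash-then-draw idea concrete, here is a minimal sketch of the\n", "technique. This is an illustration only, not the actual implementation of\n", "`score_confidence()`; the real seeding scheme and hash choice may differ:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hashlib\n", "import random\n", "\n", "\n", "def sketch_score(initiative_id: str, lo: float = 0.85, hi: float = 1.0) -> float:\n", "    # Hash the ID to a stable 64-bit integer seed, then draw once from [lo, hi).\n", "    seed = int.from_bytes(hashlib.sha256(initiative_id.encode()).digest()[:8], \"big\")\n", "    rng = random.Random(seed)\n", "    return lo + rng.random() * (hi - lo)\n", "\n", "\n", "# Same ID, same seed, same draw — every time, on any machine.\n", "print(sketch_score(\"initiative-demo-001\"))"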
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = [score_confidence(\"initiative-demo-001\", confidence_range=(0.85, 1.0)).confidence for _ in range(5)]\n", "\n", "print(f\"Scores across 5 calls: {scores}\")\n", "assert len(set(scores)) == 1, \"Scores should be identical\"\n", "print(\"All scores are identical — deterministic scoring is reproducible.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 — Compare confidence ranges across methods\n", "\n", "Different measurement methodologies get different confidence ranges. An RCT\n", "deserves higher confidence than a weaker design. The `MethodReviewerRegistry`\n", "exposes the confidence map for all registered methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from impact_engine_evaluate.review.methods import MethodReviewerRegistry\n", "\n", "print(\"Registered methods and confidence ranges:\")\n", "for name, bounds in sorted(MethodReviewerRegistry.confidence_map().items()):\n", " print(f\" {name}: ({bounds[0]:.2f}, {bounds[1]:.2f})\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "# Clean up\n", "shutil.rmtree(job_dir, ignore_errors=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 4 }