{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Deterministic Confidence Scoring\n", "\n", "This tutorial demonstrates the deterministic evaluation path — a lightweight\n", "scorer included for debugging, testing, and illustration. It assigns a\n", "reproducible confidence score to a causal estimate based on the measurement\n", "methodology, without calling an LLM.\n", "\n", "## Workflow overview\n", "\n", "1. Create a mock job directory with `manifest.json` and `impact_results.json`\n", "2. Score directly with `score_initiative()`\n", "3. Score via the `Evaluate` adapter\n", "4. Verify reproducibility across calls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Workflow overview\n", "\n", "1. Create a mock job directory\n", "2. Score directly with `score_confidence()`\n", "3. Score via `evaluate_confidence()`\n", "4. Verify reproducibility\n", "5. Compare confidence ranges across methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initial Setup" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import tempfile\n", "from pathlib import Path\n", "\n", "from notebook_support import print_result_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1 — Create a mock job directory\n", "\n", "An upstream producer writes a job directory containing `manifest.json` (metadata\n", "and file references) and `impact_results.json` (the producer's output). Here\n", "we create one manually to illustrate the convention." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tmp = tempfile.mkdtemp(prefix=\"job-impact-engine-\")\n", "job_dir = Path(tmp)\n", "\n", "manifest = {\n", " \"model_type\": \"experiment\",\n", " \"evaluate_strategy\": \"score\",\n", " \"created_at\": \"2025-06-01T12:00:00+00:00\",\n", " \"files\": {\n", " \"impact_results\": {\"path\": \"impact_results.json\", \"format\": \"json\"},\n", " },\n", "}\n", "\n", "impact_results = {\n", " \"ci_upper\": 15.0,\n", " \"effect_estimate\": 10.0,\n", " \"ci_lower\": 5.0,\n", " \"cost_to_scale\": 100.0,\n", " \"sample_size\": 500,\n", "}\n", "\n", "(job_dir / \"manifest.json\").write_text(json.dumps(manifest, indent=2))\n", "(job_dir / \"impact_results.json\").write_text(json.dumps(impact_results, indent=2))\n", "\n", "print(f\"Job directory: {job_dir}\")\n", "print(f\"Files: {[p.name for p in job_dir.iterdir()]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2 — Score directly with `score_confidence()`\n", "\n", "`score_confidence()` is a pure function useful for debugging and testing. It\n", "takes an `initiative_id` string and a confidence range, hashes the ID to seed\n", "an RNG, and draws a reproducible confidence value. The confidence range comes\n", "from the method reviewer (an experiment uses `(0.85, 1.0)` because RCTs\n", "produce the strongest evidence)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from impact_engine_evaluate.score import score_confidence\n", "\n", "result = score_confidence(\"initiative-demo-001\", confidence_range=(0.85, 1.0))\n", "lo, hi = result.confidence_range\n", "print(f\"Initiative: {result.initiative_id}\")\n", "print(f\"Confidence: {result.confidence:.4f} (range {lo:.2f}–{hi:.2f})\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confidence score falls within `(0.85, 1.0)` because we specified the\n", "experiment confidence range. 
The score is deterministic: scoring the same\n", "`initiative_id` always produces the same value." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3 — Score via `evaluate_confidence()`\n", "\n", "In the full pipeline, the orchestrator calls `evaluate_confidence()` — the\n", "package-level entry point. It reads the manifest, looks up the registered\n", "method reviewer for `model_type`, and dispatches on `evaluate_strategy`.\n", "The result is an `EvaluateResult` dataclass." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dataclasses import asdict\n", "\n", "from impact_engine_evaluate import evaluate_confidence\n", "\n", "result = asdict(evaluate_confidence(config=None, job_dir=str(job_dir)))\n", "\n", "print_result_summary(result)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`evaluate_confidence()` returns the same information as the direct scorer,\n", "here converted to a 5-key dict with `asdict()`. It automatically read\n", "`manifest.json`, found `evaluate_strategy: \"score\"`, and used the\n", "experiment reviewer's confidence range `(0.85, 1.0)`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 4 — Verify reproducibility\n", "\n", "The deterministic scorer hashes `initiative_id` to seed a random number\n", "generator. The same ID always produces the same confidence score, regardless\n", "of when or where the code runs."
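] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make the hash-then-draw idea concrete, here is a minimal sketch of the\n", "technique. This is an illustration only, not the actual implementation of\n", "`score_confidence()`; the real seeding scheme and hash choice may differ:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import hashlib\n", "import random\n", "\n", "\n", "def sketch_score(initiative_id: str, lo: float = 0.85, hi: float = 1.0) -> float:\n", "    # Hash the ID to a stable 64-bit integer seed, then draw once from [lo, hi).\n", "    seed = int.from_bytes(hashlib.sha256(initiative_id.encode()).digest()[:8], \"big\")\n", "    rng = random.Random(seed)\n", "    return lo + rng.random() * (hi - lo)\n", "\n", "\n", "# Same ID, same seed, same draw — every time, on any machine.\n", "print(sketch_score(\"initiative-demo-001\"))"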
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = [score_confidence(\"initiative-demo-001\", confidence_range=(0.85, 1.0)).confidence for _ in range(5)]\n", "\n", "print(f\"Scores across 5 calls: {scores}\")\n", "assert len(set(scores)) == 1, \"Scores should be identical\"\n", "print(\"All scores are identical — deterministic scoring is reproducible.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 5 — Compare confidence ranges across methods\n", "\n", "Different measurement methodologies get different confidence ranges. An RCT\n", "deserves higher confidence than a weaker design. The `MethodReviewerRegistry`\n", "exposes the confidence map for all registered methods." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from impact_engine_evaluate.review.methods import MethodReviewerRegistry\n", "\n", "print(\"Registered methods and confidence ranges:\")\n", "for name, bounds in sorted(MethodReviewerRegistry.confidence_map().items()):\n", " print(f\" {name}: ({bounds[0]:.2f}, {bounds[1]:.2f})\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import shutil\n", "\n", "# Clean up\n", "shutil.rmtree(job_dir, ignore_errors=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.0" } }, "nbformat": 4, "nbformat_minor": 4 }