
Local LLM Review with Ollama

This tutorial demonstrates the review evaluation path using a locally hosted model via Ollama. No API key or internet connection is required — the model runs entirely on your machine.

Note: This notebook requires Ollama to be running locally and is not executed during the docs build. Code cells include pre-computed output.

Workflow overview

  1. Inspect the job directory (a synthetic RCT with realistic artifacts)

  2. Configure the backend to call ollama_chat/llama3.2

  3. Run evaluate_confidence()

  4. Inspect the EvaluateResult

  5. Examine the output file

Initial Setup

Install and start Ollama, then pull a model:

ollama pull llama3.2
ollama serve          # already running if the desktop app is open
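
Before continuing, you can confirm the server is reachable with a quick check that uses only the standard library. The address below is Ollama's default local endpoint; adjust it if your installation listens elsewhere.

# Optional: list the models the local Ollama server has available.
# http://localhost:11434 is Ollama's default address.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    models = json.load(resp)["models"]
print("Pulled models:", [m["name"] for m in models])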

No extra Python dependencies are needed beyond the base install:

pip install impact-engine-evaluate

Step 1 — Inspect the job directory

The rct_job/ directory alongside this notebook is a synthetic early-literacy RCT. It contains:

  • manifest.json — metadata, file references, and evaluation strategy

  • impact_results.json — summary statistics (effect estimate, CI, sample size)

  • regression_output.json — full OLS output with balance check and attrition data

[1]:
import json
from pathlib import Path

JOB_DIR = Path("rct_job")

print(f"Job directory: {JOB_DIR}")
print(f"Files: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
print("manifest.json:")
manifest = json.loads((JOB_DIR / "manifest.json").read_text())
print(json.dumps(manifest, indent=2))
Job directory: rct_job
Files: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']

manifest.json:
{
  "schema_version": "2.0",
  "model_type": "experiment",
  "evaluate_strategy": "review",
  "initiative_id": "literacy-rct-2024",
  "created_at": "2025-03-15T09:00:00+00:00",
  "files": {
    "impact_results": {"path": "impact_results.json", "format": "json"},
    "regression_output": {"path": "regression_output.json", "format": "json"}
  }
}
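
The other artifacts can be previewed the same way. The snippet below simply walks the file references in the manifest and prints the start of each payload; the reviewer itself reads these files verbatim in Step 3.

# Preview the artifacts referenced by the manifest (truncated for brevity).
for key, ref in manifest["files"].items():
    payload = json.loads((JOB_DIR / ref["path"]).read_text())
    print(f"--- {key} ({ref['path']}) ---")
    print(json.dumps(payload, indent=2)[:400])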

Step 2 — Configure the backend

The backend is configured through a review_config.yaml file alongside this notebook, which specifies the model and backend parameters. A copy is provided with this tutorial; inspect it now:

[ ]:
from impact_engine_evaluate.config import load_config

CONFIG_FILE = Path("review_config.yaml")
print(CONFIG_FILE.read_text())

config = load_config(CONFIG_FILE)
print(f"Backend : {config.backend.model}")
print(f"Settings: temperature={config.backend.temperature}, max_tokens={config.backend.max_tokens}")

Step 3 — Run evaluate_confidence()

evaluate_confidence() is the package-level entry point, symmetric with evaluate_impact() in the measure component:

  1. Reads manifest.json and loads the registered ExperimentReviewer

  2. Concatenates all artifact files into a single text payload

  3. Renders the prompt with domain knowledge from knowledge/

  4. Calls the model via litellm and parses the structured JSON response

  5. Writes evaluate_result.json and review_result.json to the job directory

[ ]:
from impact_engine_evaluate import evaluate_confidence

result = evaluate_confidence(CONFIG_FILE, JOB_DIR)
print(f"Review complete. Overall score: {result.confidence:.2f}")

Step 4 — Inspect the EvaluateResult

The result contains the confidence score, strategy used, and a per-dimension breakdown accessible via result.report:

[ ]:
print(f"Initiative  : {result.initiative_id}")
print(f"Strategy    : {result.strategy}")
print(f"Confidence  : {result.confidence:.3f}")
print(f"Range       : [{result.confidence_range[0]:.2f}, {result.confidence_range[1]:.2f}]")
print()
print("Dimensions (from result.report):")
for dim in result.report["dimensions"]:
    bar = "#" * int(dim["score"] * 20)
    print(f"  {dim['name']:<30} {dim['score']:.3f}  |{bar:<20}|")
    print(f"    {dim['justification']}")
    print()

The experiment reviewer evaluates five dimensions:

Dimension                   What it checks
randomization_integrity     Attrition, balance, differential dropout
specification_adequacy      OLS formula, covariates, robust SEs
statistical_inference       CIs, p-values, F-statistic, multiple testing
threats_to_validity         Spillover, non-compliance, SUTVA, Hawthorne
effect_size_plausibility    Whether the treatment effect is realistic
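
Because the per-dimension scores are plain data, it is easy to layer simple checks on top of them. For example, you could flag dimensions that fall below a threshold of your choosing (the 0.6 cut-off below is arbitrary, chosen only for illustration):

# Flag dimensions that scored below an illustrative threshold.
weak = [d["name"] for d in result.report["dimensions"] if d["score"] < 0.6]
print("Dimensions needing a closer look:", ", ".join(weak) or "none")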

Step 5 — Examine the output file

evaluate_confidence() writes review_result.json to the job directory alongside the original artifacts. The manifest is treated as read-only.

[5]:
print(f"Job directory contents: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
review_data = json.loads((JOB_DIR / "review_result.json").read_text())
print(f"review_result.json keys: {list(review_data.keys())}")
print(f"Overall score : {review_data['overall_score']}")
print(f"Dimensions    : {len(review_data['dimensions'])}")
Job directory contents: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']

review_result.json keys: ['initiative_id', 'prompt_name', 'prompt_version', 'backend_name', 'model', 'dimensions', 'overall_score', 'raw_response', 'timestamp']
Overall score : 0.75
Dimensions    : 5
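
The file also carries provenance fields (model, backend_name, timestamp), which is useful when comparing reviews produced by different local models:

# Show which model and backend produced this review, and when.
print(f"Model     : {review_data['model']}")
print(f"Backend   : {review_data['backend_name']}")
print(f"Timestamp : {review_data['timestamp']}")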

The job directory now contains:

rct_job/
├── manifest.json           # read-only (created by the producer)
├── impact_results.json     # summary statistics
├── regression_output.json  # full OLS output
└── review_result.json      # structured review written by evaluate