Local LLM Review with Ollama
This tutorial demonstrates the review evaluation path using a locally hosted model via Ollama. No API key or internet connection is required — the model runs entirely on your machine.
Note: This notebook requires Ollama to be running locally and is not executed during the docs build. Code cells include pre-computed output.
Workflow overview

1. Inspect the job directory — a synthetic RCT with realistic artifacts
2. Configure the backend to call ollama_chat/llama3.2
3. Run evaluate_confidence()
4. Inspect the EvaluateResult
5. Examine the output file
Initial Setup
Install and start Ollama, then pull a model:
ollama pull llama3.2
ollama serve # already running if the desktop app is open
No extra Python dependencies are needed beyond the base install:
pip install impact-engine-evaluate
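Before moving on, you can optionally confirm that the Ollama server is reachable and that the model has been pulled. This check uses only the standard library and assumes Ollama's default local endpoint at http://localhost:11434:
import json
import urllib.request

# List the models Ollama has pulled locally (GET /api/tags on the default endpoint).
with urllib.request.urlopen("http://localhost:11434/api/tags") as resp:
    local_models = [m["name"] for m in json.loads(resp.read())["models"]]

print("Local models:", local_models)
if not any(name.startswith("llama3.2") for name in local_models):
    print("llama3.2 not found - run `ollama pull llama3.2` first")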
Step 1 — Inspect the job directory
The rct_job/ directory alongside this notebook is a synthetic early-literacy RCT. It contains:
- manifest.json — metadata, file references, and evaluation strategy
- impact_results.json — summary statistics (effect estimate, CI, sample size)
- regression_output.json — full OLS output with balance check and attrition data
[1]:
import json
from pathlib import Path
JOB_DIR = Path("rct_job")
print(f"Job directory: {JOB_DIR}")
print(f"Files: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
print("manifest.json:")
manifest = json.loads((JOB_DIR / "manifest.json").read_text())
print(json.dumps(manifest, indent=2))
Job directory: rct_job
Files: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']
manifest.json:
{
"schema_version": "2.0",
"model_type": "experiment",
"evaluate_strategy": "review",
"initiative_id": "literacy-rct-2024",
"created_at": "2025-03-15T09:00:00+00:00",
"files": {
"impact_results": {"path": "impact_results.json", "format": "json"},
"regression_output": {"path": "regression_output.json", "format": "json"}
}
}
Step 2 — Configure the backend
Create a review_config.yaml file alongside this notebook to specify the model and backend parameters. A copy is provided — inspect it now:
[ ]:
from impact_engine_evaluate.config import load_config
CONFIG_FILE = Path("review_config.yaml")
print(CONFIG_FILE.read_text())
config = load_config(CONFIG_FILE)
print(f"Backend : {config.backend.model}")
print(f"Settings: temperature={config.backend.temperature}, max_tokens={config.backend.max_tokens}")
Step 3 — Run evaluate_confidence()
evaluate_confidence() is the package-level entry point, symmetric with evaluate_impact() in the measure component:
1. Reads manifest.json and loads the registered ExperimentReviewer
2. Concatenates all artifact files into a single text payload
3. Renders the prompt with domain knowledge from knowledge/
4. Calls the model via litellm and parses the structured JSON response
5. Writes evaluate_result.json and review_result.json to the job directory
[ ]:
from impact_engine_evaluate import evaluate_confidence
result = evaluate_confidence(CONFIG_FILE, JOB_DIR)
print(f"Review complete. Overall score: {result.confidence:.2f}")
Step 4 — Inspect the EvaluateResult
The result contains the confidence score, strategy used, and a per-dimension breakdown accessible via result.report:
[ ]:
print(f"Initiative : {result.initiative_id}")
print(f"Strategy : {result.strategy}")
print(f"Confidence : {result.confidence:.3f}")
print(f"Range : [{result.confidence_range[0]:.2f}, {result.confidence_range[1]:.2f}]")
print()
print("Dimensions (from result.report):")
for dim in result.report["dimensions"]:
bar = "#" * int(dim["score"] * 20)
print(f" {dim['name']:<30} {dim['score']:.3f} |{bar:<20}|")
print(f" {dim['justification']}")
print()
The experiment reviewer evaluates five dimensions (their names appear in result.report). What each checks:

- Attrition, balance, differential dropout
- OLS formula, covariates, robust SEs
- CIs, p-values, F-statistic, multiple testing
- Spillover, non-compliance, SUTVA, Hawthorne
- Whether the treatment effect is realistic
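As a quick follow-up, you can pull the weakest dimension out of result.report to see where a manual check is most worthwhile; this reuses the same name, score, and justification fields printed in the loop above:
# Find the lowest-scoring dimension from the per-dimension breakdown.
weakest = min(result.report["dimensions"], key=lambda d: d["score"])
print(f"Lowest-scoring dimension: {weakest['name']} ({weakest['score']:.3f})")
print(weakest["justification"])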
Step 5 — Examine the output file
evaluate_confidence() writes review_result.json to the job directory alongside the original artifacts. The manifest is treated as read-only.
[5]:
print(f"Job directory contents: {sorted(p.name for p in JOB_DIR.iterdir())}")
print()
review_data = json.loads((JOB_DIR / "review_result.json").read_text())
print(f"review_result.json keys: {list(review_data.keys())}")
print(f"Overall score : {review_data['overall_score']}")
print(f"Dimensions : {len(review_data['dimensions'])}")
Job directory contents: ['impact_results.json', 'manifest.json', 'regression_output.json', 'review_result.json']
review_result.json keys: ['initiative_id', 'prompt_name', 'prompt_version', 'backend_name', 'model', 'dimensions', 'overall_score', 'raw_response', 'timestamp']
Overall score : 0.75
Dimensions : 5
The job directory now contains:
rct_job/
├── manifest.json # read-only (created by the producer)
├── impact_results.json # summary statistics
├── regression_output.json # full OLS output
└── review_result.json # structured review written by evaluate
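If you want to re-run the review from a clean state, delete the generated output files; the manifest and input artifacts are left untouched:
# Remove only evaluate-generated outputs; producer-created inputs stay in place.
for name in ("review_result.json", "evaluate_result.json"):
    (JOB_DIR / name).unlink(missing_ok=True)
print(sorted(p.name for p in JOB_DIR.iterdir()))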