Configuration

Impact Engine uses YAML configuration files to control all aspects of data sourcing, measurement, and output. This guide documents the actual configuration schema as implemented in the code.

Configuration structure

The engine uses YAML configuration files with three main sections.

DATA:
  SOURCE:
    # Data source configuration
  TRANSFORM:
    # Optional data transformation

MEASUREMENT:
  # Model configuration

OUTPUT:
  # Output path configuration

DATA section

Configures where metrics data comes from and how it’s transformed.

SOURCE configuration

Parameter	Type	Required	Description
`type`	string	No	Data source type: `"simulator"` (default)
`CONFIG`	object	Yes	Source-specific configuration

Simulator CONFIG parameters (default)

The simulator generates synthetic metrics data from a product catalog.

DATA:
  SOURCE:
    type: simulator
    CONFIG:
      mode: rule
      seed: 42
      path: data/products.csv
      start_date: "2024-01-01"
      end_date: "2024-01-31"

Parameter	Type	Required	Default	Description
`path`	string	Yes	-	Path to products CSV file
`start_date`	string	Yes	-	Analysis start date (YYYY-MM-DD)
`end_date`	string	Yes	-	Analysis end date (YYYY-MM-DD)
`mode`	string	No	`"rule"`	Simulation mode: `"rule"` (deterministic)
`seed`	int	No	`42`	Random seed for reproducibility

File CONFIG parameters

Load metrics from an existing CSV or Parquet file instead of simulating.

DATA:
  SOURCE:
    type: file
    CONFIG:
      path: data/metrics.csv
      product_id_column: product_id
      date_column: date

Parameter	Type	Required	Default	Description
`path`	string	Yes	-	Path to data file (CSV, Parquet, or partitioned Parquet directory)
`product_id_column`	string	No	`"product_id"`	Column name for product identifiers
`date_column`	string	No	`"date"`	Column name for dates

Enrichment configuration

Apply synthetic interventions to simulated data for testing causal impact detection.

DATA:
  SOURCE:
    type: simulator
    CONFIG:
      mode: rule
      seed: 42
      path: data/products.csv
      start_date: "2024-11-01"
      end_date: "2024-12-15"
  ENRICHMENT:
    FUNCTION: product_detail_boost
    PARAMS:
      quality_boost: 0.15
      enrichment_fraction: 1.0
      enrichment_start: "2024-11-23"
      seed: 42

Parameter	Type	Required	Description
`ENRICHMENT.FUNCTION`	string	Yes	Enrichment function: `"product_detail_boost"`
`ENRICHMENT.PARAMS.quality_boost`	float	Yes	Magnitude of the quality score boost (e.g., 0.15)
`ENRICHMENT.PARAMS.enrichment_fraction`	float	No	Fraction of products to enrich (0.0-1.0, default 1.0)
`ENRICHMENT.PARAMS.enrichment_start`	string	Yes	Date when enrichment begins (YYYY-MM-DD)
`ENRICHMENT.PARAMS.seed`	int	No	Random seed for reproducibility

TRANSFORM configuration

Optional transformation applied to data before model fitting.

DATA:
  TRANSFORM:
    FUNCTION: aggregate_by_date
    PARAMS:
      metric: revenue

Parameter	Type	Required	Default	Description
`FUNCTION`	string	No	`"passthrough"`	Transform function name
`PARAMS`	object	No	`{}`	Function-specific parameters

Available transforms

Each model typically pairs with a specific transform. The engine selects the transform by name from a registry.

Transform	Used With	Description	Key Parameters
`passthrough`	Any	No-op default. Passes data through unchanged.	None
`aggregate_by_date`	Interrupted Time Series	Sums all numeric columns by date, producing one row per date.	`metric`: column to validate exists (default `"revenue"`)
`prepare_for_synthetic_control`	Synthetic Control	Adds a `treatment` column derived from enrichment status and date.	`enrichment_start`: date when enrichment began (auto-injected from ENRICHMENT.PARAMS)
`aggregate_for_approximation`	Metrics Approximation	Aggregates baseline metric per product into cross-sectional format.	`baseline_metric`: column to aggregate (default `"revenue"`)
`prepare_simulator_for_approximation`	Metrics Approximation (simulator source)	Converts simulator time-series into before/after quality scores and baseline sales per product.	`enrichment_start`: date split point (required), `baseline_metric`: column to aggregate (default `"revenue"`)

MEASUREMENT section

Configures the statistical model for impact analysis.

Common parameters

Parameter	Type	Required	Description
`MODEL`	string	No	Model type (default: `"interrupted_time_series"`)
`PARAMS`	object	Yes	Model-specific parameters

Interrupted time series model

MEASUREMENT:
  MODEL: interrupted_time_series
  PARAMS:
    intervention_date: "2024-01-15"
    dependent_variable: revenue
    order: [1, 0, 0]
    seasonal_order: [0, 0, 0, 0]

Parameter	Type	Required	Default	Description
`intervention_date`	string	Yes	-	Date when intervention occurred (YYYY-MM-DD)
`dependent_variable`	string	No	`"revenue"`	Column name to analyze
`order`	array	No	`[1, 0, 0]`	ARIMA order (p, d, q)
`seasonal_order`	array	No	`[0, 0, 0, 0]`	Seasonal ARIMA order (P, D, Q, s)

Experiment model

MEASUREMENT:
  MODEL: experiment
  PARAMS:
    formula: "revenue ~ treatment + price"

Parameter	Type	Required	Default	Description
`formula`	string	Yes	-	R-style formula where all variables must exist in the DataFrame

Subclassification model

MEASUREMENT:
  MODEL: subclassification
  PARAMS:
    dependent_variable: revenue
    treatment_column: treatment
    covariate_columns:
      - price
    n_strata: 5
    estimand: att

Parameter	Type	Required	Default	Description
`dependent_variable`	string	No	`"revenue"`	Outcome column name
`treatment_column`	string	Yes	-	Binary treatment indicator column
`covariate_columns`	list	Yes	-	Columns used for propensity stratification
`n_strata`	int	No	`5`	Number of quantile-based strata
`estimand`	string	No	`"att"`	Estimand: `"att"` or `"ate"`

Nearest neighbour matching model

MEASUREMENT:
  MODEL: nearest_neighbour_matching
  PARAMS:
    dependent_variable: revenue
    treatment_column: treatment
    covariate_columns:
      - price
    caliper: 0.2
    replace: false
    ratio: 1

Parameter	Type	Required	Default	Description
`dependent_variable`	string	No	`"revenue"`	Outcome column name
`treatment_column`	string	Yes	-	Binary treatment indicator column
`covariate_columns`	list	Yes	-	Columns used for matching
`caliper`	float	No	`0.2`	Maximum distance for a valid match
`replace`	bool	No	`false`	Whether to match with replacement
`ratio`	int	No	`1`	Number of matches per unit
`shuffle`	bool	No	`true`	Shuffle data before matching
`random_state`	int	No	`null`	Random seed for reproducibility
`n_jobs`	int	No	`1`	Number of parallel jobs

Metrics approximation model

MEASUREMENT:
  MODEL: metrics_approximation
  PARAMS:
    metric_before_column: quality_before
    metric_after_column: quality_after
    baseline_column: baseline_sales
    RESPONSE:
      FUNCTION: linear
      PARAMS:
        coefficient: 0.5

Parameter	Type	Required	Default	Description
`metric_before_column`	string	No	`"quality_before"`	Column name for pre-intervention metric
`metric_after_column`	string	No	`"quality_after"`	Column name for post-intervention metric
`baseline_column`	string	No	`"baseline_sales"`	Column name for baseline outcome
`RESPONSE.FUNCTION`	string	No	`"linear"`	Response function name from the response registry
`RESPONSE.PARAMS.coefficient`	float	No	`0.5`	Coefficient for the linear response function

Synthetic control model

MEASUREMENT:
  MODEL: synthetic_control
  PARAMS:
    treatment_time: 15
    treated_unit: "unit_A"
    outcome_column: revenue
    unit_column: unit_id
    time_column: date

Parameter	Type	Required	Default	Description
`treatment_time`	int	Yes	-	Time index when intervention occurred
`treated_unit`	string	Yes	-	Name of the treated unit
`outcome_column`	string	Yes	-	Column with the outcome variable
`unit_column`	string	No	`"unit_id"`	Column identifying units
`time_column`	string	No	`"date"`	Column identifying time periods
`optim_method`	string	No	`"Nelder-Mead"`	Optimization method passed to pysyncon
`optim_initial`	string	No	`"equal"`	Initial weight strategy for optimization

OUTPUT section

Configures where results are stored.

OUTPUT:
  PATH: output

Parameter	Type	Required	Default	Description
`PATH`	string	No	`"output"`	Directory for output files

Complete example

DATA:
  SOURCE:
    type: simulator
    CONFIG:
      mode: rule
      seed: 42
      path: data/products.csv
      start_date: "2024-01-01"
      end_date: "2024-03-31"
  TRANSFORM:
    FUNCTION: aggregate_by_date
    PARAMS:
      metric: revenue

MEASUREMENT:
  MODEL: interrupted_time_series
  PARAMS:
    intervention_date: "2024-02-01"
    dependent_variable: revenue
    order: [1, 0, 0]
    seasonal_order: [0, 0, 0, 7]

OUTPUT:
  PATH: output