# Configuration Impact Engine uses YAML configuration files to control all aspects of data sourcing, measurement, and output. This guide documents the **actual configuration schema** as implemented in the code. ## Configuration structure The engine uses YAML configuration files with three main sections. ```yaml DATA: SOURCE: # Data source configuration TRANSFORM: # Optional data transformation MEASUREMENT: # Model configuration OUTPUT: # Output path configuration ``` ## DATA section Configures where metrics data comes from and how it's transformed. ### SOURCE configuration | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `type` | string | No | Data source type: `"simulator"` (default) | | `CONFIG` | object | Yes | Source-specific configuration | ### Simulator CONFIG parameters (default) The simulator generates synthetic metrics data from a product catalog. ```yaml DATA: SOURCE: type: simulator CONFIG: mode: rule seed: 42 path: data/products.csv start_date: "2024-01-01" end_date: "2024-01-31" ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `path` | string | Yes | - | Path to products CSV file | | `start_date` | string | Yes | - | Analysis start date (YYYY-MM-DD) | | `end_date` | string | Yes | - | Analysis end date (YYYY-MM-DD) | | `mode` | string | No | `"rule"` | Simulation mode: `"rule"` (deterministic) | | `seed` | int | No | `42` | Random seed for reproducibility | ### File CONFIG parameters Load metrics from an existing CSV or Parquet file instead of simulating. ```yaml DATA: SOURCE: type: file CONFIG: path: data/metrics.csv product_id_column: product_id date_column: date ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `path` | string | Yes | - | Path to data file (CSV, Parquet, or partitioned Parquet directory) | | `product_id_column` | string | No | `"product_id"` | Column name for product identifiers | | `date_column` | string | No | `"date"` | Column name for dates | ### Enrichment configuration Apply synthetic interventions to simulated data for testing causal impact detection. ```yaml DATA: SOURCE: type: simulator CONFIG: mode: rule seed: 42 path: data/products.csv start_date: "2024-11-01" end_date: "2024-12-15" ENRICHMENT: FUNCTION: product_detail_boost PARAMS: quality_boost: 0.15 enrichment_fraction: 1.0 enrichment_start: "2024-11-23" seed: 42 ``` | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `ENRICHMENT.FUNCTION` | string | Yes | Enrichment function: `"product_detail_boost"` | | `ENRICHMENT.PARAMS.quality_boost` | float | Yes | Magnitude of the quality score boost (e.g., 0.15) | | `ENRICHMENT.PARAMS.enrichment_fraction` | float | No | Fraction of products to enrich (0.0-1.0, default 1.0) | | `ENRICHMENT.PARAMS.enrichment_start` | string | Yes | Date when enrichment begins (YYYY-MM-DD) | | `ENRICHMENT.PARAMS.seed` | int | No | Random seed for reproducibility | ### TRANSFORM configuration Optional transformation applied to data before model fitting. ```yaml DATA: TRANSFORM: FUNCTION: aggregate_by_date PARAMS: metric: revenue ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `FUNCTION` | string | No | `"passthrough"` | Transform function name | | `PARAMS` | object | No | `{}` | Function-specific parameters | #### Available transforms Each model typically pairs with a specific transform. The engine selects the transform by name from a registry. | Transform | Used With | Description | Key Parameters | |-----------|-----------|-------------|----------------| | `passthrough` | Any | No-op default. Passes data through unchanged. | None | | `aggregate_by_date` | Interrupted Time Series | Sums all numeric columns by date, producing one row per date. | `metric`: column to validate exists (default `"revenue"`) | | `prepare_for_synthetic_control` | Synthetic Control | Adds a `treatment` column derived from enrichment status and date. | `enrichment_start`: date when enrichment began (auto-injected from ENRICHMENT.PARAMS) | | `aggregate_for_approximation` | Metrics Approximation | Aggregates baseline metric per product into cross-sectional format. | `baseline_metric`: column to aggregate (default `"revenue"`) | | `prepare_simulator_for_approximation` | Metrics Approximation (simulator source) | Converts simulator time-series into before/after quality scores and baseline sales per product. | `enrichment_start`: date split point (required), `baseline_metric`: column to aggregate (default `"revenue"`) | ## MEASUREMENT section Configures the statistical model for impact analysis. ### Common parameters | Parameter | Type | Required | Description | |-----------|------|----------|-------------| | `MODEL` | string | No | Model type (default: `"interrupted_time_series"`) | | `PARAMS` | object | Yes | Model-specific parameters | ### Interrupted time series model ```yaml MEASUREMENT: MODEL: interrupted_time_series PARAMS: intervention_date: "2024-01-15" dependent_variable: revenue order: [1, 0, 0] seasonal_order: [0, 0, 0, 0] ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `intervention_date` | string | Yes | - | Date when intervention occurred (YYYY-MM-DD) | | `dependent_variable` | string | No | `"revenue"` | Column name to analyze | | `order` | array | No | `[1, 0, 0]` | ARIMA order (p, d, q) | | `seasonal_order` | array | No | `[0, 0, 0, 0]` | Seasonal ARIMA order (P, D, Q, s) | ### Experiment model ```yaml MEASUREMENT: MODEL: experiment PARAMS: formula: "revenue ~ treatment + price" ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `formula` | string | Yes | - | R-style formula where all variables must exist in the DataFrame | --- ### Subclassification model ```yaml MEASUREMENT: MODEL: subclassification PARAMS: dependent_variable: revenue treatment_column: treatment covariate_columns: - price n_strata: 5 estimand: att ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `dependent_variable` | string | No | `"revenue"` | Outcome column name | | `treatment_column` | string | Yes | - | Binary treatment indicator column | | `covariate_columns` | list | Yes | - | Columns used for propensity stratification | | `n_strata` | int | No | `5` | Number of quantile-based strata | | `estimand` | string | No | `"att"` | Estimand: `"att"` or `"ate"` | --- ### Nearest neighbour matching model ```yaml MEASUREMENT: MODEL: nearest_neighbour_matching PARAMS: dependent_variable: revenue treatment_column: treatment covariate_columns: - price caliper: 0.2 replace: false ratio: 1 ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `dependent_variable` | string | No | `"revenue"` | Outcome column name | | `treatment_column` | string | Yes | - | Binary treatment indicator column | | `covariate_columns` | list | Yes | - | Columns used for matching | | `caliper` | float | No | `0.2` | Maximum distance for a valid match | | `replace` | bool | No | `false` | Whether to match with replacement | | `ratio` | int | No | `1` | Number of matches per unit | | `shuffle` | bool | No | `true` | Shuffle data before matching | | `random_state` | int | No | `null` | Random seed for reproducibility | | `n_jobs` | int | No | `1` | Number of parallel jobs | --- ### Metrics approximation model ```yaml MEASUREMENT: MODEL: metrics_approximation PARAMS: metric_before_column: quality_before metric_after_column: quality_after baseline_column: baseline_sales RESPONSE: FUNCTION: linear PARAMS: coefficient: 0.5 ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `metric_before_column` | string | No | `"quality_before"` | Column name for pre-intervention metric | | `metric_after_column` | string | No | `"quality_after"` | Column name for post-intervention metric | | `baseline_column` | string | No | `"baseline_sales"` | Column name for baseline outcome | | `RESPONSE.FUNCTION` | string | No | `"linear"` | Response function name from the response registry | | `RESPONSE.PARAMS.coefficient` | float | No | `0.5` | Coefficient for the linear response function | --- ### Synthetic control model ```yaml MEASUREMENT: MODEL: synthetic_control PARAMS: treatment_time: 15 treated_unit: "unit_A" outcome_column: revenue unit_column: unit_id time_column: date ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `treatment_time` | int | Yes | - | Time index when intervention occurred | | `treated_unit` | string | Yes | - | Name of the treated unit | | `outcome_column` | string | Yes | - | Column with the outcome variable | | `unit_column` | string | No | `"unit_id"` | Column identifying units | | `time_column` | string | No | `"date"` | Column identifying time periods | | `optim_method` | string | No | `"Nelder-Mead"` | Optimization method passed to pysyncon | | `optim_initial` | string | No | `"equal"` | Initial weight strategy for optimization | --- ## OUTPUT section Configures where results are stored. ```yaml OUTPUT: PATH: output ``` | Parameter | Type | Required | Default | Description | |-----------|------|----------|---------|-------------| | `PATH` | string | No | `"output"` | Directory for output files | ## Complete example ```yaml DATA: SOURCE: type: simulator CONFIG: mode: rule seed: 42 path: data/products.csv start_date: "2024-01-01" end_date: "2024-03-31" TRANSFORM: FUNCTION: aggregate_by_date PARAMS: metric: revenue MEASUREMENT: MODEL: interrupted_time_series PARAMS: intervention_date: "2024-02-01" dependent_variable: revenue order: [1, 0, 0] seasonal_order: [0, 0, 0, 7] OUTPUT: PATH: output ```