Architecture

The Online Retail Simulator follows a modular, configuration-driven architecture that supports multiple generation modes and extensible enrichment capabilities.

Core Design Principles

1. Configuration-Driven Workflow

All simulation behavior is controlled through YAML configuration files, enabling:

  • Reproducible experiments with version-controlled configs

  • Easy parameter sweeps and scenario testing

  • Clear separation of logic and parameters
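A parameter sweep over configs can be sketched as follows. This is a minimal illustration, not the simulator's own tooling: the `sweep` helper and the key names `SEED` and `num_products` are hypothetical, and the configs are plain dictionaries standing in for parsed YAML.

```python
import copy
import itertools

# Hypothetical base config; the key names are illustrative only.
base_config = {
    "SEED": 42,
    "RULE": {"CHARACTERISTICS": {"PARAMS": {"num_products": 100}}},
}


def sweep(base, grid):
    """Yield one independent config per combination of values in grid.

    grid maps a key path (tuple of nested keys) to the values to try.
    """
    paths = list(grid)
    for values in itertools.product(*(grid[p] for p in paths)):
        config = copy.deepcopy(base)  # never mutate the base config
        for path, value in zip(paths, values):
            node = config
            for key in path[:-1]:
                node = node.setdefault(key, {})
            node[path[-1]] = value
        yield config


grid = {
    ("SEED",): [1, 2],
    ("RULE", "CHARACTERISTICS", "PARAMS", "num_products"): [50, 200],
}
configs = list(sweep(base_config, grid))  # 2 x 2 = 4 variants
```

Each variant could then be written to its own YAML file and committed, which keeps every run of the sweep reproducible.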

2. Modular Architecture

The system is organized into distinct, loosely-coupled modules:

  • Core: Shared infrastructure including FunctionRegistry for extensible function registration

  • Simulation: Core data generation logic

  • Enrichment: Treatment effect application

  • Configuration: Parameter processing and validation

  • Storage: Data persistence and retrieval

3. Mode-Based Generation

Two complementary approaches for different use cases:

  • Rule-based: Deterministic, interpretable patterns

  • Synthesizer-based: ML-learned patterns from real data

4. Reproducible Output

Seed-based deterministic generation ensures:

  • Consistent results across runs

  • Reliable A/B testing scenarios

  • Debuggable data generation
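The idea behind seed-based generation can be shown with the standard library alone. The sketch below is an assumption about the general technique, not the simulator's code; `generate_prices` and its price range are hypothetical.

```python
import random


def generate_prices(n, seed):
    """Return n pseudo-random prices that depend only on the seed."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [round(rng.uniform(5.0, 500.0), 2) for _ in range(n)]


run_a = generate_prices(5, seed=42)
run_b = generate_prices(5, seed=42)  # same seed, same values as run_a
run_c = generate_prices(5, seed=7)   # different seed, different values
```

Using a local `random.Random` instance rather than the module-level functions keeps one component's draws from shifting another's, which is what makes a single run debuggable in isolation.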

System Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Configuration Layer                       │
├─────────────────────────────────────────────────────────────┤
│  config_processor.py  │  config_defaults.yaml              │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                     Simulation Layer                        │
├─────────────────────────────────────────────────────────────┤
│  simulate.py (orchestrator)                                 │
│  ├── simulate_characteristics.py                            │
│  │   ├── characteristics_rule_based.py                      │
│  │   └── characteristics_synthesizer_based.py               │
│  └── simulate_metrics.py                                    │
│      ├── metrics_rule_based.py                              │
│      └── metrics_synthesizer_based.py                       │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                    Enrichment Layer                         │
├─────────────────────────────────────────────────────────────┤
│  enrich.py (orchestrator)                                   │
│  ├── enrichment.py                                          │
│  ├── enrichment_library.py                                  │
│  └── enrichment_registry.py                                 │
└─────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌─────────────────────────────────────────────────────────────┐
│                     Storage Layer                           │
├─────────────────────────────────────────────────────────────┤
│  JSON/CSV export  │  Pandas DataFrames  │  Pickle models   │
└─────────────────────────────────────────────────────────────┘

Data Flow

1. Configuration Processing

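# config_processor.py
import yaml

def load_config(config_path):
    """Load and validate configuration with defaults"""
    # Parse the file contents (not the path string); safe_load avoids arbitrary object construction
    with open(config_path) as f:
        config = yaml.safe_load(f) or {}
    return merge_with_defaults(config)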
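How the defaults merge works internally is not shown in this document. One plausible approach is a recursive dictionary merge in which user values override defaults key by key. The helper below, `deep_merge_defaults`, is a hypothetical sketch that takes the defaults explicitly; its name, signature, and the example keys are assumptions.

```python
import copy


def deep_merge_defaults(config, defaults):
    """Return defaults overlaid with config, merging nested dicts recursively."""
    merged = copy.deepcopy(defaults)
    for key, value in (config or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing wholesale
            merged[key] = deep_merge_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical keys, for illustration only
defaults = {"SEED": 42, "RULE": {"A": 1, "B": 2}}
user_config = {"RULE": {"B": 3}}
merged_config = deep_merge_defaults(user_config, defaults)
```

With this approach a user config only needs to list the values it changes; everything else falls through from the defaults.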

2. Quality Score

Each product carries a quality_score between 0.0 and 1.0 that reflects how complete its title, description, features, and brand are. The score is calculated once product details have been generated; it cannot be computed right after the characteristics phase because no content exists yet to evaluate.

Stage                        Typical Score   Reason
---------------------------  --------------  ---------------------------------------------------
After characteristics        N/A             No quality_score (only identifier, category, price)
After product details        ~0.70-0.85      Title, description, brand, features added
After enrichment (treated)   ~0.85+          Enhanced content (if quality_boost applied)

Score Components:

  • Title quality (30%): Title length (up to 50 chars)

  • Description quality (35%): Description length (up to 100 chars)

  • Features quality (20%): Features list (up to 4 items)

  • Brand (15%): Brand field populated
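The weights above can be combined into a single score as sketched here. The weights and caps come from the list; the linear scaling up to each cap, the function name `compute_quality_score`, and the dictionary keys are assumptions, so the real implementation may differ in detail.

```python
def compute_quality_score(product):
    """Weighted completeness score in [0, 1] (sketch, assumes linear scaling)."""
    title = product.get("title") or ""
    description = product.get("description") or ""
    features = product.get("features") or []
    brand = product.get("brand") or ""

    title_part = min(len(title), 50) / 50               # capped at 50 chars
    description_part = min(len(description), 100) / 100 # capped at 100 chars
    features_part = min(len(features), 4) / 4           # capped at 4 items
    brand_part = 1.0 if brand.strip() else 0.0          # populated or not

    score = (
        0.30 * title_part
        + 0.35 * description_part
        + 0.20 * features_part
        + 0.15 * brand_part
    )
    return round(score, 4)


partial = {"title": "x" * 25, "description": "y" * 50, "features": ["a", "b"]}
partial_score = compute_quality_score(partial)  # 0.15 + 0.175 + 0.10
```

Under these assumptions a product with half-length text, two features, and no brand scores 0.425, and a product that reaches every cap scores 1.0.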

Impact on Metrics: Quality score affects conversion probability in metrics simulation. If quality_score is not present (e.g., right after characteristics), a neutral default of 0.5 is used:

# Maps quality_score [0,1] to multiplier [0.8, 1.2]
# Default 0.5 = multiplier 1.0 (no effect)
quality_score = product.get("quality_score", 0.5)
quality_multiplier = 0.8 + (quality_score * 0.4)
adjusted_sale_prob = sale_prob * quality_multiplier

3. Two-Phase Generation

Phase 1: Product Characteristics

# Generate product catalog, returns JobInfo
job_info = simulate_characteristics(config)
# Output: JobInfo (products.csv saved to job directory)
results = load_job_results(job_info)
products_df = results["products"]

Phase 2: Product Metrics

# Generate product metrics, takes JobInfo
job_info = simulate_metrics(job_info, config)
# Output: JobInfo (metrics.csv saved to job directory)
results = load_job_results(job_info)
metrics_df = results["metrics"]

4. Optional Enrichment

# Apply treatment effects to a baseline job
# (baseline_job is the JobInfo returned by simulate_metrics in Phase 2)
enriched_job = enrich("enrichment_config.yaml", baseline_job)
# Output: JobInfo for enriched results

Enrichment functions can optionally update quality_score for treated products using the quality_boost parameter:

IMPACT:
  FUNCTION: "product_detail_boost"
  PARAMS:
    quality_boost: 0.15  # Optional: adds +0.15 to treated products' quality_score
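What the quality_boost parameter does to treated products can be sketched as below. The function `apply_quality_boost`, the `product_identifier` key, and the clamp at 1.0 are assumptions made for illustration; the document only states that the boost is added to the treated products' quality_score.

```python
def apply_quality_boost(products, treated_ids, quality_boost=0.15):
    """Return copies of products with quality_score raised for treated ones."""
    boosted = []
    for product in products:
        updated = dict(product)  # leave the input rows untouched
        is_treated = updated.get("product_identifier") in treated_ids
        if is_treated and "quality_score" in updated:
            # Assumed clamp: keep the score inside its documented 0.0-1.0 range
            updated["quality_score"] = min(1.0, updated["quality_score"] + quality_boost)
        boosted.append(updated)
    return boosted


products = [
    {"product_identifier": "P1", "quality_score": 0.70},
    {"product_identifier": "P2", "quality_score": 0.95},
    {"product_identifier": "P3", "quality_score": 0.80},
]
boosted = apply_quality_boost(products, treated_ids={"P1", "P2"})
```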

Backend Plugin Architecture

The simulation system uses a plugin architecture for backend dispatch, making it easy to add new generation backends without modifying core orchestration code.

Core Components

# core/backends.py
from abc import ABC

import pandas as pd

class SimulationBackend(ABC):
    """Abstract base class for simulation backends."""

    def simulate_characteristics(self) -> pd.DataFrame:
        """Generate product characteristics."""
        ...

    def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame:
        """Generate metrics based on characteristics."""
        ...

    @classmethod
    def get_key(cls) -> str:
        """Config key that triggers this backend (e.g., 'RULE')."""
        ...


class BackendRegistry:
    """Registry for discovering and instantiating backends."""

    @classmethod
    def register(cls, backend_cls):
        """Register a backend class."""

    @classmethod
    def detect_backend(cls, config) -> SimulationBackend:
        """Detect and instantiate appropriate backend from config."""

Built-in Backends

Backend              Config Key    Description
-------------------  ------------  -----------------------------------
RuleBackend          RULE          Deterministic rule-based generation
SynthesizerBackend   SYNTHESIZER   ML-based generation using SDV

Backend Detection

The system automatically detects which backend to use based on config keys:

# Config with RULE key -> RuleBackend
config = {"RULE": {"CHARACTERISTICS": {...}, "METRICS": {...}}}

# Config with SYNTHESIZER key -> SynthesizerBackend
config = {"SYNTHESIZER": {"CHARACTERISTICS": {...}, "METRICS": {...}}}
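Key-based detection could work roughly as follows. This is a self-contained toy, not the code in core/backends.py: the `Toy*` class names, the `_backends` dictionary, and the error raised when zero or several keys match are all assumptions.

```python
from abc import ABC, abstractmethod


class ToySimulationBackend(ABC):
    """Stand-in for SimulationBackend, reduced to what detection needs."""

    def __init__(self, config):
        self.config = config

    @classmethod
    @abstractmethod
    def get_key(cls):
        """Config key that triggers this backend."""


class ToyBackendRegistry:
    _backends = {}  # maps config key -> backend class

    @classmethod
    def register(cls, backend_cls):
        cls._backends[backend_cls.get_key()] = backend_cls
        return backend_cls  # returning the class lets register act as a decorator

    @classmethod
    def detect_backend(cls, config):
        matches = [key for key in cls._backends if key in config]
        if len(matches) != 1:
            raise ValueError(f"expected exactly one backend key, found {matches}")
        return cls._backends[matches[0]](config)


@ToyBackendRegistry.register
class ToyRuleBackend(ToySimulationBackend):
    @classmethod
    def get_key(cls):
        return "RULE"


backend = ToyBackendRegistry.detect_backend({"RULE": {"CHARACTERISTICS": {}}})
```

Because the orchestrator only asks the registry for a backend, adding a new generation mode does not require touching the dispatch code.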

Adding Custom Backends

To add a new backend (e.g., CTGAN, TVAE):

import pandas as pd

from online_retail_simulator.core.backends import (
    BackendRegistry,
    SimulationBackend,
)

@BackendRegistry.register
class CTGANBackend(SimulationBackend):

    @classmethod
    def get_key(cls) -> str:
        return "CTGAN"

    def simulate_characteristics(self) -> pd.DataFrame:
        # Your CTGAN implementation
        ...

    def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame:
        # Your CTGAN implementation
        ...

Once registered, use it with:

CTGAN:
  CHARACTERISTICS:
    PARAMS: {...}
  METRICS:
    PARAMS: {...}

Enrichment System

Function Registry

The system uses a unified FunctionRegistry class for all extensible function types:

# core/registry.py
class FunctionRegistry:
    def register(self, name, func):
        """Register function with signature validation"""

    def get(self, name):
        """Retrieve registered function (lazy loads defaults)"""

    def list(self):
        """List all registered function names"""

Both simulation and enrichment registries use this common infrastructure.
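A registry with signature validation can be sketched with `inspect.signature`. The class below is a hypothetical simplification: the name `SketchFunctionRegistry` and the `required_params` argument are assumptions, and lazy loading of defaults is left out.

```python
import inspect


class SketchFunctionRegistry:
    """Name-to-function registry that checks signatures on registration."""

    def __init__(self, required_params=()):
        self._functions = {}
        self._required = tuple(required_params)

    def register(self, name, func):
        if not callable(func):
            raise TypeError(f"{name!r} is not callable")
        params = inspect.signature(func).parameters
        missing = [p for p in self._required if p not in params]
        if missing:
            raise ValueError(f"{name!r} is missing parameters: {missing}")
        self._functions[name] = func

    def get(self, name):
        try:
            return self._functions[name]
        except KeyError:
            raise KeyError(f"no function registered as {name!r}") from None

    def list(self):
        return sorted(self._functions)


registry = SketchFunctionRegistry(required_params=("metrics",))
registry.register("identity", lambda metrics, **kwargs: metrics)
```

Validating at registration time surfaces a badly shaped function when it is registered instead of in the middle of a simulation run.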

Built-in Impact Functions

Quantity Boost

def quantity_boost(metrics, effect_size=0.5, enrichment_fraction=0.3,
                   enrichment_start="2024-11-15", seed=42, **kwargs):
    """Simple multiplicative increase in ordered units"""
    # Boosts ordered_units by effect_size for enriched products
    # Returns: List of modified metric dictionaries
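The body of quantity_boost is not shown here. A minimal sketch of the behavior the comments describe might look like the following, assuming each metric row is a dictionary with `product_identifier` and `date` keys (both hypothetical) alongside `ordered_units`, and that "boost by effect_size" means multiplying by `1 + effect_size`.

```python
import random
from datetime import date


def quantity_boost_sketch(metrics, effect_size=0.5, enrichment_fraction=0.3,
                          enrichment_start="2024-11-15", seed=42):
    """Multiply ordered_units for a seeded random subset of products."""
    start = date.fromisoformat(enrichment_start)
    # Sort before sampling so the selection depends only on the seed
    product_ids = sorted({row["product_identifier"] for row in metrics})
    rng = random.Random(seed)
    n_treated = int(len(product_ids) * enrichment_fraction)
    treated = set(rng.sample(product_ids, n_treated))

    result = []
    for row in metrics:
        updated = dict(row)  # copy so the input rows are not mutated
        on_or_after_start = date.fromisoformat(row["date"]) >= start
        if row["product_identifier"] in treated and on_or_after_start:
            updated["ordered_units"] = int(row["ordered_units"] * (1 + effect_size))
        result.append(updated)
    return result
```

With ten products and the default enrichment_fraction of 0.3, three products are treated, and only their rows dated on or after enrichment_start change.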

Probability Boost

def probability_boost(metrics, **kwargs):
    """Increase sale probability for treated products"""
    # Behaves the same as quantity_boost: for records that already exist,
    # a higher sale probability shows up as a higher ordered quantity

Combined Boost (Realistic)

def combined_boost(metrics, effect_size=0.5, ramp_days=7, enrichment_fraction=0.3,
                   enrichment_start="2024-11-15", seed=42, **kwargs):
    """Gradual rollout with partial treatment"""
    # Realistic implementation with:
    # - Gradual effect ramp-up over ramp_days
    # - Partial product treatment (enrichment_fraction)
    # - Date-based activation (enrichment_start)
    # Returns: List of modified metric dictionaries
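The gradual ramp-up can be illustrated with a small helper. The linear shape of the ramp is an assumption (the document only says the effect ramps up over ramp_days), and `ramp_multiplier` is a hypothetical name.

```python
from datetime import date, timedelta


def ramp_multiplier(current_date, enrichment_start, effect_size=0.5, ramp_days=7):
    """Multiplier for ordered units on current_date, assuming a linear ramp."""
    days_since_start = (current_date - enrichment_start).days
    if days_since_start < 0:
        return 1.0  # treatment has not started yet
    if ramp_days <= 0:
        return 1.0 + effect_size  # no ramp: full effect immediately
    progress = min(1.0, days_since_start / ramp_days)
    return 1.0 + effect_size * progress


start = date(2024, 11, 15)
halfway = ramp_multiplier(start + timedelta(days=2), start, ramp_days=4)
full = ramp_multiplier(start + timedelta(days=30), start)
```

Under this assumption the multiplier is 1.0 before and on the start date, rises linearly, and stays at `1 + effect_size` once ramp_days have elapsed.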

Configuration

For complete configuration schema and parameter documentation, see the Configuration Guide.

Extension Points

Custom Enrichment Functions

from online_retail_simulator.enrich import register_enrichment_function

def my_custom_effect(df, my_param, **kwargs):
    """Custom treatment effect implementation"""
    modified_df = df.copy()  # work on a copy so the input is left untouched
    # Your logic here, using my_param
    return modified_df

# Register for use
register_enrichment_function("my_effect", my_custom_effect)

Custom Synthesizers

# Extend synthesizer support
class MyCustomSynthesizer:
    def fit(self, data):
        """Train on seed data"""

    def sample(self, num_rows):
        """Generate synthetic data"""