# Architecture
The Online Retail Simulator follows a modular, configuration-driven architecture that supports multiple generation modes and extensible enrichment capabilities.
## Core Design Principles
### 1. Configuration-Driven Workflow

All simulation behavior is controlled through YAML configuration files, enabling:

- Reproducible experiments with version-controlled configs
- Easy parameter sweeps and scenario testing
- Clear separation of logic and parameters
### 2. Modular Architecture

The system is organized into distinct, loosely coupled modules:

- **Core**: Shared infrastructure, including `FunctionRegistry` for extensible function registration
- **Simulation**: Core data generation logic
- **Enrichment**: Treatment effect application
- **Configuration**: Parameter processing and validation
- **Storage**: Data persistence and retrieval
### 3. Mode-Based Generation

Two complementary approaches for different use cases:

- **Rule-based**: Deterministic, interpretable patterns
- **Synthesizer-based**: ML-learned patterns from real data
### 4. Reproducible Output

Seed-based deterministic generation (illustrated below) ensures:

- Consistent results across runs
- Reliable A/B testing scenarios
- Debuggable data generation
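For example, two runs with the same configuration should produce identical outputs. A minimal sketch, assuming a top-level import path and a `SEED` config field (both hypothetical; the real entry points appear under Data Flow below):

```python
# Sketch only: the import path and SEED field are assumptions, not documented API.
from online_retail_simulator import load_job_results, simulate_characteristics

config = {"SEED": 42, "RULE": {"CHARACTERISTICS": {"PARAMS": {}}, "METRICS": {"PARAMS": {}}}}

# Two independent runs with the same seed should yield identical catalogs.
df_a = load_job_results(simulate_characteristics(config))["products"]
df_b = load_job_results(simulate_characteristics(config))["products"]
assert df_a.equals(df_b)
```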
## System Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                     Configuration Layer                      │
├─────────────────────────────────────────────────────────────┤
│  config_processor.py  │  config_defaults.yaml               │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Simulation Layer                       │
├─────────────────────────────────────────────────────────────┤
│  simulate.py (orchestrator)                                 │
│  ├── simulate_characteristics.py                            │
│  │   ├── characteristics_rule_based.py                      │
│  │   └── characteristics_synthesizer_based.py               │
│  └── simulate_metrics.py                                    │
│      ├── metrics_rule_based.py                              │
│      └── metrics_synthesizer_based.py                       │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                       Enrichment Layer                       │
├─────────────────────────────────────────────────────────────┤
│  enrich.py (orchestrator)                                   │
│  ├── enrichment.py                                          │
│  ├── enrichment_library.py                                  │
│  └── enrichment_registry.py                                 │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                        Storage Layer                         │
├─────────────────────────────────────────────────────────────┤
│  JSON/CSV export  │  Pandas DataFrames  │  Pickle models    │
└─────────────────────────────────────────────────────────────┘
```
## Data Flow
### 1. Configuration Processing

```python
# config_processor.py
import yaml

def load_config(config_path):
    """Load and validate configuration, merging in defaults."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    return merge_with_defaults(config)
```
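`merge_with_defaults` could overlay user values on the packaged defaults recursively. A minimal sketch, assuming defaults live in `config_defaults.yaml` and that user values win key-by-key (both assumptions, not documented behavior):

```python
import yaml

def merge_with_defaults(config, defaults_path="config_defaults.yaml"):
    # Load packaged defaults (path assumed for illustration).
    with open(defaults_path) as f:
        defaults = yaml.safe_load(f)

    def deep_merge(base, override):
        # Recursively overlay override onto base; override wins on conflicts.
        merged = dict(base)
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = deep_merge(merged[key], value)
            else:
                merged[key] = value
        return merged

    return deep_merge(defaults, config)
```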
### 2. Quality Score

Products include a `quality_score` (0.0-1.0) that reflects data quality based on title, description, features, and brand. The score is calculated after product details are generated (not after characteristics, since at that stage there is no content to evaluate).
| Stage | Typical Score | Reason |
|---|---|---|
| After characteristics | N/A | No quality_score (only identifier, category, price) |
| After product details | ~0.70-0.85 | Title, description, brand, features added |
| After enrichment (treated) | ~0.85+ | Enhanced content (if quality_boost applied) |
**Score Components** (sketched below):

- Title quality (30%): title length (up to 50 chars)
- Description quality (35%): description length (up to 100 chars)
- Features quality (20%): features list (up to 4 items)
- Brand (15%): brand field populated
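A minimal sketch of how these weights could combine, assuming linear credit up to each stated cap (the simulator's exact formula may differ):

```python
# Illustrative only: assumes linear scaling up to each component's cap.
def compute_quality_score(product: dict) -> float:
    title_q = min(len(product.get("title", "")) / 50, 1.0)        # 30%
    desc_q = min(len(product.get("description", "")) / 100, 1.0)  # 35%
    feat_q = min(len(product.get("features", [])) / 4, 1.0)       # 20%
    brand_q = 1.0 if product.get("brand") else 0.0                # 15%
    return 0.30 * title_q + 0.35 * desc_q + 0.20 * feat_q + 0.15 * brand_q
```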
**Impact on Metrics:** The quality score affects conversion probability in metrics simulation. If `quality_score` is not present (e.g., right after characteristics), a neutral default of 0.5 is used:

```python
# Maps quality_score [0, 1] to a multiplier in [0.8, 1.2];
# the default of 0.5 gives a multiplier of 1.0 (no effect).
quality_score = product.get("quality_score", 0.5)
quality_multiplier = 0.8 + (quality_score * 0.4)
adjusted_sale_prob = sale_prob * quality_multiplier
```
### 3. Two-Phase Generation

#### Phase 1: Product Characteristics

```python
# Generate the product catalog; returns a JobInfo handle
job_info = simulate_characteristics(config)

# Output: JobInfo (products.csv saved to the job directory)
results = load_job_results(job_info)
products_df = results["products"]
```
#### Phase 2: Product Metrics

```python
# Generate product metrics; takes the JobInfo from Phase 1
job_info = simulate_metrics(job_info, config)

# Output: JobInfo (metrics.csv saved to the job directory)
results = load_job_results(job_info)
metrics_df = results["metrics"]
```
### 4. Optional Enrichment

```python
# Apply treatment effects to a baseline job
enriched_job = enrich("enrichment_config.yaml", baseline_job)
# Output: JobInfo for the enriched results
```

Enrichment functions can optionally update `quality_score` for treated products via the `quality_boost` parameter:

```yaml
IMPACT:
  FUNCTION: "product_detail_boost"
  PARAMS:
    quality_boost: 0.15  # Optional: adds +0.15 to treated products' quality_score
```
## Backend Plugin Architecture
The simulation system uses a plugin architecture for backend dispatch, making it easy to add new generation backends without modifying core orchestration code.
### Core Components

```python
# core/backends.py
from abc import ABC, abstractmethod

import pandas as pd


class SimulationBackend(ABC):
    """Abstract base class for simulation backends."""

    @abstractmethod
    def simulate_characteristics(self) -> pd.DataFrame:
        """Generate product characteristics."""

    @abstractmethod
    def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame:
        """Generate metrics based on characteristics."""

    @classmethod
    @abstractmethod
    def get_key(cls) -> str:
        """Config key that triggers this backend (e.g., 'RULE')."""


class BackendRegistry:
    """Registry for discovering and instantiating backends."""

    @classmethod
    def register(cls, backend_cls):
        """Register a backend class."""

    @classmethod
    def detect_backend(cls, config) -> SimulationBackend:
        """Detect and instantiate the appropriate backend from config."""
```
### Built-in Backends

| Backend | Config Key | Description |
|---|---|---|
| RuleBackend | `RULE` | Deterministic rule-based generation |
| SynthesizerBackend | `SYNTHESIZER` | ML-based generation using SDV |
### Backend Detection

The system automatically detects which backend to use based on config keys:

```python
# Config with a RULE key -> RuleBackend
config = {"RULE": {"CHARACTERISTICS": {...}, "METRICS": {...}}}

# Config with a SYNTHESIZER key -> SynthesizerBackend
config = {"SYNTHESIZER": {"CHARACTERISTICS": {...}, "METRICS": {...}}}
```
### Adding Custom Backends

To add a new backend (e.g., CTGAN, TVAE):

```python
import pandas as pd

from online_retail_simulator.core.backends import (
    BackendRegistry,
    SimulationBackend,
)


@BackendRegistry.register
class CTGANBackend(SimulationBackend):
    @classmethod
    def get_key(cls) -> str:
        return "CTGAN"

    def simulate_characteristics(self) -> pd.DataFrame:
        # Your CTGAN implementation
        ...

    def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame:
        # Your CTGAN implementation
        ...
```
Once registered, use it with:

```yaml
CTGAN:
  CHARACTERISTICS:
    PARAMS: {...}
  METRICS:
    PARAMS: {...}
```
## Enrichment System

### Function Registry

The system uses a unified `FunctionRegistry` class for all extensible function types:

```python
# core/registry.py
class FunctionRegistry:
    def register(self, name, func):
        """Register a function with signature validation."""

    def get(self, name):
        """Retrieve a registered function (lazy-loads defaults)."""

    def list(self):
        """List all registered function names."""
```

Both simulation and enrichment registries use this common infrastructure.
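A hypothetical usage sketch (instance creation and the registered name are illustrative, not documented API):

```python
# Illustrative only.
registry = FunctionRegistry()
registry.register("quantity_boost", quantity_boost)  # validates the signature

fn = registry.get("quantity_boost")  # lazy-loads built-in defaults on first access
print(registry.list())               # e.g. ['quantity_boost', ...]
```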
### Built-in Impact Functions

#### Quantity Boost

```python
def quantity_boost(metrics, effect_size=0.5, enrichment_fraction=0.3,
                   enrichment_start="2024-11-15", seed=42, **kwargs):
    """Simple multiplicative increase in ordered units."""
    # Boosts ordered_units by effect_size for enriched products
    # Returns: list of modified metric dictionaries
```

#### Probability Boost

```python
def probability_boost(metrics, **kwargs):
    """Increase sale probability for treated products."""
    # Same as quantity_boost (the probability change is reflected in
    # quantity for existing records)
```

#### Combined Boost (Realistic)

```python
def combined_boost(metrics, effect_size=0.5, ramp_days=7, enrichment_fraction=0.3,
                   enrichment_start="2024-11-15", seed=42, **kwargs):
    """Gradual rollout with partial treatment."""
    # Realistic implementation with:
    # - Gradual effect ramp-up over ramp_days
    # - Partial product treatment (enrichment_fraction)
    # - Date-based activation (enrichment_start)
    # Returns: list of modified metric dictionaries
```
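The ramp logic could look roughly like the sketch below. The metric record fields (`product_id`, `date`, `ordered_units`) and the linear ramp are assumptions for illustration, not the actual implementation:

```python
import random
from datetime import date

def combined_boost_sketch(metrics, effect_size=0.5, ramp_days=7,
                          enrichment_fraction=0.3,
                          enrichment_start="2024-11-15", seed=42, **kwargs):
    # Field names (product_id, date, ordered_units) are assumed.
    rng = random.Random(seed)
    start = date.fromisoformat(enrichment_start)
    # Deterministically choose the treated fraction of products.
    product_ids = sorted({m["product_id"] for m in metrics})
    treated = {p for p in product_ids if rng.random() < enrichment_fraction}
    out = []
    for record in metrics:
        record = dict(record)
        day = date.fromisoformat(record["date"])
        if record["product_id"] in treated and day >= start:
            ramp = min((day - start).days / ramp_days, 1.0)  # linear ramp-up
            record["ordered_units"] = round(record["ordered_units"] * (1 + effect_size * ramp))
        out.append(record)
    return out
```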
## Configuration

For the complete configuration schema and parameter documentation, see the Configuration Guide.
## Extension Points

### Custom Enrichment Functions

```python
def my_custom_effect(df, my_param, **kwargs):
    """Custom treatment effect implementation."""
    modified_df = df.copy()
    # Your logic here
    return modified_df


# Register for use
from online_retail_simulator.enrich import register_enrichment_function

register_enrichment_function("my_effect", my_custom_effect)
```
### Custom Synthesizers

```python
# Extend synthesizer support
class MyCustomSynthesizer:
    def fit(self, data):
        """Train on seed data."""

    def sample(self, num_rows):
        """Generate synthetic data."""
```