# Architecture The Online Retail Simulator follows a modular, configuration-driven architecture that supports multiple generation modes and extensible enrichment capabilities. ## Core Design Principles ### 1. Configuration-Driven Workflow All simulation behavior is controlled through YAML configuration files, enabling: - Reproducible experiments with version-controlled configs - Easy parameter sweeps and scenario testing - Clear separation of logic and parameters ### 2. Modular Architecture The system is organized into distinct, loosely-coupled modules: - **Core**: Shared infrastructure including `FunctionRegistry` for extensible function registration - **Simulation**: Core data generation logic - **Enrichment**: Treatment effect application - **Configuration**: Parameter processing and validation - **Storage**: Data persistence and retrieval ### 3. Mode-Based Generation Two complementary approaches for different use cases: - **Rule-based**: Deterministic, interpretable patterns - **Synthesizer-based**: ML-learned patterns from real data ### 4. Reproducible Output Seed-based deterministic generation ensures: - Consistent results across runs - Reliable A/B testing scenarios - Debuggable data generation ## System Architecture ``` ┌─────────────────────────────────────────────────────────────┐ │ Configuration Layer │ ├─────────────────────────────────────────────────────────────┤ │ config_processor.py │ config_defaults.yaml │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Simulation Layer │ ├─────────────────────────────────────────────────────────────┤ │ simulate.py (orchestrator) │ │ ├── simulate_characteristics.py │ │ │ ├── characteristics_rule_based.py │ │ │ └── characteristics_synthesizer_based.py │ │ └── simulate_metrics.py │ │ ├── metrics_rule_based.py │ │ └── metrics_synthesizer_based.py │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Enrichment Layer │ ├─────────────────────────────────────────────────────────────┤ │ enrich.py (orchestrator) │ │ ├── enrichment.py │ │ ├── enrichment_library.py │ │ └── enrichment_registry.py │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Storage Layer │ ├─────────────────────────────────────────────────────────────┤ │ JSON/CSV export │ Pandas DataFrames │ Pickle models │ └─────────────────────────────────────────────────────────────┘ ``` ## Data Flow ### 1. Configuration Processing ```python # config_processor.py def load_config(config_path): """Load and validate configuration with defaults""" config = yaml.load(config_path) return merge_with_defaults(config) ``` ### 2. Quality Score Products include a `quality_score` (0.0 - 1.0) that reflects data quality based on title, description, features, and brand. The score is calculated after product details are generated (not after characteristics, since there's no content to evaluate). | Stage | Typical Score | Reason | |-------|---------------|--------| | After characteristics | N/A | No quality_score (only identifier, category, price) | | After product details | ~0.70-0.85 | Title, description, brand, features added | | After enrichment (treated) | ~0.85+ | Enhanced content (if quality_boost applied) | **Score Components:** - Title quality (30%): Title length (up to 50 chars) - Description quality (35%): Description length (up to 100 chars) - Features quality (20%): Features list (up to 4 items) - Brand (15%): Brand field populated **Impact on Metrics:** Quality score affects conversion probability in metrics simulation. If quality_score is not present (e.g., right after characteristics), a neutral default of 0.5 is used: ```python # Maps quality_score [0,1] to multiplier [0.8, 1.2] # Default 0.5 = multiplier 1.0 (no effect) quality_score = product.get("quality_score", 0.5) quality_multiplier = 0.8 + (quality_score * 0.4) adjusted_sale_prob = sale_prob * quality_multiplier ``` ### 3. Two-Phase Generation #### Phase 1: Product Characteristics ```python # Generate product catalog, returns JobInfo job_info = simulate_characteristics(config) # Output: JobInfo (products.csv saved to job directory) results = load_job_results(job_info) products_df = results["products"] ``` #### Phase 2: Product Metrics ```python # Generate product metrics, takes JobInfo job_info = simulate_metrics(job_info, config) # Output: JobInfo (metrics.csv saved to job directory) results = load_job_results(job_info) metrics_df = results["metrics"] ``` ### 4. Optional Enrichment ```python # Apply treatment effects enriched_job = enrich("enrichment_config.yaml", baseline_job) # Output: JobInfo for enriched results ``` Enrichment functions can optionally update `quality_score` for treated products using the `quality_boost` parameter: ```yaml IMPACT: FUNCTION: "product_detail_boost" PARAMS: quality_boost: 0.15 # Optional: adds +0.15 to treated products' quality_score ``` ## Backend Plugin Architecture The simulation system uses a plugin architecture for backend dispatch, making it easy to add new generation backends without modifying core orchestration code. ### Core Components ```python # core/backends.py class SimulationBackend(ABC): """Abstract base class for simulation backends.""" def simulate_characteristics(self) -> pd.DataFrame: """Generate product characteristics.""" ... def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame: """Generate metrics based on characteristics.""" ... @classmethod def get_key(cls) -> str: """Config key that triggers this backend (e.g., 'RULE').""" ... class BackendRegistry: """Registry for discovering and instantiating backends.""" @classmethod def register(cls, backend_cls): """Register a backend class.""" @classmethod def detect_backend(cls, config) -> SimulationBackend: """Detect and instantiate appropriate backend from config.""" ``` ### Built-in Backends | Backend | Config Key | Description | |---------|------------|-------------| | `RuleBackend` | `RULE` | Deterministic rule-based generation | | `SynthesizerBackend` | `SYNTHESIZER` | ML-based generation using SDV | ### Backend Detection The system automatically detects which backend to use based on config keys: ```python # Config with RULE key -> RuleBackend config = {"RULE": {"CHARACTERISTICS": {...}, "METRICS": {...}}} # Config with SYNTHESIZER key -> SynthesizerBackend config = {"SYNTHESIZER": {"CHARACTERISTICS": {...}, "METRICS": {...}}} ``` ### Adding Custom Backends To add a new backend (e.g., CTGAN, TVAE): ```python from online_retail_simulator.core.backends import ( BackendRegistry, SimulationBackend, ) @BackendRegistry.register class CTGANBackend(SimulationBackend): @classmethod def get_key(cls) -> str: return "CTGAN" def simulate_characteristics(self) -> pd.DataFrame: # Your CTGAN implementation ... def simulate_metrics(self, product_characteristics: pd.DataFrame) -> pd.DataFrame: # Your CTGAN implementation ... ``` Once registered, use it with: ```yaml CTGAN: CHARACTERISTICS: PARAMS: {...} METRICS: PARAMS: {...} ``` ## Enrichment System ### Function Registry The system uses a unified `FunctionRegistry` class for all extensible function types: ```python # core/registry.py class FunctionRegistry: def register(self, name, func): """Register function with signature validation""" def get(self, name): """Retrieve registered function (lazy loads defaults)""" def list(self): """List all registered function names""" ``` Both simulation and enrichment registries use this common infrastructure. ### Built-in Impact Functions #### Quantity Boost ```python def quantity_boost(metrics, effect_size=0.5, enrichment_fraction=0.3, enrichment_start="2024-11-15", seed=42, **kwargs): """Simple multiplicative increase in ordered units""" # Boosts ordered_units by effect_size for enriched products # Returns: List of modified metric dictionaries ``` #### Probability Boost ```python def probability_boost(metrics, **kwargs): """Increase sale probability for treated products""" # Same as quantity_boost (probability reflected in quantity for existing records) ``` #### Combined Boost (Realistic) ```python def combined_boost(metrics, effect_size=0.5, ramp_days=7, enrichment_fraction=0.3, enrichment_start="2024-11-15", seed=42, **kwargs): """Gradual rollout with partial treatment""" # Realistic implementation with: # - Gradual effect ramp-up over ramp_days # - Partial product treatment (enrichment_fraction) # - Date-based activation (enrichment_start) # Returns: List of modified metric dictionaries ``` ## Configuration For complete configuration schema and parameter documentation, see the [Configuration Guide](configuration.md). ## Extension Points ### Custom Enrichment Functions ```python def my_custom_effect(df, my_param, **kwargs): """Custom treatment effect implementation""" # Your logic here return modified_df # Register for use from online_retail_simulator.enrich import register_enrichment_function register_enrichment_function("my_effect", my_custom_effect) ``` ### Custom Synthesizers ```python # Extend synthesizer support class MyCustomSynthesizer: def fit(self, data): """Train on seed data""" def sample(self, num_rows): """Generate synthetic data""" ```