Download the notebook here!
Interactive online version:
Demo
This notebook provides a high-level overview of the Online Retail Simulator package and its capabilities.
What is Online Retail Simulator?
A Python package for generating synthetic e-commerce data for:
Testing and demos without exposing real business data
ML model training with realistic retail patterns
A/B test simulation and experimentation
Teaching analytics and data science concepts
Key Capabilities
Rule-based generation: Fast, configurable synthetic data
ML-based synthesis: Learn patterns from real data (optional SDV integration)
Reproducible results: Seed control for deterministic output
8 product categories: Electronics, Books, Clothing, and more
Funnel metrics: Impressions, visits, cart adds, orders
Setup
First, let’s install the package (if running in Colab) and import the necessary libraries.
[1]:
# Uncomment if running in Google Colab
# !pip install online-retail-simulator matplotlib seaborn
[2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from online_retail_simulator import simulate, load_job_results
# Set plot style
sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)
Generate Sample Data
We’ll generate 30 days of synthetic sales data with a simple configuration.
[3]:
import os
# Run simulation using config file
config_path = os.path.join(os.path.dirname(__file__) if "__file__" in dir() else ".", "config_demo.yaml")
job_info = simulate(config_path)
# Load results
results = load_job_results(job_info)
products_df = results["products"]
metrics_df = results["metrics"]
print(f"Generated {len(products_df)} products")
print(f"Generated {len(metrics_df)} metrics records")
Generated 100 products
Generated 3000 metrics records
Exploring the Generated Data
Let’s look at the structure and contents of our synthetic dataset.
[4]:
# Preview the metrics data
print(f"Date range: {metrics_df['date'].min()} to {metrics_df['date'].max()}")
print(f"Categories: {metrics_df['category'].nunique()}")
print(f"Total revenue: ${metrics_df['revenue'].sum():,.2f}")
print()
metrics_df.head(10)
Date range: 2024-11-01 to 2024-11-30
Categories: 8
Total revenue: $110,511.49
[4]:
| product_identifier | category | price | date | impressions | visits | cart_adds | ordered_units | revenue | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | B1P4DZHDS9 | Electronics | 686.37 | 2024-11-01 | 0 | 0 | 0 | 0 | 0.00 |
| 1 | B1SE4QSNG7 | Toys & Games | 80.75 | 2024-11-01 | 100 | 16 | 3 | 3 | 242.25 |
| 2 | BXTPQIDT5C | Food & Beverage | 42.02 | 2024-11-01 | 0 | 0 | 0 | 0 | 0.00 |
| 3 | B3F1ZMC8Q6 | Food & Beverage | 33.42 | 2024-11-01 | 0 | 0 | 0 | 0 | 0.00 |
| 4 | B2NQRBTF0Y | Toys & Games | 27.52 | 2024-11-01 | 25 | 3 | 0 | 0 | 0.00 |
| 5 | B0OL6NCQ2G | Health & Beauty | 77.66 | 2024-11-01 | 50 | 7 | 1 | 0 | 0.00 |
| 6 | BELIUY7PF3 | Books | 33.79 | 2024-11-01 | 10 | 1 | 0 | 0 | 0.00 |
| 7 | BZ13P24N6K | Toys & Games | 38.11 | 2024-11-01 | 0 | 0 | 0 | 0 | 0.00 |
| 8 | BY3H2A222X | Clothing | 40.85 | 2024-11-01 | 200 | 34 | 9 | 1 | 40.85 |
| 9 | BZUQSUBFIE | Books | 49.04 | 2024-11-01 | 10 | 1 | 0 | 0 | 0.00 |
Revenue by Category
How is revenue distributed across product categories?
[5]:
# Revenue by category
category_revenue = metrics_df.groupby("category")["revenue"].sum().sort_values()
fig, ax = plt.subplots(figsize=(10, 6))
category_revenue.plot(kind="barh", ax=ax, color=sns.color_palette("viridis", len(category_revenue)))
ax.set_xlabel("Revenue ($)")
ax.set_ylabel("Category")
ax.set_title("Total Revenue by Category")
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f"${x:,.0f}"))
plt.tight_layout()
plt.show()
Daily Sales Trend
How do sales vary over time?
[6]:
# Daily sales trend
daily_sales = metrics_df.groupby("date").agg({
"ordered_units": "sum",
"revenue": "sum"
}).reset_index()
daily_sales["date"] = pd.to_datetime(daily_sales["date"])
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(daily_sales["date"], daily_sales["revenue"], marker="o", linewidth=2, markersize=4)
ax.fill_between(daily_sales["date"], daily_sales["revenue"], alpha=0.3)
ax.set_xlabel("Date")
ax.set_ylabel("Revenue ($)")
ax.set_title("Daily Revenue Trend")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f"${x:,.0f}"))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Conversion Funnel
The data includes full customer journey metrics: impressions, visits, cart adds, and orders.
[7]:
# Conversion funnel
funnel_data = {
"Impressions": metrics_df["impressions"].sum(),
"Visits": metrics_df["visits"].sum(),
"Cart Adds": metrics_df["cart_adds"].sum(),
"Orders": metrics_df["ordered_units"].sum()
}
stages = list(funnel_data.keys())
values = list(funnel_data.values())
fig, ax = plt.subplots(figsize=(10, 6))
colors = sns.color_palette("Blues_r", len(stages))
bars = ax.barh(stages[::-1], values[::-1], color=colors)
ax.set_xlabel("Count")
ax.set_title("Customer Journey Funnel")
# Add value labels
for bar, val in zip(bars, values[::-1]):
ax.text(val + max(values) * 0.01, bar.get_y() + bar.get_height() / 2,
f"{val:,}", va="center", fontsize=10)
# Add conversion rates
print("Conversion Rates:")
print(f" Impressions → Visits: {values[1]/values[0]*100:.1f}%")
print(f" Visits → Cart Adds: {values[2]/values[1]*100:.1f}%")
print(f" Cart Adds → Orders: {values[3]/values[2]*100:.1f}%")
print(f" Overall (Impressions → Orders): {values[3]/values[0]*100:.2f}%")
plt.tight_layout()
plt.show()
Conversion Rates:
Impressions → Visits: 13.8%
Visits → Cart Adds: 16.4%
Cart Adds → Orders: 37.2%
Overall (Impressions → Orders): 0.84%
Descriptive Analysis
Let’s dive deeper into the data patterns.
Distribution of Order Values
[8]:
# Distribution of revenue per transaction
fig, ax = plt.subplots(figsize=(10, 5))
sns.histplot(metrics_df["revenue"], bins=50, kde=True, ax=ax)
ax.set_xlabel("Revenue ($)")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Transaction Revenue")
ax.axvline(metrics_df["revenue"].mean(), color="red", linestyle="--", label=f"Mean: ${metrics_df['revenue'].mean():,.2f}")
ax.axvline(metrics_df["revenue"].median(), color="orange", linestyle="--", label=f"Median: ${metrics_df['revenue'].median():,.2f}")
ax.legend()
plt.tight_layout()
plt.show()
Units per Order by Category
[9]:
# Units per order by category
fig, ax = plt.subplots(figsize=(12, 6))
order = metrics_df.groupby("category")["ordered_units"].median().sort_values().index
sns.boxplot(data=metrics_df, x="category", y="ordered_units", order=order, palette="viridis", ax=ax)
ax.set_xlabel("Category")
ax.set_ylabel("Ordered Units")
ax.set_title("Distribution of Ordered Units by Category")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()
/tmp/ipykernel_3256/1803871912.py:4: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.
sns.boxplot(data=metrics_df, x="category", y="ordered_units", order=order, palette="viridis", ax=ax)
Correlation Between Metrics
[10]:
# Correlation heatmap of numeric metrics
numeric_cols = ["price", "impressions", "visits", "cart_adds", "ordered_units", "revenue"]
correlation_matrix = metrics_df[numeric_cols].corr()
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", center=0,
fmt=".2f", square=True, ax=ax, linewidths=0.5)
ax.set_title("Correlation Matrix of Sales Metrics")
plt.tight_layout()
plt.show()
Enrichment: Simulating Treatment Effects
The package can simulate treatment effects (e.g., A/B test outcomes) by boosting sales for a subset of products starting at a specific date.
[11]:
from online_retail_simulator import enrich
# Apply enrichment using config file (boost sales by 50% for 30% of products starting Nov 15)
enrich_config_path = os.path.join(os.path.dirname(__file__) if "__file__" in dir() else ".", "config_enrichment.yaml")
enriched_job = enrich(enrich_config_path, job_info)
# Load enriched results
enriched_results = load_job_results(enriched_job)
enriched_df = enriched_results["enriched"]
print(f"Applied enrichment to {len(enriched_df)} records")
Applied enrichment to 3000 records
[12]:
# Compare before and after: daily revenue time series
daily_original = metrics_df.groupby("date")["revenue"].sum().reset_index()
daily_original["date"] = pd.to_datetime(daily_original["date"])
daily_original["type"] = "Original"
daily_enriched = enriched_df.groupby("date")["revenue"].sum().reset_index()
daily_enriched["date"] = pd.to_datetime(daily_enriched["date"])
daily_enriched["type"] = "Enriched"
# Plot comparison
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(daily_original["date"], daily_original["revenue"],
marker="o", linewidth=2, markersize=4, label="Original", color="#1f77b4")
ax.plot(daily_enriched["date"], daily_enriched["revenue"],
marker="s", linewidth=2, markersize=4, label="Enriched", color="#2ca02c")
# Mark enrichment start
enrichment_start = pd.to_datetime("2024-11-15")
ax.axvline(enrichment_start, color="red", linestyle="--", alpha=0.7, label="Enrichment Start")
ax.set_xlabel("Date")
ax.set_ylabel("Revenue ($)")
ax.set_title("Daily Revenue: Before vs After Enrichment")
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f"${x:,.0f}"))
ax.legend()
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
# Print lift metrics
post_start = enriched_df["date"] >= "2024-11-15"
original_post_revenue = metrics_df[metrics_df["date"] >= "2024-11-15"]["revenue"].sum()
enriched_post_revenue = enriched_df[post_start]["revenue"].sum()
lift = (enriched_post_revenue / original_post_revenue - 1) * 100
print(f"\nPost-enrichment period (Nov 15-30):")
print(f" Original revenue: ${original_post_revenue:,.2f}")
print(f" Enriched revenue: ${enriched_post_revenue:,.2f}")
print(f" Revenue lift: {lift:.1f}%")
Post-enrichment period (Nov 15-30):
Original revenue: $51,291.95
Enriched revenue: $132,747.41
Revenue lift: 158.8%
Next Steps
This overview covers the basics of generating and exploring synthetic retail data. For more details:
Full Documentation: Online Retail Simulator Docs
Configuration Reference: Learn about all available parameters
API Reference: Detailed function documentation
Demo Scripts: See
demo/directory for more examples
Key Functions
# Core simulation
simulate(config_path) # Generate complete dataset
simulate_products() # Generate product catalog only
simulate_metrics() # Generate sales metrics
# Enrichment
enrich(config_path, job) # Apply treatment effects
# Results management
load_job_results(job) # Load all results
list_jobs() # List saved jobs
cleanup_old_jobs(days=30) # Clean up old outputs