{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "# Demo\n\nThis notebook provides a high-level overview of the **Online Retail Simulator** package and its capabilities.\n\n## What is Online Retail Simulator?\n\nA Python package for generating **synthetic e-commerce data** for:\n- Testing and demos without exposing real business data\n- ML model training with realistic retail patterns\n- A/B test simulation and experimentation\n- Teaching analytics and data science concepts\n\n## Key Capabilities\n\n- **Rule-based generation**: Fast, configurable synthetic data\n- **ML-based synthesis**: Learn patterns from real data (optional SDV integration)\n- **Reproducible results**: Seed control for deterministic output\n- **8 product categories**: Electronics, Books, Clothing, and more\n- **Funnel metrics**: Impressions, visits, cart adds, orders" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "First, let's install the package (if running in Colab) and import the necessary libraries." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Uncomment if running in Google Colab\n", "# !pip install online-retail-simulator matplotlib seaborn" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from online_retail_simulator import simulate, load_job_results\n", "\n", "# Set plot style\n", "sns.set_theme(style=\"whitegrid\")\n", "plt.rcParams[\"figure.figsize\"] = (10, 6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generate Sample Data\n", "\n", "We'll generate 30 days of synthetic sales data with a simple configuration." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "import os\n\n# Run simulation using config file\nconfig_path = os.path.join(os.path.dirname(__file__) if \"__file__\" in dir() else \".\", \"config_demo.yaml\")\njob_info = simulate(config_path)\n\n# Load results\nresults = load_job_results(job_info)\nproducts_df = results[\"products\"]\nmetrics_df = results[\"metrics\"]\n\nprint(f\"Generated {len(products_df)} products\")\nprint(f\"Generated {len(metrics_df)} metrics records\")" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Generated Data\n", "\n", "Let's look at the structure and contents of our synthetic dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Preview the metrics data\nprint(f\"Date range: {metrics_df['date'].min()} to {metrics_df['date'].max()}\")\nprint(f\"Categories: {metrics_df['category'].nunique()}\")\nprint(f\"Total revenue: ${metrics_df['revenue'].sum():,.2f}\")\nprint()\nmetrics_df.head(10)" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Revenue by Category\n", "\n", "How is revenue distributed across product categories?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Revenue by category\ncategory_revenue = metrics_df.groupby(\"category\")[\"revenue\"].sum().sort_values()\n\nfig, ax = plt.subplots(figsize=(10, 6))\ncategory_revenue.plot(kind=\"barh\", ax=ax, color=sns.color_palette(\"viridis\", len(category_revenue)))\nax.set_xlabel(\"Revenue ($)\")\nax.set_ylabel(\"Category\")\nax.set_title(\"Total Revenue by Category\")\nax.xaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f\"${x:,.0f}\"))\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Daily Sales Trend\n", "\n", "How do sales vary over time?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Daily sales trend\ndaily_sales = metrics_df.groupby(\"date\").agg({\n    \"ordered_units\": \"sum\",\n    \"revenue\": \"sum\"\n}).reset_index()\ndaily_sales[\"date\"] = pd.to_datetime(daily_sales[\"date\"])\n\nfig, ax = plt.subplots(figsize=(12, 5))\nax.plot(daily_sales[\"date\"], daily_sales[\"revenue\"], marker=\"o\", linewidth=2, markersize=4)\nax.fill_between(daily_sales[\"date\"], daily_sales[\"revenue\"], alpha=0.3)\nax.set_xlabel(\"Date\")\nax.set_ylabel(\"Revenue ($)\")\nax.set_title(\"Daily Revenue Trend\")\nax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f\"${x:,.0f}\"))\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conversion Funnel\n", "\n", "The data covers the full customer journey: impressions, visits, cart adds, and orders." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Conversion funnel (the \"Orders\" stage counts ordered units)\nfunnel_data = {\n    \"Impressions\": metrics_df[\"impressions\"].sum(),\n    \"Visits\": metrics_df[\"visits\"].sum(),\n    \"Cart Adds\": metrics_df[\"cart_adds\"].sum(),\n    \"Orders\": metrics_df[\"ordered_units\"].sum()\n}\n\nstages = list(funnel_data.keys())\nvalues = list(funnel_data.values())\n\nfig, ax = plt.subplots(figsize=(10, 6))\ncolors = sns.color_palette(\"Blues_r\", len(stages))\nbars = ax.barh(stages[::-1], values[::-1], color=colors)\nax.set_xlabel(\"Count\")\nax.set_title(\"Customer Journey Funnel\")\n\n# Add value labels just past the end of each bar\nfor bar, val in zip(bars, values[::-1]):\n    ax.text(val + max(values) * 0.01, bar.get_y() + bar.get_height() / 2,\n            f\"{val:,}\", va=\"center\", fontsize=10)\n\n# Print stage-to-stage conversion rates\nprint(\"Conversion Rates:\")\nprint(f\" Impressions → Visits: {values[1]/values[0]*100:.1f}%\")\nprint(f\" Visits → Cart Adds: {values[2]/values[1]*100:.1f}%\")\nprint(f\" Cart Adds → Orders: 
{values[3]/values[2]*100:.1f}%\")\nprint(f\" Overall (Impressions → Orders): {values[3]/values[0]*100:.2f}%\")\n\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Descriptive Analysis\n", "\n", "Let's dive deeper into the data patterns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Distribution of Order Values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Distribution of revenue per transaction\nfig, ax = plt.subplots(figsize=(10, 5))\nsns.histplot(metrics_df[\"revenue\"], bins=50, kde=True, ax=ax)\nax.set_xlabel(\"Revenue ($)\")\nax.set_ylabel(\"Frequency\")\nax.set_title(\"Distribution of Transaction Revenue\")\nax.axvline(metrics_df[\"revenue\"].mean(), color=\"red\", linestyle=\"--\", label=f\"Mean: ${metrics_df['revenue'].mean():,.2f}\")\nax.axvline(metrics_df[\"revenue\"].median(), color=\"orange\", linestyle=\"--\", label=f\"Median: ${metrics_df['revenue'].median():,.2f}\")\nax.legend()\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Units per Order by Category" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Units per order by category\nfig, ax = plt.subplots(figsize=(12, 6))\norder = metrics_df.groupby(\"category\")[\"ordered_units\"].median().sort_values().index\nsns.boxplot(data=metrics_df, x=\"category\", y=\"ordered_units\", order=order, palette=\"viridis\", ax=ax)\nax.set_xlabel(\"Category\")\nax.set_ylabel(\"Ordered Units\")\nax.set_title(\"Distribution of Ordered Units by Category\")\nplt.xticks(rotation=45, ha=\"right\")\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Correlation Between Metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Correlation heatmap of numeric metrics\nnumeric_cols = [\"price\", \"impressions\", \"visits\", 
\"cart_adds\", \"ordered_units\", \"revenue\"]\ncorrelation_matrix = metrics_df[numeric_cols].corr()\n\nfig, ax = plt.subplots(figsize=(8, 6))\nsns.heatmap(correlation_matrix, annot=True, cmap=\"coolwarm\", center=0,\n fmt=\".2f\", square=True, ax=ax, linewidths=0.5)\nax.set_title(\"Correlation Matrix of Sales Metrics\")\nplt.tight_layout()\nplt.show()" }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Enrichment: Simulating Treatment Effects\n", "\n", "The package can simulate treatment effects (e.g., A/B test outcomes) by boosting sales for a subset of products starting at a specific date." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "from online_retail_simulator import enrich\n\n# Apply enrichment using config file (boost sales by 50% for 30% of products starting Nov 15)\nenrich_config_path = os.path.join(os.path.dirname(__file__) if \"__file__\" in dir() else \".\", \"config_enrichment.yaml\")\nenriched_job = enrich(enrich_config_path, job_info)\n\n# Load enriched results\nenriched_results = load_job_results(enriched_job)\nenriched_df = enriched_results[\"enriched\"]\nprint(f\"Applied enrichment to {len(enriched_df)} records\")" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": "# Compare before and after: daily revenue time series\ndaily_original = metrics_df.groupby(\"date\")[\"revenue\"].sum().reset_index()\ndaily_original[\"date\"] = pd.to_datetime(daily_original[\"date\"])\ndaily_original[\"type\"] = \"Original\"\n\ndaily_enriched = enriched_df.groupby(\"date\")[\"revenue\"].sum().reset_index()\ndaily_enriched[\"date\"] = pd.to_datetime(daily_enriched[\"date\"])\ndaily_enriched[\"type\"] = \"Enriched\"\n\n# Plot comparison\nfig, ax = plt.subplots(figsize=(12, 6))\nax.plot(daily_original[\"date\"], daily_original[\"revenue\"], \n marker=\"o\", linewidth=2, markersize=4, label=\"Original\", color=\"#1f77b4\")\nax.plot(daily_enriched[\"date\"], 
daily_enriched[\"revenue\"],\n marker=\"s\", linewidth=2, markersize=4, label=\"Enriched\", color=\"#2ca02c\")\n\n# Mark enrichment start\nenrichment_start = pd.to_datetime(\"2024-11-15\")\nax.axvline(enrichment_start, color=\"red\", linestyle=\"--\", alpha=0.7, label=\"Enrichment Start\")\n\nax.set_xlabel(\"Date\")\nax.set_ylabel(\"Revenue ($)\")\nax.set_title(\"Daily Revenue: Original vs Enriched\")\nax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f\"${x:,.0f}\"))\nax.legend()\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()\n\n# Print lift metrics for the period on or after the enrichment start date\npost_start = enriched_df[\"date\"] >= \"2024-11-15\"\noriginal_post_revenue = metrics_df[metrics_df[\"date\"] >= \"2024-11-15\"][\"revenue\"].sum()\nenriched_post_revenue = enriched_df[post_start][\"revenue\"].sum()\nlift = (enriched_post_revenue / original_post_revenue - 1) * 100\n\nprint(\"\\nPost-enrichment period (Nov 15-30):\")\nprint(f\" Original revenue: ${original_post_revenue:,.2f}\")\nprint(f\" Enriched revenue: ${enriched_post_revenue:,.2f}\")\nprint(f\" Revenue lift: {lift:.1f}%\")" }, { "cell_type": "markdown", "metadata": {}, "source": "## Next Steps\n\nThis overview covers the basics of generating and exploring synthetic retail data. 
For more details:\n\n- **Full Documentation**: [Online Retail Simulator Docs](https://eisenhauerio.github.io/tools-catalog-generator/)\n- **Configuration Reference**: Learn about all available parameters\n- **API Reference**: Detailed function documentation\n- **Demo Scripts**: See `demo/` directory for more examples\n\n### Key Functions\n\n```python\n# Core simulation\nsimulate(config_path) # Generate complete dataset\nsimulate_products() # Generate product catalog only\nsimulate_metrics() # Generate sales metrics\n\n# Enrichment\nenrich(config_path, job) # Apply treatment effects\n\n# Results management\nload_job_results(job) # Load all results\nlist_jobs() # List saved jobs\ncleanup_old_jobs(days=30) # Clean up old outputs\n```" } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } }, "nbformat": 4, "nbformat_minor": 4 }