Download the notebook here! Interactive online version: colab

Matching and Subclassification

Reference: Causal Inference: The Mixtape, Chapter 5: Matching and Subclassification (pp. 175-230)

This lecture introduces methods for causal inference when selection into treatment depends on observed covariates. We apply these concepts using the Online Retail Simulator to answer: Can subclassification and matching recover the true treatment effect when confounding involves multiple continuous covariates?

Each application lecture follows the same five-step workflow. We frame a causal question in a business context, simulate data with known ground truth using the Online Retail Simulator, measure the treatment effect — first with a naive approach, then with a causal method — using the Impact Engine, evaluate how well each method recovers the truth, and tune the method’s parameters to understand how configuration choices affect the reliability of causal estimates.

Course workflow: Business Context, Simulate Data, Measure Impact, Evaluate Performance, Tune Parameters


Part I: Theory

This section covers the theoretical foundations of matching and subclassification as presented in Cunningham’s Causal Inference: The Mixtape, Chapter 5.

1. Selection on observables

In the previous lecture on directed acyclic graphs, we learned that the backdoor criterion tells us which variables to condition on to identify causal effects. When a set of observed covariates \(X\) satisfies the backdoor criterion, we have a situation called selection on observables. This means that treatment assignment, while not random, depends only on variables we can measure and control for.

The conditional independence assumption

The key identifying assumption for all methods in this lecture is the conditional independence assumption (CIA), also known as unconfoundedness or ignorability:

\[(Y^1, Y^0) \perp\!\!\!\perp D \mid X\]

where \(\perp\!\!\!\perp\) denotes statistical independence. This assumption states that, conditional on the observed covariates \(X\), the potential outcomes are independent of treatment assignment. In plain language: once we account for the variables in \(X\), treatment is “as good as random.”

What does this mean in practice? Consider two individuals with identical values of \(X\). Under CIA, the fact that one received treatment and the other did not is essentially random—there are no other factors systematically driving both their treatment status and their outcomes.

Implications of CIA

When CIA holds, the expected potential outcomes are equal across treatment groups for each value of \(X\):

\[E[Y^1 \mid D=1, X] = E[Y^1 \mid D=0, X] = E[Y^1 \mid X]\]
\[E[Y^0 \mid D=1, X] = E[Y^0 \mid D=0, X] = E[Y^0 \mid X]\]

This is powerful because it allows us to use observed outcomes from the control group to estimate the counterfactual for the treatment group (and vice versa) within each stratum defined by \(X\).

Common support

CIA alone is not sufficient for identification. We also need the common support (or overlap) assumption:

\[0 < \Pr(D=1 \mid X) < 1\]

This requires that for every value of \(X\), there is a positive probability of being in both the treatment and control groups. Without common support, we cannot compare treated and untreated units with similar characteristics—some regions of the covariate space would have only treated units or only control units.

Assumption

Formal Statement

Intuition

CIA

\((Y^1, Y^0) \perp\!\!\!\perp D \mid X\)

Treatment is “as good as random” given \(X\)

Common Support

\(0 < \Pr(D=1 \mid X) < 1\)

Both treated and control units exist for all \(X\)

Together, these assumptions allow us to identify causal effects by comparing outcomes across treatment groups within strata defined by \(X\).

From assumptions to identification

Under CIA and common support, the average treatment effect (ATE) is identified as:

\[\delta_{ATE} = E[Y^1 - Y^0] = \int \left( E[Y \mid X, D=1] - E[Y \mid X, D=0] \right) dF(X)\]

The conditional expectations \(E[Y \mid X, D=1]\) and \(E[Y \mid X, D=0]\) are directly estimable from data. The integral averages these conditional effects over the distribution of \(X\) in the population.

The challenge is how to implement this in practice. With continuous covariates or many discrete covariates, we cannot simply compute separate means for each unique value of \(X\). The methods we discuss below—subclassification and matching—provide practical approaches to this problem.

2. Subclassification

Subclassification is the most direct approach to satisfying the backdoor criterion. The idea is simple: divide the data into strata based on the confounding variable(s), compute treatment effects within each stratum, and then average across strata.

Motivating example: smoking and mortality

Consider a classic problem from epidemiology (simplified from Cochran 1968). We want to estimate the effect of smoking on mortality. The raw data shows that smokers have lower overall death rates than non-smokers—but should we believe this?

The catch is that smokers in this sample are much younger on average. Age is a confounder: it influences both who smokes and mortality risk. Comparing the two groups directly mixes the effect of smoking with the effect of age composition. Subclassification removes this confounding by comparing within age groups.

How subclassification works

The subclassification procedure consists of three steps:

  1. Stratify: Divide the sample into \(K\) mutually exclusive strata based on the covariate(s) \(X\)

  2. Compare: Within each stratum \(k\), compute the difference in mean outcomes between treated and control units

  3. Aggregate: Take a weighted average of the within-stratum effects

For the average treatment effect on the treated (ATT), the estimator is:

\[\hat{\delta}_{ATT} = \sum_{k=1}^{K} \left( \bar{Y}^{1k} - \bar{Y}^{0k} \right) \times \frac{N_T^k}{N_T}\]

where:

  • \(\bar{Y}^{1k}\) is the mean outcome for treated units in stratum \(k\)

  • \(\bar{Y}^{0k}\) is the mean outcome for control units in stratum \(k\)

  • \(N_T^k\) is the number of treated units in stratum \(k\)

  • \(N_T\) is the total number of treated units

The weights \(\frac{N_T^k}{N_T}\) ensure that the strata contribute to the overall estimate in proportion to how many treated units they contain.

Worked example: age-adjusted mortality rates

The data below (simplified from Cochran 1968) shows death rates per 1,000 person-years for smokers and non-smokers, broken down by age group. The rate columns give the death rate within each age stratum for each group. The count columns show how many people from each group fall into each age stratum—this is where the confounding lives.

Age Group

Rate (Smokers)

Rate (Non-Smokers)

# Smokers

# Non-Smokers

20–40

20

10

65

10

41–70

40

30

25

25

71+

60

50

10

65

Total

100

100

Notice that smokers are much younger (65% under 40) while non-smokers are much older (65% over 70).

Naive comparison — crude rates weighted by each group’s own age distribution:

\[\text{Smokers (crude)} = 20 \times \tfrac{65}{100} + 40 \times \tfrac{25}{100} + 60 \times \tfrac{10}{100} = 29\]
\[\text{Non-smokers (crude)} = 10 \times \tfrac{10}{100} + 30 \times \tfrac{25}{100} + 50 \times \tfrac{65}{100} = 41\]

The naive difference is \(29 - 41 = -12\), suggesting smokers have lower mortality.

With subclassification — within each age stratum, we compare directly:

Age Group

Smokers

Non-Smokers

Difference

20–40

20

10

+10

41–70

40

30

+10

71+

60

50

+10

Within every age group, smokers have a death rate 10 points higher. The naive comparison had the sign wrong because smokers were younger, and younger people die less regardless of smoking status. Subclassification removes this confounding by comparing like with like.

Covariate balance

The goal of subclassification is to achieve covariate balance—making the distribution of confounders similar across treatment and control groups. When the treated and control groups within each stratum have similar covariate values, we say the groups are exchangeable with respect to those covariates.

Balance is the empirical manifestation of the conditional independence assumption. If CIA holds and we have balanced covariates, then any remaining difference in outcomes between treated and control units can be attributed to the treatment effect.

We can check balance by comparing the means (or distributions) of covariates across treatment groups within each stratum. If the means are similar, we have achieved balance on that covariate.

The curse of dimensionality

Subclassification works well when we have one or two discrete covariates. But what happens when we have many covariates, or when covariates are continuous?

This is the curse of dimensionality. As the number of covariates \(K\) grows, the number of strata grows exponentially. With just two binary covariates, we have \(2^2 = 4\) strata. With ten binary covariates, we have \(2^{10} = 1,024\) strata.

The problem is that our sample size doesn’t grow with the number of strata. As strata multiply:

  • Many strata become empty (no observations)

  • Many strata have observations of only one type (all treated or all control)

  • Even strata with both types may have very few observations, leading to imprecise estimates

When a stratum contains only treated units or only control units, we have a common support violation—we cannot estimate the treatment effect in that stratum because we lack a comparison group.

This limitation motivates the methods we discuss next: matching, which provides a more flexible approach to achieving covariate balance.

3. Exact matching

Matching takes a different approach to the identification problem. Instead of stratifying the data and computing within-stratum means, matching directly imputes the missing counterfactual for each treated unit by finding a similar control unit.

The matching idea

Recall the fundamental problem of causal inference: for each treated unit \(i\), we observe \(Y_i^1\) but not \(Y_i^0\). The treatment effect for unit \(i\) is \(\delta_i = Y_i^1 - Y_i^0\), but we can only observe half of this difference.

Matching addresses this by finding a control unit \(j\) with covariate values identical (or very similar) to unit \(i\). Under CIA, this control unit’s outcome \(Y_j\) serves as a valid estimate of unit \(i\)’s counterfactual outcome \(Y_i^0\).

For exact matching, we require that the matched control has exactly the same covariate values:

\[X_{j(i)} = X_i\]

where \(j(i)\) denotes the control unit matched to treated unit \(i\).

The matching estimator

Once we have found matches, the ATT estimator is simply the average difference between each treated unit’s outcome and its match’s outcome:

\[\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - Y_{j(i)} \right)\]

This estimator has an intuitive interpretation: for each treated unit, we estimate its individual treatment effect by comparing its outcome to what a similar untreated unit achieved. We then average these individual effects.

Note that this is an estimator of the ATT, not the ATE, because we are averaging over the treated units only. Each treated unit gets equal weight, and the implicit weighting reflects the distribution of \(X\) among the treated.

Example: job training and earnings

Consider a job training program where we want to estimate the effect on earnings. We have data on 5 trainees and 8 non-trainees, with age as the only covariate.

Trainees:

Unit

Age

Earnings

1

18

$9,500

2

24

$11,000

3

27

$11,750

4

29

$12,250

5

33

$13,250

Mean

26.2

$11,550

Non-Trainees:

Unit

Age

Earnings

1

40

$13,500

2

24

$9,300

3

44

$14,500

4

18

$7,800

5

29

$10,500

6

33

$11,550

7

27

$10,100

8

47

$15,550

Mean

32.8

$11,600

The naive comparison suggests the program has no effect: trainees earn slightly less on average ($11,550 vs $11,600). But notice that trainees are younger on average (26.2 vs 32.8 years). Since earnings typically rise with age, this comparison is confounded—the non-trainee group includes older, higher-earning individuals who pull up the control mean.

Creating the matched sample

For exact matching on age, we search the non-trainee table for a control unit with the same age as each trainee:

Trainee (Unit)

Age

Trainee Earnings

Match (Unit)

Match Earnings

1

18

$9,500

4

$7,800

2

24

$11,000

2

$9,300

3

27

$11,750

7

$10,100

4

29

$12,250

5

$10,500

5

33

$13,250

6

$11,550

After matching, the matched control group has the same age distribution as the trainees. The groups are now balanced on age, and therefore exchangeable with respect to this covariate. Non-trainees 1, 3, and 8 (ages 40, 44, 47) are dropped because no trainee shares their age.

The mean earnings of the matched controls is $9,850, compared to $11,550 for trainees. The matching estimate of the ATT is:

\[\hat{\delta}_{ATT} = \$11,550 - \$9,850 = \$1,700\]

Once we compare trainees to non-trainees of the same age, the program increases earnings by $1,700—an effect that was completely hidden by the naive comparison.

Limitations of exact matching

Exact matching inherits the curse of dimensionality from subclassification. With multiple covariates, finding exact matches becomes increasingly difficult:

  • With continuous covariates (like age measured in days), exact matches may be impossible

  • With many discrete covariates, the probability of finding an exact match drops rapidly

  • Some treated units may have no exact match in the control group

When exact matches cannot be found, we must either:

  1. Drop unmatched treated units (reducing sample size and potentially introducing selection bias)

  2. Use approximate matching, which we discuss next

4. Approximate matching

In practice, exact matches are often unavailable. Approximate matching relaxes the requirement of identical covariate values, instead matching on units that are “close” in some sense.

The need for distance metrics

When we cannot find an exact match, we seek the “nearest” control unit. But what does “nearest” mean when comparing multiple covariates?

With a single covariate, distance is straightforward: the distance between ages 25 and 27 is simply 2 years. But with multiple covariates—say, age and income—we need a way to combine distances across dimensions. Is someone who differs by 2 years in age and $1,000 in income “closer” or “farther” than someone who differs by 5 years in age and $200 in income?

This is where distance metrics come in. A distance metric provides a principled way to measure the overall similarity between two units based on their covariate values.

Distance metrics

The most common distance metrics are:

Euclidean Distance

The standard “straight line” distance in covariate space:

\[||X_i - X_j|| = \sqrt{(X_i - X_j)'(X_i - X_j)} = \sqrt{\sum_{n=1}^{K} (X_{ni} - X_{nj})^2}\]

The problem with Euclidean distance is that it treats all covariates equally. If age is measured in years (range 18-65) and income in dollars (range $0-$500,000), income will dominate the distance calculation simply because of its larger scale.

Normalized Euclidean Distance

To address the scale problem, we can standardize each covariate by its variance:

\[||X_i - X_j|| = \sqrt{\sum_{n=1}^{K} \frac{(X_{ni} - X_{nj})^2}{\hat{\sigma}_n^2}}\]

where \(\hat{\sigma}_n^2\) is the sample variance of covariate \(n\). Now a one-standard-deviation difference in any covariate contributes equally to the distance.

Mahalanobis Distance

The most sophisticated option accounts for correlations between covariates:

\[||X_i - X_j|| = \sqrt{(X_i - X_j)'\hat{\Sigma}_X^{-1}(X_i - X_j)}\]

where \(\hat{\Sigma}_X\) is the sample covariance matrix of the covariates. Mahalanobis distance not only standardizes by variance but also accounts for the correlation structure among covariates.

Comparison of distance metrics

Metric

Formula

Advantages

Disadvantages

Euclidean

\(\sqrt{\sum (X_{ni} - X_{nj})^2}\)

Simple, intuitive

Scale-dependent

Normalized Euclidean

\(\sqrt{\sum \frac{(X_{ni} - X_{nj})^2}{\sigma_n^2}}\)

Scale-invariant

Ignores correlations

Mahalanobis

\(\sqrt{(X_i-X_j)'\Sigma^{-1}(X_i-X_j)}\)

Scale-invariant, accounts for correlations

Requires invertible covariance matrix

Nearest neighbor matching

Once we have chosen a distance metric, nearest neighbor matching proceeds as follows:

  1. For each treated unit \(i\), compute the distance to all control units

  2. Select the control unit \(j(i)\) with the smallest distance to unit \(i\)

  3. Use \(Y_{j(i)}\) as the imputed counterfactual for unit \(i\)

The ATT estimator remains:

\[\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - Y_{j(i)} \right)\]

Matching with multiple neighbors: Sometimes we may want to use more than one control unit as a match. If we find \(M\) nearest neighbors for each treated unit, we average their outcomes:

\[\hat{\delta}_{ATT} = \frac{1}{N_T} \sum_{D_i=1} \left( Y_i - \frac{1}{M} \sum_{m=1}^{M} Y_{j_m(i)} \right)\]

Using multiple matches reduces variance (more data points) but may increase bias (matches are farther away on average).

Matching discrepancies and bias

With approximate matching, the matched control unit typically does not have exactly the same covariate values as the treated unit: \(X_{j(i)} \neq X_i\). This difference is called the matching discrepancy.

Matching discrepancies introduce bias because the control unit’s outcome reflects not just what the treated unit would have experienced under control, but also any systematic differences associated with having different covariate values.

The key insight is that matching discrepancies tend to shrink as the sample size grows—with more potential controls, we can typically find closer matches. In the limit of an infinite control pool, exact matching becomes feasible and the bias disappears.

In finite samples, researchers should:

  • Examine the quality of matches by comparing covariate distributions before and after matching

  • Use covariate balance diagnostics to assess whether matching has successfully balanced the groups

  • Consider whether remaining imbalances could materially affect the estimated treatment effect

The practical takeaway is that approximate matching is a valuable tool, but researchers should be transparent about match quality and the potential for residual bias.


Part II: Application

In Part I we developed the theory of subclassification and matching: the conditional independence assumption identifies causal effects when treatment depends only on observed covariates, subclassification conditions by stratifying on those covariates, and matching imputes counterfactuals from similar control units.

We now apply these methods to simulated data from the Online Retail Simulator. The simulator provides both potential outcomes—an omniscient view that we would not have in real data—enabling numerical verification of the theoretical results from Part I. We know exactly how treatment was assigned. That knowledge — the assignment mechanism — is what makes causal inference possible. Understanding the assignment mechanism tells us what confounding looks like, why naive comparison fails, and which methods can fix it. We use production-grade implementations from the Impact Engine to apply both methods.

[1]:
# Standard library
import inspect

# Third-party packages
from impact_engine_measure import evaluate_impact, load_results
from IPython.display import Code
from online_retail_simulator import simulate, load_job_results
import pandas as pd

# Local imports
from support import (
    compute_ground_truth_att,
    create_confounded_treatment_multi,
    plot_balance_love_plot,
    plot_covariate_imbalance,
    plot_method_comparison,
    plot_strata_convergence,
    plot_treatment_rates,
    sweep_strata,
)

1. Business context

We return to the content optimization scenario from the previous lecture. An e-commerce company optimizes product listings to boost sales. In the DAGs lecture, confounding came from a single binary variable (product quality High/Low). Here we face a more realistic setting where the company’s selection of which products to optimize depends on three continuous product characteristicsquality_score, price, and impressions.

Variable

Notation

Description

Treatment

\(D\)

Content optimization (1 = optimized, 0 = not)

Outcome

\(Y\)

Product revenue

Covariates

\(X_1, X_2, X_3\)

quality_score, price, impressions

Potential outcomes

\(Y^0, Y^1\)

Revenue without / with optimization

2. Data generation

We simulate 5,000 products with a single day of sales data. The simulation produces baseline metrics — including quality_score, price, and impressions — that will serve as the covariates driving the assignment mechanism.

[2]:
! cat "config_simulation.yaml"
STORAGE:
  PATH: output

RULE:
  PRODUCTS:
    FUNCTION: simulate_products_rule_based
    PARAMS:
      num_products: 5000
      seed: 42

  METRICS:
    FUNCTION: simulate_metrics_rule_based
    PARAMS:
      date_start: "2024-11-15"
      date_end: "2024-11-15"
      sale_prob: 0.7
      seed: 42

PRODUCT_DETAILS:
  FUNCTION: simulate_product_details_mock
[3]:
# Run simulation and load results
job_info = simulate("config_simulation.yaml")
metrics = load_job_results(job_info)["metrics"]

print(f"Metrics records: {len(metrics)}")
print(f"Unique products: {metrics['product_identifier'].nunique()}")
Metrics records: 5000
Unique products: 5000

3. The assignment mechanism

The optimization team did not randomly select products for content optimization. Instead, they prioritized struggling products — those with low quality scores, low prices, and few impressions. This is a sensible business strategy: invest resources where the need is greatest.

But this strategic selection has a consequence for causal inference. The treatment and control groups differ systematically before any treatment is applied. Products selected for optimization already had lower baseline revenue — not because optimization hurts them, but because the company targeted products that were struggling on multiple dimensions.

Operationalizing selection with a logistic model

To formalize this selection process, we need a function that maps product characteristics to a treatment probability. The probability must lie in \([0, 1]\), which rules out a simple linear model. Instead, we use the logistic (sigmoid) function:

\[\Pr(D = 1 \mid X) = \frac{1}{1 + e^{-z}}, \quad \text{where} \quad z = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K\]

The logistic function has three key properties: (1) its output is always in \([0, 1]\), (2) when \(z = 0\) the probability is exactly \(0.5\), and (3) the sign of each coefficient \(\beta_k\) determines the direction of its effect — negative coefficients mean that lower covariate values increase treatment probability.

Selection coefficients

In our setting, the covariates are standardized (zero mean, unit variance) so that each contributes on a common scale. The linear index is:

\[z = -0.5 \cdot \text{quality\_score}_z \;-\; 0.8 \cdot \text{price}_z \;-\; 0.3 \cdot \text{impressions}_z\]

All three coefficients are negative, meaning lower covariate values increase treatment probability. The table below maps each coefficient to its business interpretation:

Covariate

Coefficient

Direction

Business Interpretation

quality_score

\(-0.5\)

Lower quality \(\rightarrow\) higher \(\Pr(\text{treat})\)

Struggling products prioritized

price

\(-0.8\)

Lower price \(\rightarrow\) higher \(\Pr(\text{treat})\)

Budget products prioritized most strongly

impressions

\(-0.3\)

Fewer impressions \(\rightarrow\) higher \(\Pr(\text{treat})\)

Low-visibility products slightly prioritized

From assignment mechanism to identifying assumption

Because this selection process uses only the three observable covariates quality_score, price, and impressions, the conditional independence assumption holds when we condition on them:

\[(Y^1, Y^0) \perp\!\!\!\perp D \mid X, \quad X = (\text{quality\_score},\; \text{price},\; \text{impressions})\]

There are no unobserved factors driving selection. This is what justifies the conditioning-based methods — subclassification and matching — that we apply below. Both methods work by comparing treated and control products with similar values of \(X\), which is exactly the set of variables that the assignment mechanism operates through.

[4]:
Code(inspect.getsource(create_confounded_treatment_multi), language="python")
[4]:
def create_confounded_treatment_multi(
    metrics_df,
    true_effect=0.5,
    seed=42,
    coef_quality=-0.5,
    coef_price=-0.8,
    coef_impressions=-0.3,
):
    """
    Create confounded treatment assignment based on multiple continuous covariates.

    Aggregates per-product revenue from metrics and uses a logistic model
    on quality_score, price, and impressions to determine treatment
    probability. Products with lower covariate values are more likely
    to receive content optimization, creating confounding on all three
    dimensions simultaneously.

    Parameters
    ----------
    metrics_df : pandas.DataFrame
        Metrics DataFrame with product_identifier, revenue, quality_score,
        price, and impressions columns.
    true_effect : float, optional
        True proportional causal effect of treatment on revenue.
    seed : int, optional
        Random seed for reproducibility.
    coef_quality : float, optional
        Logistic model coefficient for quality_score. Negative values mean
        lower quality increases treatment probability.
    coef_price : float, optional
        Logistic model coefficient for price. Negative values mean lower
        price increases treatment probability.
    coef_impressions : float, optional
        Logistic model coefficient for impressions. Negative values mean
        fewer impressions increases treatment probability.

    Returns
    -------
    pandas.DataFrame
        DataFrame with product_identifier, D, Y_observed, Y0, Y1,
        quality_score, price, and impressions columns.
    """
    rng = np.random.default_rng(seed)

    # Aggregate revenue per product (single day, but pattern supports multiple)
    product_revenue = metrics_df.groupby("product_identifier")["revenue"].sum().reset_index()
    product_revenue.columns = ["product_identifier", "baseline_revenue"]

    # Merge continuous covariates from the metrics (one row per product)
    covariates = (
        metrics_df.groupby("product_identifier")[["quality_score", "price", "impressions"]].first().reset_index()
    )
    df = product_revenue.merge(covariates, on="product_identifier")

    # Standardize covariates so each contributes on a common scale
    qs_z = (df["quality_score"] - df["quality_score"].mean()) / df["quality_score"].std()
    price_z = (df["price"] - df["price"].mean()) / df["price"].std()
    imp_z = (df["impressions"] - df["impressions"].mean()) / df["impressions"].std()

    # Logistic model: covariates → treatment probability via sigmoid
    logit = coef_quality * qs_z + coef_price * price_z + coef_impressions * imp_z
    treatment_prob = 1 / (1 + np.exp(-logit))

    df["D"] = (rng.random(len(df)) < treatment_prob).astype(int)

    # Potential outcomes
    df["Y0"] = df["baseline_revenue"]
    df["Y1"] = df["baseline_revenue"] * (1 + true_effect)

    # Observed outcome via switching equation
    df["Y_observed"] = np.where(df["D"] == 1, df["Y1"], df["Y0"])

    return df[["product_identifier", "D", "Y_observed", "Y0", "Y1", "quality_score", "price", "impressions"]]
[5]:
# Apply the assignment mechanism to generate treatment and potential outcomes
TRUE_EFFECT = 0.5  # 50% revenue boost from content optimization

confounded_products = create_confounded_treatment_multi(
    metrics,
    true_effect=TRUE_EFFECT,
    coef_quality=-0.5,
    coef_price=-0.8,
    coef_impressions=-0.3,
    seed=42,
)

# Save for the Impact Engine pipeline
confounded_products.to_csv("confounded_products.csv", index=False)

print(f"Products: {len(confounded_products)}")
print(f"Treated: {confounded_products['D'].sum()} ({confounded_products['D'].mean():.1%})")
print(f"Control: {(1 - confounded_products['D']).sum():.0f} ({1 - confounded_products['D'].mean():.1%})")
Products: 5000
Treated: 2545 (50.9%)
Control: 2455 (49.1%)
[6]:
plot_treatment_rates(confounded_products, ["quality_score", "price", "impressions"])
../../_images/measure-impact_03-matching-subclassification_lecture_30_0.png

4. What does the naive comparison tell us?

A naive analyst — one who does not know (or ignores) the assignment mechanism — would simply compare mean outcomes across groups, treating the treatment/control split as if it were random.

[7]:
# Naive comparison
treated = confounded_products[confounded_products["D"] == 1]
control = confounded_products[confounded_products["D"] == 0]
naive_estimate = treated["Y_observed"].mean() - control["Y_observed"].mean()

true_att = compute_ground_truth_att(confounded_products)

print("Naive Comparison: E[Y|D=1] - E[Y|D=0]")
print("=" * 50)
print(f"Mean revenue (treated):   ${treated['Y_observed'].mean():,.2f}")
print(f"Mean revenue (control):   ${control['Y_observed'].mean():,.2f}")
print(f"Naive estimate:           ${naive_estimate:,.2f}")
print(f"\nTrue ATT:                 ${true_att:,.2f}")
print(f"Bias:                     ${naive_estimate - true_att:,.2f}")
Naive Comparison: E[Y|D=1] - E[Y|D=0]
==================================================
Mean revenue (treated):   $26.51
Mean revenue (control):   $78.43
Naive estimate:           $-51.93

True ATT:                 $8.84
Bias:                     $-60.76

Why does the naive estimate fail?

The naive estimate is biased because the assignment mechanism selected treated products based on their covariate values. The company chose products with lower quality scores, lower prices, and fewer impressions — and since all three covariates positively predict baseline revenue, the treated group has systematically lower revenue even in the absence of any treatment effect.

This is negative selection bias: the treated group’s lower baseline pulls down their mean outcome, masking (or reversing) the positive treatment effect. In the language of Part I, the CIA tells us that conditioning on \(X\) would remove this bias. The histograms below show why conditioning is needed — the covariate distributions are shifted between groups, exactly as the assignment mechanism’s negative coefficients predict.

[8]:
plot_covariate_imbalance(confounded_products, ["quality_score", "price", "impressions"])
../../_images/measure-impact_03-matching-subclassification_lecture_34_0.png

5. Subclassification with the Impact Engine

The assignment mechanism operates through three continuous covariates. The CIA tells us: condition on these covariates and confounding disappears. Subclassification achieves this by discretizing each covariate into quantile-based bins, creating strata where products within the same bin have similar characteristics. Within each stratum, the assignment mechanism’s influence is approximately removed — products are roughly comparable regardless of treatment status.

With three continuous covariates, the curse of dimensionality from Part I applies directly: with n_strata set to 4, each covariate is split into 4 quantile-based bins, creating \(4^3 = 64\) potential strata — many of which may lack common support. The SubclassificationAdapter from the Impact Engine handles this by dropping strata without both treated and control units.

Configuration-to-theory mapping

The MEASUREMENT.PARAMS fields in the YAML configuration map directly to the components of our causal model. The covariate_columns are the three variables that drive the assignment mechanism — conditioning on them is what makes CIA hold.

YAML Config Field

Part I Concept

treatment_column

Binary treatment indicator \(D\)

covariate_columns

Conditioning set \(X\) from CIA: \((Y^1, Y^0) \perp\!\!\!\perp D \mid X\)

dependent_variable

Observed outcome \(Y\)

n_strata

Number of strata \(K\) in the subclassification estimator

estimand: "att"

ATT weighting: \(\frac{N_T^k}{N_T}\) per stratum

[9]:
! cat "config_subclassification.yaml"
DATA:
  SOURCE:
    type: file
    CONFIG:
      path: confounded_products.csv
  TRANSFORM:
    FUNCTION: passthrough
    PARAMS: {}

MEASUREMENT:
  MODEL: subclassification
  PARAMS:
    treatment_column: D
    covariate_columns:
      - quality_score
      - price
      - impressions
    dependent_variable: Y_observed
    n_strata: 4
    estimand: att
[10]:
# Run the subclassification pipeline
sub_job = evaluate_impact("config_subclassification.yaml", storage_url="./output/subclassification")
sub_result = load_results(sub_job)

sub_estimate = sub_result.impact_results["data"]["impact_estimates"]["treatment_effect"]
n_strata = sub_result.impact_results["data"]["impact_estimates"]["n_strata"]
n_dropped = sub_result.impact_results["data"]["impact_estimates"]["n_strata_dropped"]

print("Subclassification Results")
print("=" * 50)
print(f"Estimated ATT:   ${sub_estimate:,.2f}")
print(f"True ATT:        ${true_att:,.2f}")
print(f"Bias:            ${sub_estimate - true_att:,.2f}")
print(f"\nStrata used: {n_strata}  |  Strata dropped (no common support): {n_dropped}")
Covariate 'impressions': requested 4 bins but got 3 due to duplicate quantile edges.
Subclassification Results
==================================================
Estimated ATT:   $-2.58
True ATT:        $8.84
Bias:            $-11.42

Strata used: 48  |  Strata dropped (no common support): 0
[11]:
# Examine stratum-level details
sub_result.model_artifacts["stratum_details"]
[11]:
stratum n_treated n_control mean_treated mean_control effect
0 0_0_0 131 30 0.000000 0.000000 0.000000
1 0_0_1 57 11 0.000000 0.000000 0.000000
2 0_0_2 42 22 26.694643 29.850909 -3.156266
3 0_1_0 148 39 0.000000 0.000000 0.000000
4 0_1_1 73 20 0.000000 0.000000 0.000000
5 0_1_2 49 44 56.351020 60.221591 -3.870571
6 0_2_0 119 42 0.000000 0.000000 0.000000
7 0_2_1 54 15 0.000000 0.000000 0.000000
8 0_2_2 41 22 206.969634 119.610000 87.359634
9 0_3_0 54 95 0.000000 0.000000 0.000000
10 0_3_1 23 51 0.000000 0.000000 0.000000
11 0_3_2 14 61 1206.690000 784.488033 422.201967
12 1_0_0 86 43 0.000000 0.000000 0.000000
13 1_0_1 39 27 0.000000 0.000000 0.000000
14 1_0_2 30 14 33.486500 25.213571 8.272929
15 1_1_0 92 52 0.000000 0.000000 0.000000
16 1_1_1 37 22 0.000000 0.000000 0.000000
17 1_1_2 40 25 58.108125 82.693600 -24.585475
18 1_2_0 118 71 0.000000 0.000000 0.000000
19 1_2_1 40 30 0.000000 0.000000 0.000000
20 1_2_2 40 53 152.314125 123.422264 28.891861
21 1_3_0 59 134 0.000000 0.000000 0.000000
22 1_3_1 28 72 0.000000 0.000000 0.000000
23 1_3_2 25 77 434.347800 937.421948 -503.074148
24 2_0_0 110 71 0.000000 0.000000 0.000000
25 2_0_1 48 38 0.000000 0.000000 0.000000
26 2_0_2 37 53 36.891486 21.373585 15.517902
27 2_1_0 104 79 0.000000 0.000000 0.000000
28 2_1_1 47 35 0.000000 0.000000 0.000000
29 2_1_2 35 44 45.437571 63.328864 -17.891292
30 2_2_0 77 63 0.000000 0.000000 0.000000
31 2_2_1 34 42 0.000000 0.000000 0.000000
32 2_2_2 27 37 71.136111 126.768649 -55.632538
33 2_3_0 67 91 0.000000 0.000000 0.000000
34 2_3_1 24 50 0.000000 0.000000 0.000000
35 2_3_2 9 45 407.356667 434.927333 -27.570667
36 3_0_0 85 89 0.000000 0.000000 0.000000
37 3_0_1 53 39 0.000000 0.000000 0.000000
38 3_0_2 32 64 24.552188 19.254844 5.297344
39 3_1_0 61 64 0.000000 0.000000 0.000000
40 3_1_1 25 42 0.000000 0.000000 0.000000
41 3_1_2 25 47 53.860200 39.481277 14.378923
42 3_2_0 71 95 0.000000 0.000000 0.000000
43 3_2_1 36 52 0.000000 0.000000 0.000000
44 3_2_2 26 45 165.887308 153.730667 12.156641
45 3_3_0 41 90 0.000000 0.000000 0.000000
46 3_3_1 18 59 0.000000 0.000000 0.000000
47 3_3_2 14 49 209.169643 396.562653 -187.393010

How many strata? Bias vs. the curse of dimensionality

The choice of n_strata controls the bias-variance tradeoff discussed in Part I. With few strata per covariate, each bin is wide — treated and control products within the same stratum may still differ substantially, leaving residual confounding. Increasing the number of strata narrows the bins and reduces this within-stratum bias.

But with three covariates, the total number of strata grows as \(K^3\). As \(K\) increases, many strata end up with only treated or only control products — a common support violation. These strata must be dropped, meaning the estimate is based on an increasingly selective subset of the data. The figure below sweeps across different values of \(K\) to visualize both effects simultaneously.

[12]:
# Sweep across strata counts to show convergence and common support loss
covariates = ["quality_score", "price", "impressions"]
convergence_df = sweep_strata(confounded_products, [2, 3, 4, 5, 7, 10, 15, 20], covariates)
plot_strata_convergence(convergence_df, true_att)
../../_images/measure-impact_03-matching-subclassification_lecture_41_0.png

The top panel shows the estimated ATT converging toward the true effect as \(K\) increases — finer bins reduce within-stratum confounding, just as the theory predicts. But the bottom panel reveals the cost: the fraction of strata dropped for lack of common support rises sharply. With three covariates, the total strata grow as \(K^3\), and the fixed sample of 5,000 products cannot populate them all. This is the curse of dimensionality in action — more strata improve bias but eventually exhaust the data.

6. Nearest neighbor matching with the Impact Engine

Subclassification discretizes the covariate space into bins, which can be lossy — two products in the same bin may still differ substantially on the covariates that drove treatment selection. Matching takes a different approach to the same problem.

Instead of binning, matching directly implements the CIA logic: for each treated product, find a control product with the most similar covariate profile. If the match is close enough, the assignment mechanism’s influence is effectively neutralized — we are comparing two products that would have had nearly the same treatment probability. The NearestNeighbourMatchingAdapter wraps causalml’s NearestNeighborMatch, using a distance metric on the covariates to find closest matches. Because the three covariates have different scales and different coefficients in the assignment mechanism, the distance metric must account for scale differences.

Configuration-to-theory mapping

The covariate_columns define the dimensions of the distance metric — these must include all variables that drive the assignment mechanism, so that conditional on a close match, treatment is effectively random.

YAML Config Field

Part I Concept

treatment_column

Binary treatment indicator \(D\)

covariate_columns

Dimensions of the distance metric: \(\|X_i - X_j\|\)

dependent_variable

Observed outcome \(Y\)

caliper

Maximum matching discrepancy threshold (Section 4)

replace

Matching with replacement — bias-variance tradeoff

ratio

\(M\) nearest neighbors: \(\frac{1}{M}\sum_{m=1}^{M} Y_{j_m(i)}\)

[13]:
! cat "config_matching.yaml"
DATA:
  SOURCE:
    type: file
    CONFIG:
      path: confounded_products.csv
  TRANSFORM:
    FUNCTION: passthrough
    PARAMS: {}

MEASUREMENT:
  MODEL: nearest_neighbour_matching
  PARAMS:
    treatment_column: D
    covariate_columns:
      - quality_score
      - price
      - impressions
    dependent_variable: Y_observed
    caliper: 0.2
    replace: true
    ratio: 1
    random_state: 42
[14]:
# Run the nearest-neighbour matching pipeline
nn_job = evaluate_impact("config_matching.yaml", storage_url="./output/matching")
nn_result = load_results(nn_job)

nn_att = nn_result.impact_results["data"]["impact_estimates"]["att"]
nn_atc = nn_result.impact_results["data"]["impact_estimates"]["atc"]
nn_ate = nn_result.impact_results["data"]["impact_estimates"]["ate"]
nn_att_se = nn_result.impact_results["data"]["impact_estimates"]["att_se"]

print("Nearest Neighbor Matching Results")
print("=" * 50)
print(f"ATT:   ${nn_att:,.2f}  (SE: ${nn_att_se:,.2f})")
print(f"ATC:   ${nn_atc:,.2f}")
print(f"ATE:   ${nn_ate:,.2f}")
print(f"\nTrue ATT: ${true_att:,.2f}")
print(f"Bias:     ${nn_att - true_att:,.2f}")
Nearest Neighbor Matching Results
==================================================
ATT:   $10.45  (SE: $4.16)
ATC:   $20.68
ATE:   $15.47

True ATT: $8.84
Bias:     $1.61

7. Which method best recovers the true effect?

Because we built the assignment mechanism ourselves, we know the true ATT. This lets us verify that conditioning on the covariates that drive treatment selection — via either subclassification or matching — recovers the true causal effect. The naive estimate’s failure confirms that the assignment mechanism creates real confounding; the methods’ success confirms that conditioning on the right variables removes it.

[15]:
plot_method_comparison(
    {"Naive": naive_estimate, "Subclassification": sub_estimate, "NN Matching": nn_att},
    true_att,
)
../../_images/measure-impact_03-matching-subclassification_lecture_48_0.png
[16]:
# Summary statistics table
summary = pd.DataFrame(
    {
        "Method": ["Naive", "Subclassification", "NN Matching"],
        "Estimate ($)": [naive_estimate, sub_estimate, nn_att],
        "Error ($)": [naive_estimate - true_att, sub_estimate - true_att, nn_att - true_att],
        "% Error": [
            (naive_estimate - true_att) / true_att * 100,
            (sub_estimate - true_att) / true_att * 100,
            (nn_att - true_att) / true_att * 100,
        ],
    }
)
summary["Estimate ($)"] = summary["Estimate ($)"].map(lambda x: f"${x:,.2f}")
summary["Error ($)"] = summary["Error ($)"].map(lambda x: f"${x:,.2f}")
summary["% Error"] = summary["% Error"].map(lambda x: f"{x:+.1f}%")

print(f"True ATT: ${true_att:,.2f}")
print()
summary
True ATT: $8.84

[16]:
Method Estimate ($) Error ($) % Error
0 Naive $-51.93 $-60.76 -687.7%
1 Subclassification $-2.58 $-11.42 -129.2%
2 NN Matching $10.45 $1.61 +18.3%

8. Diagnostics & extensions

A key advantage of matching is that we can directly assess whether the method has achieved its goal — removing the assignment mechanism’s influence on covariate distributions. The diagnostics below evaluate covariate balance before and after nearest neighbor matching.

Covariate balance diagnostics

If matching successfully removes the assignment mechanism’s influence, the covariate distributions of matched treated and control groups should be indistinguishable. The standardized mean difference (SMD) measures residual imbalance: the difference in covariate means between groups, normalized by the pooled standard deviation. An SMD below 0.1 suggests that the assignment mechanism’s effect on that covariate has been effectively neutralized.

[17]:
# Print balance tables
balance_before = nn_result.model_artifacts["balance_before"]
balance_after = nn_result.model_artifacts["balance_after"]

print("Balance BEFORE matching:")
print(balance_before)
print("\nBalance AFTER matching:")
print(balance_after)
Balance BEFORE matching:
           Control        Treatment      SMD
0             2455             2545
1    38.29 (51.93)    24.43 (33.24)  -0.3178
2  263.80 (364.50)  111.08 (163.04)  -0.5409
3      0.86 (0.04)      0.84 (0.05)  -0.3717

Balance AFTER matching:
           Control        Treatment      SMD
0             2538             2538
1    24.46 (33.22)    24.43 (33.24)  -0.0009
2  109.34 (162.80)  110.22 (162.45)   0.0055
3      0.84 (0.05)      0.84 (0.05)  -0.0019
[18]:
plot_balance_love_plot(balance_before, balance_after)
../../_images/measure-impact_03-matching-subclassification_lecture_53_0.png

Additional resources