Intro to causal forest¶
This notebook aims to explain how causal forest works and some key differences compared to the well-known algorithm random forest.
I will demo the causal forest APIs from EconML using a toy dataset.
Data generation¶
Some synthetic data (100,000 samples and 10 covariates) are generated by following a data generation process that is also presented in a demo notebook in EconML.
Process
- $x_i = normal(mean=0, std=1)$
- $Y\_x_i = x_i * uniform(low=-3, high=3) + normal(mean=0, std=1)$
- $TE = \begin{cases} 8, & \text{if } x_1 > 0.1 \\ 0, & \text{else} \end{cases}$
- $propensity = \begin{cases} 0.8, & \text{if } -0.5 < x_2 <0.5 \\ 0.2, & \text{else} \end{cases}$
- $T = binomaial(prop=propensity, trials=1)$
- $Y = TE * T + Y\_x_i$
from IPython.display import Image
Image('diagrams/causal_forest/causal_forest.md.1.png')
import numpy as np
import pandas as pd
num_covariates = 10
num_samples = 100000
np.random.seed(42)
df = pd.DataFrame()
for i in range(num_covariates):
df[f'x{i}'] = np.random.randn(num_samples)
df[f'y_x{i}'] = df[f'x{i}'] * np.random.uniform(low=-3, high=3) + np.random.randn(1)
df['TE'] = df['x1'].apply(lambda x: 8 if x>0.1 else 0)
df['propensity'] = df['x2'].apply(lambda x: 0.8 if -0.5<x<0.5 else 0.2)
df['T'] = df['propensity'].apply(lambda x: np.random.binomial(n=1, p=x))
df['Y'] = df['TE'] * df['T'] + df[[col for col in df.columns if 'y_x' in col]].sum(axis=1)
df[[col for col in df.columns if 'y_x' not in col]].describe().transpose().drop(columns='count').drop('propensity')
| mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|
| x0 | 0.000967 | 1.000906 | -4.465604 | -0.674494 | 0.002650 | 0.676915 | 4.479084 |
| x1 | 0.003090 | 1.000497 | -4.080833 | -0.667894 | 0.000411 | 0.672561 | 4.694473 |
| x2 | -0.001496 | 0.999855 | -4.413886 | -0.674197 | -0.002710 | 0.672063 | 4.219366 |
| x3 | 0.001442 | 1.002386 | -4.301410 | -0.675981 | 0.005557 | 0.676334 | 4.293276 |
| x4 | -0.006266 | 1.001158 | -4.829436 | -0.684352 | -0.008103 | 0.672384 | 4.301848 |
| x5 | -0.001142 | 1.000885 | -4.111276 | -0.679754 | -0.004396 | 0.674232 | 4.295946 |
| x6 | 0.003354 | 0.998648 | -4.319465 | -0.672365 | 0.008510 | 0.679251 | 4.526784 |
| x7 | -0.001657 | 0.998657 | -4.386303 | -0.676386 | 0.001485 | 0.673570 | 4.329187 |
| x8 | -0.003169 | 1.000302 | -3.928834 | -0.678930 | -0.005792 | 0.673594 | 3.980280 |
| x9 | -0.002374 | 0.999370 | -4.717265 | -0.676115 | -0.006132 | 0.673234 | 4.276593 |
| TE | 3.677760 | 3.987019 | 0.000000 | 0.000000 | 0.000000 | 8.000000 | 8.000000 |
| T | 0.430000 | 0.495078 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| Y | -0.554922 | 6.972513 | -27.887161 | -5.485575 | -1.094181 | 4.055324 | 29.675579 |
import plotly.express as px
from IPython.display import Image
px.box(df, x='Y', color='T', width=1200, height=400, title='The distribution of Y split by T').write_image('diagrams/causal_forest/Y_dist.png')
Image('diagrams/causal_forest/Y_dist.png')
Causal forest¶
Causal forest is a method from generalised random forest, it can be used to estimate heterogeneous treatment effects. It is composed of causal trees, each of which is very similar to the decision tree that we are all familiar with. However, there are several subtle differences, including splitting criteria, evaluation methods, etc. But why can't we use a decision tree to solve uplift modeling problems? The key issue for the decision tree is that it is not designed to estimate treatment effects, and it introduces generalization bias. Besides, the ground truth is never observed, making it challenging to directly use a decision tree to estimate treatment effects.
In this section, I will walk through some of the key features of causal tree to illustrate the differences between it and the decision tree.
Split criteria¶
Honesty¶
Way of estimation¶
Example¶
from sklearn.model_selection import train_test_split
from econml.grf import CausalForest
t = df['T'].to_numpy()
X = df[[f'x{i}' for i in range(10)]].to_numpy()
y = df['Y'].to_numpy()
TE = df['TE'].to_numpy()
X_train, X_test, y_train, y_test, t_train, t_test, TE_train, TE_test = train_test_split(X, y, t, TE, stratify=t, random_state=42)
cf = CausalForest(random_state=42)
cf.fit(X=X_train, y=y_train, T=t_train)
CausalForest(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
CausalForest(random_state=42)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
df_test = pd.DataFrame(X_test, columns=[f'x{i}' for i in range(10)])
df_test['TE_test'] = TE_test
df_test['TE_cf'] = cf.predict(X_test)
fig = make_subplots(rows=1, cols=1, x_title='x1', y_title='TE', column_titles=['TE_cf'])
for idx, col in zip([1], ['TE_cf']):
fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_cf'], mode='markers', name=col, marker=dict(color='#636efa', size=5)), row=1, col=idx)
fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_test'], mode='markers', name='TE_test', marker=dict(color='red', size=3)), row=1, col=idx)
fig.update_layout(width=800, height=500, showlegend=False, title='Blue scatters are predicted TE, red scatters are the ground truth').write_image('diagrams/causal_forest/comparison.png')
Image('diagrams/causal_forest/comparison.png')