Intro to causal forest¶

This notebook aims to explain how causal forest works and some key differences compared to the well-known algorithm random forest.

I will demo the causal forest APIs from EconML using a toy dataset.

Data generation¶

Some synthetic data (100,000 samples and 10 covariates) are generated by following a data generation process that is also presented in a demo notebook in EconML.

Process

$x_i = normal(mean=0, std=1)$
$Y\_x_i = x_i * uniform(low=-3, high=3) + normal(mean=0, std=1)$
$TE = \begin{cases} 8, & \text{if } x_1 > 0.1 \\ 0, & \text{else} \end{cases}$
$propensity = \begin{cases} 0.8, & \text{if } -0.5 < x_2 <0.5 \\ 0.2, & \text{else} \end{cases}$
$T = binomaial(prop=propensity, trials=1)$
$Y = TE * T + Y\_x_i$

In [29]:

Copied!

from IPython.display import Image
Image('diagrams/causal_forest/causal_forest.md.1.png')
from IPython.display import Image
Image('diagrams/causal_forest/causal_forest.md.1.png')

Out[29]:

No description has been provided for this image

In [2]:

Copied!





import numpy as np
import pandas as pd

num_covariates = 10
num_samples = 100000
np.random.seed(42)

df = pd.DataFrame()
for i in range(num_covariates):
  df[f'x{i}'] = np.random.randn(num_samples)  
  df[f'y_x{i}'] = df[f'x{i}'] * np.random.uniform(low=-3, high=3) + np.random.randn(1)

df['TE'] = df['x1'].apply(lambda x: 8 if x>0.1 else 0)
df['propensity'] = df['x2'].apply(lambda x: 0.8 if -0.5<x<0.5 else 0.2)
df['T'] = df['propensity'].apply(lambda x: np.random.binomial(n=1, p=x))
df['Y'] = df['TE'] * df['T'] + df[[col for col in df.columns if 'y_x' in col]].sum(axis=1)

df[[col for col in df.columns if 'y_x' not in col]].describe().transpose().drop(columns='count').drop('propensity')
import numpy as np
import pandas as pd

num_covariates = 10
num_samples = 100000
np.random.seed(42)

df = pd.DataFrame()
for i in range(num_covariates):
  df[f'x{i}'] = np.random.randn(num_samples)  
  df[f'y_x{i}'] = df[f'x{i}'] * np.random.uniform(low=-3, high=3) + np.random.randn(1)

df['TE'] = df['x1'].apply(lambda x: 8 if x>0.1 else 0)
df['propensity'] = df['x2'].apply(lambda x: 0.8 if -0.5

Out[2]:

	mean	std	min	25%	50%	75%	max
x0	0.000967	1.000906	-4.465604	-0.674494	0.002650	0.676915	4.479084
x1	0.003090	1.000497	-4.080833	-0.667894	0.000411	0.672561	4.694473
x2	-0.001496	0.999855	-4.413886	-0.674197	-0.002710	0.672063	4.219366
x3	0.001442	1.002386	-4.301410	-0.675981	0.005557	0.676334	4.293276
x4	-0.006266	1.001158	-4.829436	-0.684352	-0.008103	0.672384	4.301848
x5	-0.001142	1.000885	-4.111276	-0.679754	-0.004396	0.674232	4.295946
x6	0.003354	0.998648	-4.319465	-0.672365	0.008510	0.679251	4.526784
x7	-0.001657	0.998657	-4.386303	-0.676386	0.001485	0.673570	4.329187
x8	-0.003169	1.000302	-3.928834	-0.678930	-0.005792	0.673594	3.980280
x9	-0.002374	0.999370	-4.717265	-0.676115	-0.006132	0.673234	4.276593
TE	3.677760	3.987019	0.000000	0.000000	0.000000	8.000000	8.000000
T	0.430000	0.495078	0.000000	0.000000	0.000000	1.000000	1.000000
Y	-0.554922	6.972513	-27.887161	-5.485575	-1.094181	4.055324	29.675579

In [69]:

Copied!

import plotly.express as px
from IPython.display import Image

px.box(df, x='Y', color='T', width=1200, height=400, title='The distribution of Y split by T').write_image('diagrams/causal_forest/Y_dist.png')
Image('diagrams/causal_forest/Y_dist.png')
import plotly.express as px
from IPython.display import Image

px.box(df, x='Y', color='T', width=1200, height=400, title='The distribution of Y split by T').write_image('diagrams/causal_forest/Y_dist.png')
Image('diagrams/causal_forest/Y_dist.png')

Out[69]:

Causal forest¶

Causal forest is a method from generalised random forest, it can be used to estimate heterogeneous treatment effects. It is composed of causal trees, each of which is very similar to the decision tree that we are all familiar with. However, there are several subtle differences, including splitting criteria, evaluation methods, etc. But why can't we use a decision tree to solve uplift modeling problems? The key issue for the decision tree is that it is not designed to estimate treatment effects, and it introduces generalization bias. Besides, the ground truth is never observed, making it challenging to directly use a decision tree to estimate treatment effects.

In this section, I will walk through some of the key features of causal tree to illustrate the differences between it and the decision tree.

Split criteria¶

Honesty¶

Way of estimation¶

Example¶

In [10]:

Copied!





from sklearn.model_selection import train_test_split
from econml.grf import CausalForest

t = df['T'].to_numpy()
X = df[[f'x{i}' for i in range(10)]].to_numpy()
y = df['Y'].to_numpy()
TE = df['TE'].to_numpy()

X_train, X_test, y_train, y_test, t_train, t_test, TE_train, TE_test = train_test_split(X, y, t, TE, stratify=t, random_state=42)

cf = CausalForest(random_state=42)
cf.fit(X=X_train, y=y_train, T=t_train)
from sklearn.model_selection import train_test_split
from econml.grf import CausalForest

t = df['T'].to_numpy()
X = df[[f'x{i}' for i in range(10)]].to_numpy()
y = df['Y'].to_numpy()
TE = df['TE'].to_numpy()

X_train, X_test, y_train, y_test, t_train, t_test, TE_train, TE_test = train_test_split(X, y, t, TE, stratify=t, random_state=42)

cf = CausalForest(random_state=42)
cf.fit(X=X_train, y=y_train, T=t_train)

Out[10]:

CausalForest(random_state=42)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [30]:

Copied!





import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_test = pd.DataFrame(X_test, columns=[f'x{i}' for i in range(10)])
df_test['TE_test'] = TE_test
df_test['TE_cf'] = cf.predict(X_test)

fig = make_subplots(rows=1, cols=1, x_title='x1', y_title='TE', column_titles=['TE_cf'])

for idx, col in zip([1], ['TE_cf']):
    fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_cf'], mode='markers', name=col, marker=dict(color='#636efa', size=5)), row=1, col=idx)
    fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_test'], mode='markers', name='TE_test', marker=dict(color='red', size=3)), row=1, col=idx)

fig.update_layout(width=800, height=500, showlegend=False, title='Blue scatters are predicted TE, red scatters are the ground truth').write_image('diagrams/causal_forest/comparison.png')
Image('diagrams/causal_forest/comparison.png')
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df_test = pd.DataFrame(X_test, columns=[f'x{i}' for i in range(10)])
df_test['TE_test'] = TE_test
df_test['TE_cf'] = cf.predict(X_test)

fig = make_subplots(rows=1, cols=1, x_title='x1', y_title='TE', column_titles=['TE_cf'])

for idx, col in zip([1], ['TE_cf']):
    fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_cf'], mode='markers', name=col, marker=dict(color='#636efa', size=5)), row=1, col=idx)
    fig.add_trace(go.Scatter(x=df_test['x1'], y=df_test['TE_test'], mode='markers', name='TE_test', marker=dict(color='red', size=3)), row=1, col=idx)

fig.update_layout(width=800, height=500, showlegend=False, title='Blue scatters are predicted TE, red scatters are the ground truth').write_image('diagrams/causal_forest/comparison.png')
Image('diagrams/causal_forest/comparison.png')

Out[30]: