---
title: "Treatment Effect Estimators as Weighted Outcomes"
subtitle: "Application 401(k) - average effects"
author: "Michael C. Knaus"
date: "11/24"
output:
  html_notebook:
    toc: true
    toc_float: true
    code_folding: show
---
*Replication comments:*
- *The many required $N \times N$ smoother matrices make this notebook relatively memory intensive (32 GB of RAM should suffice).*
- *Running the notebook within the replication Docker container ensures that results replicate perfectly. Otherwise, results might differ depending on which package versions you use.*
This notebook runs the application described in Section 5.1 of [Knaus (2024)](https://arxiv.org/abs/2411.11559). The first part replicates the results presented in the paper; the second part provides supplementary information.
# 1. Replication of paper results
## Getting started
First, load packages and set the seed:
```{r, message = FALSE, warning=FALSE}
if (!require("OutcomeWeights")) install.packages("OutcomeWeights", dependencies = TRUE); library(OutcomeWeights)
if (!require("hdm")) install.packages("hdm", dependencies = TRUE); library(hdm)
if (!require("grf")) install.packages("grf", dependencies = TRUE); library(grf)
if (!require("cobalt")) install.packages("cobalt", dependencies = TRUE); library(cobalt)
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE); library(tidyverse)
if (!require("viridis")) install.packages("viridis", dependencies = TRUE); library(viridis)
if (!require("gridExtra")) install.packages("gridExtra", dependencies = TRUE); library(gridExtra)
set.seed(1234)
```
Next, load the data. Here we use the 401(k) data of the `hdm` package. However, you can adapt the following code chunk to load any suitable data of your choice. Just make sure to call the treatment `D`, covariates `X`, and instrument `Z`. The rest of the notebook should run without further modifications.
```{r}
data(pension) # Find variable description if you type ?pension in console
# Treatment
D = pension$p401
# Instrument
Z = pension$e401
# Outcome
Y = pension$net_tfa
# Controls
X = model.matrix(~ 0 + age + db + educ + fsize + hown + inc + male + marr + pira + twoearn, data = pension)
var_nm = c("Age","Benefit pension","Education","Family size","Home owner","Income","Male","Married","IRA","Two earners")
colnames(X) = var_nm
```
## Run Double ML
In the following, we run double ML with default honest random forests (tuning only increases running time without changing the insights in this application). As standard implementations currently do not allow extracting the outcome smoother matrices, the `OutcomeWeights` package comes with a tailored internal implementation called `dml_with_smoother()`, which is used throughout.
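The implementation exploits that honest random forest predictions are linear smoothers: stacking the nuisance predictions, they can be written as

$$\hat{\mathbf{Y}} = \mathbf{S} \mathbf{Y},$$

where $\mathbf{S}$ is an $N \times N$ smoother matrix. `dml_with_smoother()` stores these matrices alongside the cross-fitted double ML results so that the outcome weights can be recovered afterwards.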
### 2-fold
First, we run all estimators with 2-fold cross-fitting:
```{r}
# 2 folds
dml_2f = dml_with_smoother(Y,D,X,Z,
n_cf_folds = 2)
results_dml_2f = summary(dml_2f)
plot(dml_2f)
```
Now, we use the `get_outcome_weights()` method to extract the outcome weights as described in the paper. To illustrate that the algebraic results provided in the paper are indeed numerical equivalences and not approximations, we check whether the weights multiplied by the outcome vector reproduce the conventionally generated point estimates.
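In the paper's notation, the check below verifies for each estimator (PLR, PLR-IV, AIPW, Wald-AIPW) that the point estimate is an exact weighted sum of outcomes,

$$\hat{\theta} = \boldsymbol{\omega}' \mathbf{Y} = \sum_{i=1}^N \omega_i Y_i.$$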
```{r}
omega_dml_2f = get_outcome_weights(dml_2f)
cat("ω'Y replicates point etimates?",
all.equal(as.numeric(omega_dml_2f$omega %*% Y),
as.numeric(results_dml_2f[,1])
))
```
```{r, echo = F, results='hide'}
# As the `dml_with_smoother()` objects are memory intensive because they store several N x N smoother matrices, it is convenient to remove the 2-fold one before proceeding. Comment out these lines if you have enough RAM and want to use the object later on.
rm(dml_2f)
gc()
```
### 5-fold
Run double ML also with 5-fold cross-fitting:
```{r}
# 5 folds
dml_5f = dml_with_smoother(Y,D,X,Z,
n_cf_folds = 5)
results_dml_5f = summary(dml_5f)
plot(dml_5f)
```
Then, extract the weights and confirm numerical equivalence:
```{r}
omega_dml_5f = get_outcome_weights(dml_5f)
cat("ω'Y replicates point etimates?",
all.equal(as.numeric(omega_dml_5f$omega %*% Y),
as.numeric(results_dml_5f[,1])
))
```
```{r, echo = F, results='hide'}
# As the `dml_with_smoother()` objects are memory intensive because they store several N x N smoother matrices, it is convenient to remove the 5-fold one before proceeding. Comment out these lines if you have enough RAM and want to use the object later on.
rm(dml_5f)
gc()
```
## Check covariate balancing
We use the infrastructure of the `cobalt` package to plot standardized mean differences. To make the outcome weights compatible with the package framework, we need to flip the sign of the weights of the untreated observations. This is achieved by multiplying the outcome weights by $2 \times D - 1$:
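As a quick illustration, $2 \times D - 1$ equals $-1$ for untreated and $+1$ for treated observations, so the multiplication flips the sign of the untreated weights while leaving the treated weights unchanged:

```{r}
# Cross-tabulate D against 2*D - 1: D = 0 maps to -1, D = 1 maps to +1
table(D, 2 * D - 1)
```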
```{r, message = F}
threshold = 0.1
create_love_plot = function(title, index) {
love.plot(
D ~ X,
weights = list(
"2-fold" = omega_dml_2f$omega[index, ] * (2*D-1),
"5-fold" = omega_dml_5f$omega[index, ] * (2*D-1)
),
position = "bottom",
title = title,
thresholds = c(m = threshold),
var.order = "unadjusted",
binary = "std",
abs = TRUE,
line = TRUE,
colors = viridis(3), # color-blind-friendly
shapes = c("circle", "triangle", "diamond")
)
}
# Now you can call this function for each plot:
love_plot_plr = create_love_plot("PLR", 1)
love_plot_plriv = create_love_plot("PLR-IV", 2)
love_plot_aipw = create_love_plot("AIPW", 3)
love_plot_waipw = create_love_plot("Wald-AIPW", 4)
love_plot_plr
love_plot_plriv
love_plot_aipw
love_plot_waipw
```
Create the combined plot that ends up in the paper as Figure 2:
```{r, results='hide', fig.width=12, fig.height=8}
figure2 = grid.arrange(
love_plot_plr, love_plot_aipw,
love_plot_plriv,love_plot_waipw,
nrow = 2
)
```
# 2. Supplementary results
## 10-fold
Alternatively, we can apply 10-fold instead of 5-fold cross-fitting:
```{r}
dml_10f = dml_with_smoother(Y,D,X,Z,n_cf_folds = 10)
results_dml_10f = summary(dml_10f)
omega_dml_10f = get_outcome_weights(dml_10f)
cat("ω'Y replicates point etimates?",
all.equal(as.numeric(omega_dml_10f$omega %*% Y),
as.numeric(results_dml_10f[,1])
))
rm(dml_10f)
```
We observe in the following love plots that 10-fold cross-fitting sometimes improves balancing relative to 5-fold and is sometimes worse. 5-fold cross-fitting thus seems to provide a good compromise between balancing quality and computational speed.
```{r, message = F}
create_love_plot = function(title, index) {
love.plot(
D ~ X,
weights = list(
"2-fold" = omega_dml_2f$omega[index, ] * (2*D-1),
"5-fold" = omega_dml_5f$omega[index, ] * (2*D-1),
"10-fold" = omega_dml_10f$omega[index, ] * (2*D-1)
),
position = "bottom",
title = title,
thresholds = c(m = threshold),
var.order = "unadjusted",
binary = "std",
abs = TRUE,
line = TRUE,
colors = viridis(4) # color-blind-friendly
)
}
# Now you can call this function for each plot:
love_plot_plr = create_love_plot("PLR", 1)
love_plot_plriv = create_love_plot("PLR-IV", 2)
love_plot_aipw = create_love_plot("AIPW", 3)
love_plot_waipw = create_love_plot("Wald-AIPW", 4)
love_plot_plr
love_plot_plriv
love_plot_aipw
love_plot_waipw
```
## Weights descriptives
Finally, we investigate descriptive statistics of the weights. Given the results of the paper, it is most interesting to observe that the "Sum of weights" of PLR(-IV) and Wald-AIPW are indeed only scale-normalized rather than forced to equal one. However, they all sum to values very close to one.
```{r}
cat("2-fold: \n")
summary(omega_dml_2f)
cat("\n\n5-fold: \n")
summary(omega_dml_5f)
# cat("\n\n10-fold: \n")
# summary(omega_dml_10f)
```
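To see the sums directly, they can also be computed by hand from the weight matrices. This is a minimal sketch assuming, as in the love-plot code above, that `omega` is an estimator-by-observation matrix with rows ordered PLR, PLR-IV, AIPW, Wald-AIPW:

```{r}
# Row sums of the weight matrices; each row corresponds to one estimator
rbind("2-fold" = rowSums(omega_dml_2f$omega),
      "5-fold" = rowSums(omega_dml_5f$omega))
```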
```{r, echo=F}
# This part is relevant if you run the notebook inside the Docker container and want to save graphs and the workspace image in a shared host volume called shared_files (uncomment and/or adjust on demand):
# ggsave("/home/rstudio/shared_files/Figure2.pdf", plot = figure2, width = 9, height = 6,dpi=300)
# save.image(file = "/home/rstudio/shared_files/Application_401k_average.RData")
```