With the rise of large and fine-grained data sets, researchers, physicians, businesses, and policymakers increasingly want to estimate treatment effect heterogeneity across individuals and contexts with ever-greater precision, in order to allocate resources effectively, assign treatments adequately, and understand the underlying causal mechanisms. In this thesis, we provide tools for estimating and understanding treatment effect heterogeneity.
Chapter 1 introduces a unifying framework for many estimators of the Conditional Average Treatment Effect (CATE), a function that describes treatment effect heterogeneity. We introduce meta-learners, algorithms that can be combined with any machine learning or regression method to estimate the CATE. We also propose a new meta-learner, the X-learner, that can adapt to structural properties of the underlying treatment effect, such as smoothness and sparsity. We then establish its desirable properties through simulations and theory, and apply it to two field experiments.
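To make the meta-learner idea concrete, the following is a minimal sketch of one simple and widely used meta-learner, the T-learner (it is not the X-learner proposed in this thesis). It fits one outcome model on the treated units and one on the control units, and estimates the CATE as the difference of their predictions. The names `LinearRegressor1D` and `t_learner` are hypothetical and chosen for illustration only; the one-feature least-squares model stands in for any base learner, and the data are a noise-free toy example.

```python
from statistics import mean


class LinearRegressor1D:
    """One-feature least squares; a stand-in for any base regression method."""

    def fit(self, x, y):
        x_bar, y_bar = mean(x), mean(y)
        sxx = sum((xi - x_bar) ** 2 for xi in x)
        sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
        self.slope = sxy / sxx if sxx > 0 else 0.0
        self.intercept = y_bar - self.slope * x_bar
        return self

    def predict(self, x):
        return [self.intercept + self.slope * xi for xi in x]


def t_learner(make_learner, x, w, y):
    """Fit separate outcome models for treated (w == 1) and control (w == 0)
    units and return a function that estimates the CATE at new points."""
    treated = [(xi, yi) for xi, wi, yi in zip(x, w, y) if wi == 1]
    control = [(xi, yi) for xi, wi, yi in zip(x, w, y) if wi == 0]
    mu1 = make_learner().fit([p[0] for p in treated], [p[1] for p in treated])
    mu0 = make_learner().fit([p[0] for p in control], [p[1] for p in control])

    def cate(x_new):
        # CATE estimate = predicted treated outcome - predicted control outcome.
        return [m1 - m0 for m1, m0 in zip(mu1.predict(x_new), mu0.predict(x_new))]

    return cate


# Toy data (illustrative only): control outcome is x, treated outcome is 3x,
# so the true CATE at x is 2x.
x = [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0]
w = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0.0, 1.0, 2.0, 3.0, 0.0, 3.0, 6.0, 9.0]

cate_hat = t_learner(LinearRegressor1D, x, w, y)
estimates = cate_hat([1.0, 2.0])
print(estimates)
```

Because the base learner is passed in as a factory, any regression method exposing `fit` and `predict` could be substituted without changing `t_learner`, which is the sense in which a meta-learner is agnostic to the underlying method.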
As part of this thesis, we created an R package, causalToolbox, that implements eight CATE estimators and several tools for estimating the CATE and understanding the underlying causal mechanisms. Chapter 2 focuses on the causalToolbox package and explains how it is structured and implemented. The package uses the same syntax for all implemented CATE estimators, which makes it easy for practitioners to switch between estimators and to compare them on a given data set. We give examples of how the package can be used to find a well-performing estimator for a given data set, how confidence intervals for the CATE can be computed, and how estimating the CATE for a unit with many estimators simultaneously can give practitioners a sense of which estimates are unstable and depend heavily on the chosen estimator.
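The idea of comparing many estimators on the same unit can be sketched in a few lines. The example below is only an illustration in Python and does not use the causalToolbox API: the function `flag_unstable`, the estimator names, the threshold, and all numbers are hypothetical. It computes, for each unit, the standard deviation of the CATE estimates across estimators and flags units whose spread exceeds a chosen threshold.

```python
from statistics import pstdev


def flag_unstable(estimates_by_estimator, threshold):
    """Return the per-unit spread of CATE estimates across estimators and the
    indices of units whose spread exceeds `threshold`."""
    names = list(estimates_by_estimator)
    n_units = len(estimates_by_estimator[names[0]])
    spreads = [
        pstdev([estimates_by_estimator[name][i] for name in names])
        for i in range(n_units)
    ]
    flagged = [i for i, s in enumerate(spreads) if s > threshold]
    return spreads, flagged


# Made-up estimates for two units from three hypothetical estimators.
toy_estimates = {
    "estimator_a": [0.50, 1.0],
    "estimator_b": [0.52, -0.5],
    "estimator_c": [0.48, 2.0],
}

spreads, unstable_units = flag_unstable(toy_estimates, threshold=0.25)
print(unstable_units)
```

In this toy input the three estimators nearly agree on the first unit and disagree sharply on the second, so only the second unit is flagged as depending heavily on the choice of estimator.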
Chapter 3 is an application of the causalToolbox package. It demonstrates the usefulness of the package in a simulation study set up by Carlos Carvalho, Jennifer Hill, Jared Murray, and Avi Feller for the Empirical Investigation of Methods for Heterogeneity Workshop at the 2018 Atlantic Causal Inference Conference, based on the National Study of Learning Mindsets.
When implementing the CATE estimators, we noticed a need for a variation of the Random Forests (RF) algorithm that works particularly well for statistical inference. We designed an R package, forestry, that implements a new version of the RF algorithm and several tools for statistical inference with it. In Chapter 4, we describe the problem that confidence interval estimation with RF can perform poorly in regions where RF is biased or in regions outside the support of the training data. We then introduce a new method that allows us to screen for points at which our confidence interval methods should not be used.
CATE estimates can be used to assign treatments to subjects, but in many studies, estimating the CATE is not the ultimate goal. Researchers often want to understand the underlying causal mechanisms. In Chapter 5, we discuss a modification of the RF algorithm that is particularly interpretable and allows practitioners to understand the underlying mechanism better. Usually, RF are based on deep regression trees that are difficult to understand. In this new version of the RF, we use linear response functions and very shallow trees to make the results more easily understandable. The algorithm finds splits in quasi-linear time and locally adapts to the smoothness of the underlying response functions. In an experimental study, we show that it leads to shallow and interpretable trees that compare favorably to other regression estimators on a broad range of real-world data sets.
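As a rough illustration of a shallow tree with linear response functions, the sketch below fits a depth-one tree on a single feature with a separate least-squares line in each leaf. It is not the algorithm of Chapter 5: the split is found by brute force over all candidate thresholds rather than in quasi-linear time, and the function names (`fit_line`, `fit_linear_stump`, `predict`), the `min_leaf` parameter, and the toy data are all hypothetical.

```python
def fit_line(x, y):
    """Least-squares line through (x, y); returns (slope, intercept, sse)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return slope, intercept, sse


def fit_linear_stump(x, y, min_leaf=2):
    """Depth-one tree with a linear model per leaf.

    Tries every split between distinct sorted feature values and keeps the one
    with the smallest total squared error. Returns
    (threshold, (slope, intercept) left, (slope, intercept) right),
    or None if no valid split exists."""
    pairs = sorted(zip(x, y))
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    best = None
    for k in range(min_leaf, len(xs) - min_leaf + 1):
        if xs[k - 1] == xs[k]:
            continue  # cannot split between identical feature values
        left = fit_line(xs[:k], ys[:k])
        right = fit_line(xs[k:], ys[k:])
        total = left[2] + right[2]
        if best is None or total < best[0]:
            best = (total, (xs[k - 1] + xs[k]) / 2, left[:2], right[:2])
    return None if best is None else best[1:]


def predict(model, x_new):
    """Evaluate the fitted stump at a single point."""
    threshold, (sl, il), (sr, ir) = model
    return il + sl * x_new if x_new <= threshold else ir + sr * x_new


# Toy piecewise-linear data: y = x for x < 4 and y = 10 - x for x >= 4.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0.0, 1.0, 2.0, 3.0, 6.0, 5.0, 4.0, 3.0]

model = fit_linear_stump(x, y)
print(model[0], predict(model, 1.0), predict(model, 6.0))
```

A single split with two fitted lines describes this toy response exactly, whereas a tree with constant leaves would need many more splits to approximate the same shape; this is the intuition behind pairing shallow trees with linear leaf models for interpretability.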