Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Targeted Maximum Likelihood Estimation and Ensemble Learning for Community-Level Data and Healthcare Claims Data


This dissertation discusses targeted maximum likelihood estimation (TMLE) and ensemble learning for community-level data and healthcare claims data, along with simulation studies and practical examples for causal inference research in medical data. Specifically, we address two common questions: how to estimate the community-based causal effect of community-level stochastic interventions, and how to apply data-adaptive ensemble learning to estimation problems in public health data.

Chapter 1 begins by reviewing targeted maximum likelihood estimation (TMLE). We also provide a more detailed summary of each of the remaining chapters.

Chapter 2 studies the framework for targeted maximum likelihood estimation and statistical inference for the causal effects of community-level treatments on individual-level outcomes, where the outcomes could be correlated because of interactions among individuals from the same community and because of shared community-level covariates. This chapter presents a new solution that considers the case in which the treatment mechanism may produce stochastically assigned exposures and the corresponding causal parameter may require a more easily achievable positivity assumption. Given two different structural equation models, we develop two semi-parametric efficient TMLEs for the estimation of such a community-based causal effect. The proposed TMLEs have several crucial advantages. First, both TMLEs can make use of individual-level data in the hierarchical setting, and can potentially reduce finite-sample bias and improve estimator efficiency. Second, the stochastic intervention framework provides a natural way of defining and estimating causal effects where the exposure variables are continuous or discrete with multiple levels, or cannot even be directly intervened on. Third, the positivity assumption needed for our proposed causal parameters can be weaker than the version of positivity required for other causal parameters.
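To make the positivity remark concrete, the following is a sketch of how a stochastic-intervention parameter is commonly written. All notation here is an assumption introduced for illustration and is not taken from the dissertation: E and W denote community-level and individual-level baseline covariates, A the community-level exposure, Y^c a community-level summary of the individual outcomes, g_0 the observed exposure mechanism, and g^* the user-specified stochastic intervention.

```latex
% Assumed notation (illustrative only):
%   \bar{Q}^c_0(a, e, w) = E_0[ Y^c \mid A = a, E = e, W = w ]
%   g_0(a \mid E, W): observed exposure mechanism
%   g^*(a \mid E, W): user-specified stochastic intervention
\begin{align*}
  \psi_0
    &= E_{0}\!\left[ \int \bar{Q}^c_0(a, E, W)\, g^*(a \mid E, W)\, d\mu(a) \right], \\
  \text{positivity:}\quad
    & \sup_{a}\ \frac{g^*(a \mid E, W)}{g_0(a \mid E, W)} < \infty
      \quad \text{almost everywhere.}
\end{align*}
```

Under this formulation, positivity only requires g^* to place mass where g_0 already does. A static intervention that sets A = a for every community instead requires g_0(a | E, W) > 0 for almost every covariate value, which is typically harder to satisfy.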

Chapter 3 builds on the work described in Chapter 2 and presents an open-source software tool for implementing TMLE of the average causal effect of community-level intervention(s) at a single time point. This software supports a wide variety of TMLE implementations. For example, the package supports univariate or multivariate arbitrary (i.e., static, dynamic or stochastic) interventions with a binary or continuous outcome. It also allows users to use either weighted intercept-based TMLE or unweighted covariate-based TMLE.

In Chapter 4, we propose a new ensemble approach to gain a better understanding of the natural history of nonalcoholic steatohepatitis (NASH). Super Learner (SL) is an ensemble method that uses V-fold cross-validation to build the optimal weighted combination of the predicted values from a library of user-specified prediction algorithms. Because data-adaptive methods are allowed in an SL library, SL can be used to avoid unrealistic parametric assumptions without overfitting the data in practice. The proposed AUC-maximizing ensemble approach couples each prediction model with a comprehensive feature selection algorithm, including a Bayesian risk ratio method, column-sparsity-based regularization, and L1 regularization.
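The Super Learner idea described above (V-fold cross-validated predictions from a library, combined with cross-validation-selected weights) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the dissertation's implementation: it minimizes squared-error risk rather than maximizing AUC, it omits feature selection, the three learners and the simulated data are hypothetical, and the convex weights are found by a coarse grid search that only works for a three-learner library.

```python
import random
from statistics import fmean

# Hypothetical library: each fitter returns a predict(x) function.
def fit_mean(xs, ys):
    m = fmean(ys)
    return lambda x: m

def fit_linear(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def fit_knn(xs, ys, k=5):
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return fmean(y for _, y in nearest)
    return predict

LIBRARY = [fit_mean, fit_linear, fit_knn]

def cross_validated_predictions(xs, ys, v=5):
    """Return z, where z[i][j] is learner j's prediction for observation i
    made by a fit that never saw observation i (V-fold cross-validation)."""
    n = len(xs)
    z = [[0.0] * len(LIBRARY) for _ in range(n)]
    for fold in (list(range(start, n, v)) for start in range(v)):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        for j, fit in enumerate(LIBRARY):
            predict_j = fit(train_x, train_y)
            for i in fold:
                z[i][j] = predict_j(xs[i])
    return z

def risk(w, z, ys):
    """Cross-validated mean squared error of the weighted combination."""
    return fmean((y - sum(wj * p for wj, p in zip(w, row))) ** 2
                 for row, y in zip(z, ys))

def convex_weights(z, ys, grid=20):
    """Grid search over non-negative weights summing to one (3 learners only)."""
    candidates = [(a / grid, b / grid, (grid - a - b) / grid)
                  for a in range(grid + 1) for b in range(grid + 1 - a)]
    return min(candidates, key=lambda w: risk(w, z, ys))

def super_learner(xs, ys, v=5):
    z = cross_validated_predictions(xs, ys, v)
    w = convex_weights(z, ys)
    fits = [fit(xs, ys) for fit in LIBRARY]  # refit each learner on all data
    predict = lambda x: sum(wj * f(x) for wj, f in zip(w, fits))
    return predict, w, z

# Simulated data (an assumption for the demo): a nonlinear signal plus noise.
rng = random.Random(0)
xs = [rng.uniform(-2.0, 2.0) for _ in range(200)]
ys = [x * x + 0.5 * x + rng.gauss(0.0, 0.3) for x in xs]

predict, weights, z = super_learner(xs, ys)
ensemble_risk = risk(weights, z, ys)
single_risks = [risk(e, z, ys) for e in ((1, 0, 0), (0, 1, 0), (0, 0, 1))]
print("weights:", weights)
print("ensemble CV risk:", ensemble_risk, "single-learner CV risks:", single_risks)
```

Because the grid includes the weight vectors that select a single learner, the ensemble's cross-validated risk in this sketch can never exceed that of the best individual learner in the library.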
