Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them to achieve a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation (C-TMLE) procedure. In this dissertation, we first resolve the computational issue in the widely used greedy variable selection C-TMLE. We then investigate how to extend the discrete, variable selection C-TMLE to more general model selection problems.
Chapter 1 begins by introducing the framework of causal inference in observational studies. We introduce the non-parametric structural equation model for modeling the data generating distribution. We briefly review the targeted minimum loss-based estimation (TMLE). We also introduce the general template of C-TMLE and its greedy-search variable selection version.
In Chapter 2, we propose a template for scalable variable selection C-TMLEs to overcome the computational burden of the greedy variable selection C-TMLE. The original instantiation of the C-TMLE template can be presented as a greedy forward stepwise C-TMLE algorithm. It does not scale well when the number $p$ of covariates increases drastically. This motivates the introduction of a novel instantiation of the C-TMLE template in which the covariates are pre-ordered. Its time complexity is $\mathcal{O}(p)$, as opposed to the original $\mathcal{O}(p^2)$, a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb for developing other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation, the SL-C-TMLE algorithm, which enables a data-driven choice of the better pre-ordering strategy for the problem at hand. Its time complexity is $\mathcal{O}(p)$ as well. The computational burden and relative performance of these algorithms are compared in simulation studies involving fully synthetic data or partially synthetic data based on a real-world large electronic health database, and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy C-TMLE algorithm is unacceptably slow. Simulation studies seem to indicate that our scalable C-TMLE and SL-C-TMLE algorithms work well.
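The gap between the two complexities can be made concrete with a minimal counting sketch. It assumes that cost is measured by the number of candidate propensity score fits evaluated, that the greedy search tries every remaining covariate at each step, and that the pre-ordered variant adds covariates one at a time in a fixed order. The function names are hypothetical and are not part of the \emph{ctmle} package.

```python
def greedy_candidate_fits(p: int) -> int:
    """Count candidate fits in a forward stepwise search over p covariates.

    Assumption: at each step, every covariate not yet selected is tried
    once and the best one is kept, so step k costs p - k + 1 fits.
    """
    fits = 0
    remaining = p
    while remaining > 0:
        fits += remaining  # try each remaining covariate once
        remaining -= 1     # the best one joins the model
    return fits


def preordered_candidate_fits(p: int) -> int:
    """Count candidate fits when the covariates are pre-ordered.

    Assumption: covariates enter in a fixed order, one fit per step.
    """
    return p


for p in (10, 100, 1000):
    print(p, greedy_candidate_fits(p), preordered_candidate_fits(p))
```

Under these assumptions the greedy count is $p(p+1)/2$, which grows quadratically, while the pre-ordered count grows linearly in $p$.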
In Chapter 3, we extend C-TMLE to a more general model selection problem: we apply C-TMLE to select from a set of continuously indexed estimators of a nuisance parameter, the propensity score (PS). PS models have traditionally been selected based on goodness-of-fit for the treatment mechanism itself, without consideration of the causal parameter of interest. In contrast, C-TMLE takes into account information on the causal parameter of interest when selecting a PS model. This ``collaborative learning'' considers variable associations with both treatment and outcome when selecting a PS model, in order to optimize the bias-variance trade-off in the estimated treatment effect. In this study, we introduce a novel approach for collaborative model selection when using the LASSO estimator for PS estimation in high-dimensional covariate settings. To demonstrate the importance of selecting the PS model collaboratively, we designed quasi-experiments based on a real electronic healthcare database, where only the potential outcomes were manually generated, and the treatment and baseline covariates remained unchanged. Results showed that the C-TMLE algorithm outperformed the competing estimators in both point estimation and confidence interval coverage. In addition, the PS model selected by C-TMLE could be applied to other PS-based estimators, which also resulted in substantive improvement in both point estimation and confidence interval coverage. We illustrate the discussed concepts through an empirical example comparing the effects of non-selective nonsteroidal anti-inflammatory drugs with those of selective COX-2 inhibitors on gastrointestinal complications in a population of Medicare beneficiaries.
In Chapter 4, we propose using C-TMLE to adaptively truncate the propensity score when there are practical positivity violations. The positivity assumption, or the experimental treatment assignment (ETA) assumption, is important for identifiability in causal inference. Even if the positivity assumption holds, practical violations of this assumption may jeopardize the finite-sample performance of the causal estimator. One consequence of practical violations of the positivity assumption is extreme values in the estimated propensity score. A common practice to address this issue is to truncate the PS estimate when constructing PS-based estimators. In this study, we propose a novel adaptive truncation method, Positivity-C-TMLE, based on the C-TMLE methodology. We further show how to construct a robust confidence interval using a targeted variance estimator. We evaluate our approach in a variety of simulations by comparing it with other commonly studied estimators, in terms of both point estimation and confidence interval coverage. Results show that by adaptively truncating the estimated PS with a more targeted objective function, the Positivity-C-TMLE estimator achieves the best performance in both point estimation and confidence interval coverage among all estimators considered.
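The common practice of fixed-threshold truncation mentioned above can be sketched as follows. This is only an illustration of the baseline that the adaptive method is meant to improve on: it is not the Positivity-C-TMLE procedure, it uses a plain inverse-probability-weighted (IPW) estimator, and the function names, toy data, and thresholds are all hypothetical.

```python
def truncate_ps(ps, gamma):
    """Clip each estimated propensity score into [gamma, 1 - gamma]."""
    if not 0.0 <= gamma < 0.5:
        raise ValueError("gamma must lie in [0, 0.5)")
    return [min(max(g, gamma), 1.0 - gamma) for g in ps]


def ipw_ate(y, a, ps):
    """Inverse-probability-weighted estimate of E[Y(1)] - E[Y(0)]."""
    n = len(y)
    total = 0.0
    for yi, ai, gi in zip(y, a, ps):
        # Treated units are weighted by 1/g, controls by 1/(1 - g).
        total += ai * yi / gi - (1 - ai) * yi / (1.0 - gi)
    return total / n


# Hypothetical toy data: outcome, binary treatment, estimated PS.
# Two of the PS values are extreme, mimicking a practical positivity violation.
y = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
a = [1, 0, 1, 0, 0, 1]
ps_hat = [0.001, 0.30, 0.60, 0.999, 0.45, 0.70]

# gamma = 0.0 leaves the PS untouched; larger values bound the weights.
for gamma in (0.0, 0.01, 0.05):
    ps_trunc = truncate_ps(ps_hat, gamma)
    print(gamma, ipw_ate(y, a, ps_trunc))
```

With a fixed threshold, the analyst must pick `gamma` in advance. The approach described above instead selects the truncation level in a data-adaptive way, using a criterion targeted at the causal parameter.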
The code for all the variations of C-TMLE in this dissertation is publicly available in the \emph{ctmle} R package.