## Semiparametric and Robust Methods for Complex Parameters in Causal Inference

- Author(s): Zheng, Wenjing
- Advisor(s): van der Laan, Mark
- Chambaz, Antoine
- et al.

## Abstract

This dissertation focuses on developing robust semiparametric methods for complex parameters that emerge at the interface of causal inference and biostatistics, with applications to epidemiological and medical research. Specifically, it addresses three important topics: Part I (chapter 1) presents a framework to construct and analyze group sequential covariate-adjusted response-adaptive (CARA) randomized controlled trials (RCTs) that admits the use of data-adaptive approaches in constructing the randomization schemes and in estimating the conditional response model. This framework adds to the existing literature on CARA RCTs by allowing flexible options in both their design and analysis. Part II (chapters 2 and 3) concerns two parameters that arise in longitudinal causal effect analysis using marginal structural models (MSMs). Chapter 2 presents a targeted maximum likelihood estimator (TMLE) for the dynamic MSM for the hazard function. This estimator improves upon the existing inverse probability weighted (IPW) estimators by providing efficiency gains and robustness protection against model misspecification. Chapter 3 addresses the issue of effect modification (in an MSM) by an effect modifier that is post exposure. This parameter is particularly relevant if an effect modifier of interest is missing at random, or if one wishes to evaluate the effect modification of a second-line-treatment by a post first-line-treatment variable, where assignment of the first-line-treatment shares common determinants with the outcome of interest. We also present a TMLE for this parameter. Part III (chapters 4 and 5) addresses semiparametric inference for mediation analysis. Chapter 4 presents a TMLE for the natural direct and indirect effects in a one-time point setting; it improves upon existing estimators by offering robustness, weakened sensitivity to near positivity violations, and potential applications to situations with high-dimensional mediators.
Chapter 5 studies longitudinal mediation analysis with time-varying exposure and mediators. In it, we propose a reformulation of the mediation problem in terms of stochastic interventions, establish an identification formula for the mediation functional, and present a TMLE for this parameter. This chapter contributes to the existing literature by presenting a nonparametrically defined parameter of interest in longitudinal mediation and a multiply robust and efficient estimator for it.

Chapter 1: An adaptive trial design allows pre-specified modifications to some aspects of the on-going trial based on analysis of the accruing data, while preserving the validity and integrity of the trial. This flexibility potentially translates into more efficient studies (e.g., shorter duration, fewer subjects) or a greater chance of answering clinical questions of interest (e.g., detecting a treatment effect if one exists, broader dose-response information, etc.). In an adaptive CARA RCT, the treatment randomization schemes are allowed to depend on the patient's pre-treatment covariates, and the investigators have the opportunity to adjust these schemes during the course of the trial based on accruing information, including previous responses, in order to meet some pre-specified objectives. In a group-sequential CARA RCT, such adjustments take place at interim time points given by sequential inclusion of blocks of c patients, where c ≥ 1 is a pre-specified integer. In this chapter, we present a novel group-sequential CARA RCT design and corresponding analytical procedure that admit the use of flexible approaches in constructing randomization schemes and a wide range of data-adaptive techniques in estimating the conditional response model. Under the proposed framework, the sequence of randomization schemes is group-sequentially determined, using the accruing data, by targeting a formal, user-specified optimal randomization design. The parameter of interest is nonparametrically defined and is estimated using the paradigm of targeted minimum loss estimation. We establish that under appropriate empirical process conditions, the resulting sequence of randomization schemes converges to a fixed design, and the proposed estimator is consistent and asymptotically Gaussian, with an asymptotic variance that is estimable from data, thus giving rise to valid confidence intervals of given asymptotic levels.
To illustrate the proposed framework, we consider LASSO regression in estimating the conditional outcome given treatment and baseline covariates. The asymptotic results ensue under minimal conditions on the growth of the dimension of the regression coefficients and mild conditions on the complexity of the classes of randomization schemes.

Chapter 2: In many applications, one is interested in the effect of a longitudinal exposure on a time-to-event process. In particular, consider a study where subjects are followed over time; in addition to their baseline covariates, at various time points we also record their time-varying exposure of interest, time-varying covariates, and indicators for the event of interest (say death). Time-varying confounding is ubiquitous in these situations: the exposure of interest depends on past covariates that confound the effect of the exposure on the outcome of interest, and in turn, exposure affects future confounders. Right censoring may also be present in a study of this nature, often in response to past covariates and exposure. One way to assess the comparative effect of different regimens of interest is to study the hazard as a function of such regimens. The features of this hazard are often encoded in a marginal structural model. This chapter builds upon the work of Petersen, Schwab, Gruber, Blaser, Schomaker, and van der Laan (2014) to present a targeted maximum likelihood estimator for the marginal structural model for the hazard function under longitudinal dynamic interventions. The proposed estimator is efficient and doubly robust, and hence offers an improvement over the incumbent IPW estimator.
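As a rough illustration of the weighting idea behind the IPW estimators mentioned above, the following sketch uses a deliberately simplified single-time-point setting with a binary exposure and one binary confounder. It is not the longitudinal hazard MSM estimator of this chapter, nor its TMLE; the data, the variable names (`w`, `a`, `y`), and the helper functions are all hypothetical and chosen only so that the arithmetic can be checked by hand.

```python
# Toy single-time-point illustration of inverse probability weighting (IPW).
# All data and names here are hypothetical; this is NOT the dissertation's estimator.


def build_toy_data():
    """Return (w, a, y) records in which confounder w drives both exposure a and outcome y."""
    records = []
    # (w, a, count): exposure is rare when w == 0 and common when w == 1.
    for w, a, count in [(0, 1, 20), (0, 0, 80), (1, 1, 80), (1, 0, 20)]:
        y = 1 + 2 * a + 3 * w  # deterministic outcome; the true effect of a is 2
        records.extend([(w, a, y)] * count)
    return records


def naive_difference(records):
    """Unadjusted contrast of mean outcomes between exposed and unexposed."""
    treated = [y for _, a, y in records if a == 1]
    untreated = [y for _, a, y in records if a == 0]
    return sum(treated) / len(treated) - sum(untreated) / len(untreated)


def ipw_difference(records):
    """IPW contrast using the empirical exposure probability within each stratum of w."""
    n = len(records)
    prob_exposed = {}
    for w in {w_i for w_i, _, _ in records}:
        exposures = [a for w_i, a, _ in records if w_i == w]
        prob_exposed[w] = sum(exposures) / len(exposures)
    # Each subject is weighted by the inverse probability of the exposure received.
    mean_exposed = sum(y / prob_exposed[w] for w, a, y in records if a == 1) / n
    mean_unexposed = sum(y / (1 - prob_exposed[w]) for w, a, y in records if a == 0) / n
    return mean_exposed - mean_unexposed


records = build_toy_data()
print("naive contrast:", naive_difference(records))
print("IPW contrast:  ", ipw_difference(records))
```

In this toy data set the unadjusted contrast is about 3.8, whereas the weighted contrast recovers the built-in effect of 2, because the weights balance the confounder across exposure groups. The sketch says nothing about efficiency or double robustness, which are the properties the chapter's TMLE is designed to add.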

Chapter 3: A crucial component of comparative effectiveness research is evaluating the modification of an exposure's effect by a given set of baseline covariates (effect modifiers). In complex longitudinal settings where time-varying confounding exists, this effect modification analysis is often performed using a marginal structural model. Generally, the conditioning effect modifiers in an MSM are cast as variables of the observed past. Yet, in some applications the effect modifiers of interest are in fact counterfactual. For instance, for a specific value of the first-line treatment, one may wish to evaluate the effect modification of a second-line-treatment by a post first-line-treatment variable, wherein the first-line-treatment assignment shares common determinants with the outcome of interest. In this case a simple stratification on the first-line treatment will only yield effect modification over a subpopulation given by said determinants. Hence, the desired parameter of interest should be formulated in terms of randomization on first-line treatment as well. In another example, the effect modifiers may be subject to missingness, which may depend on other baseline confounders; a simple complete-case analysis may introduce selection bias due to the high correlation of these confounders with the missingness of the effect modifier. In this case, one would formulate the desired parameter of interest in terms of an intervention on missingness. We call these counterfactual effect modifiers. In such situations, analysis by stratification alone may harbor selection bias. In this chapter, we investigate MSMs defined by counterfactual effect modifiers. Firstly, we determine the identification of the causal dose-response curve and MSM parameters in this setting.
Secondly, we establish the semiparametric efficiency theory for these statistical parameters, and present a substitution-based, semiparametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology. However, as we shall see, due to the form of the efficient influence curve, the implementation of this estimator may prove arduous in applications where the effect modifier is high dimensional. To address this problem, our third contribution is a projected influence curve (and the corresponding TMLE), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence curve becomes taxing. In addition to these two robust estimators, we also present an IPW estimator and a non-targeted G-computation estimator.

Chapter 4: In many causal inference problems, one is interested in the direct causal effect of an exposure on an outcome of interest that is not mediated by certain intermediate variables. Robins and Greenland (1992) and Pearl (2001) formalized the definition of two types of direct effects (natural and controlled) under the counterfactual framework. The efficient influence curves (under a nonparametric model) for the various natural effect parameters and their general robustness conditions, as well as an estimating-equation-based estimator using the efficient influence curve, are provided in Tchetgen Tchetgen and Shpitser (2011a). In this chapter, we apply the targeted maximum likelihood framework to construct a semiparametric efficient, multiply robust, substitution estimator for the natural direct effect which satisfies the efficient influence curve equation derived in Tchetgen Tchetgen and Shpitser (2011a). We note that the robustness conditions in Tchetgen Tchetgen and Shpitser (2011a) may be weakened, thereby placing less reliance on the estimation of the mediator density. More precisely, the proposed estimator is asymptotically unbiased if any one of the following holds: (i) the conditional mean outcome given exposure, mediator, and confounders, and the mediated mean outcome difference are consistently estimated; (ii) the exposure mechanism given confounders, and the conditional mean outcome are consistently estimated; or (iii) the exposure mechanism and the mediator density, or the exposure mechanism and the conditional distribution of the exposure given confounders and mediator, are consistently estimated. If all three conditions hold, then the effect estimate is asymptotically efficient. Extensions to the natural indirect effect are also discussed.
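For orientation, the natural direct and indirect effects can be written in standard counterfactual notation. The symbols here are assumptions introduced for illustration, not taken from the abstract: $W$ denotes confounders, $A$ a binary exposure, $M$ the mediator, $Y$ the outcome, $Y(a, m)$ and $M(a)$ the corresponding counterfactuals, and $\bar{Q}(a, m, w) = E[Y \mid A = a, M = m, W = w]$.

$$
\begin{aligned}
\mathrm{NDE} &= E\big[Y(1, M(0))\big] - E\big[Y(0, M(0))\big], \\
\mathrm{NIE} &= E\big[Y(1, M(1))\big] - E\big[Y(1, M(0))\big], \\
E\big[Y(1, M(1))\big] - E\big[Y(0, M(0))\big] &= \mathrm{NDE} + \mathrm{NIE}.
\end{aligned}
$$

Under the usual identifying assumptions for natural effects (which the abstract does not spell out), the natural direct effect reduces to a functional of the observed data distribution:

$$
\mathrm{NDE} = E_W\Big[\, E\big[\bar{Q}(1, M, W) - \bar{Q}(0, M, W) \,\big|\, A = 0, W\big] \Big].
$$

The inner conditional expectation is presumably what condition (i) above calls the mediated mean outcome difference. Written this way, it is a regression on $A$ and $W$ rather than an integral against an explicitly modeled mediator density.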

Chapter 5: In this chapter, we study the effect of a time-varying exposure mediated by a time-varying intermediate variable. More specifically, consider a study where baseline covariates, time-varying treatment, time-varying mediator, time-varying covariates, and an outcome process are observed on subjects that are followed over time. The treatment of interest is influenced by past covariates and mediator, and affects future covariates and mediator. Right censoring, if present, occurs in response to past covariates and treatment. We also allow the outcome to be a time-to-event (say survival) process, in which case, at each time we record whether death has occurred. Due to subtleties that are unique to time-varying exposures and mediators, we reformulate the mediation problem in terms of stochastic interventions, as proposed by Didelez, Dawid, and Geneletti (2006) in the one-time point setting. Upon establishing the estimands of interest, we derive the efficient influence curves and establish their robustness properties. Applying the targeted maximum likelihood methodology, we use these efficient influence curves to construct multiply robust and efficient estimators. We also present an IPW estimator and a non-targeted substitution estimator for these parameters.