Semi-Parametric Estimation in Network Data and Tools for Conducting Complex Simulation Studies in Causal Inference
 Sofrygin, Oleg
 Advisor(s): van der Laan, Mark
Abstract
This dissertation is concerned with the application of robust semiparametric methods to problems of estimation in network-dependent data and the conduct of large-scale simulation studies for causal inference research in epidemiological and medical data. Specifically, Chapter 1 presents a modern semiparametric approach to estimation of causal effects in a population connected by a single social network. The connectivity of the population units will typically imply that the observed data on these units are no longer independent and identically distributed. Moreover, such social settings typically result in high-dimensional data. This chapter contributes to current statistical methodology by presenting an approach that allows valid estimation and inference and addresses the statistical issues specific to such networked population datasets. The framework of semiparametric estimation, called targeted maximum likelihood estimation (TMLE), is presented. This framework improves upon the existing methods by offering robustness, weakened sensitivity to near positivity violations, and the ability to deal with the high-dimensionality issues of social network data. In particular, this approach relies on the accurate reflection of the background knowledge available for a given scientific problem, allowing estimation and inference without having to make unrealistic assumptions about the structure of the data. In addition, this chapter generalizes previous work describing estimation of complex causal parameters, such as the direct treatment effects under interference and the causal effects of interventions on social network structure. Although the past decade has produced many contributions towards estimation of causal effects in social network settings, there has been considerably less research on the topic of variance estimation for such highly dependent data. 
This chapter presents an approach to constructing valid inference, providing a variance estimator that is scalable to very large datasets with highly connected observations. An efficient open-source software implementation of these methods also accompanies this chapter. Chapter 2 presents open-source software tools for the conduct of reproducible simulation studies for complex parameters that emerge from the application of causal inference methods in epidemiological and medical research. This simulation software is built on the framework of nonparametric structural equation modeling. This chapter also studies simulation-based testing of statistical methods in causal inference for longitudinal data with time-varying exposure and confounding. It contributes to existing literature by presenting a unified syntax for nonparametrically defining complex causal parameters, which can be used as the model-free and agnostic gold standard for comparison of different statistical methods for causal inference. For instance, this chapter provides various examples of specification and evaluation of causal parameters that arise naturally in longitudinal causal effect analyses when using marginal structural models (MSMs). The application of these newly developed software tools to the replication of several previously published simulation studies in causal inference is also described. Chapter 3 builds on the work described in Chapter 2 and addresses the issue of dependent data simulation for causal inference research in social network data. In particular, it provides a model-free approach to test the validity of various estimation procedures in simulated network settings. This chapter first outlines a nonparametric causal model for units connected by a network and provides various applied examples of simulations with social network data. 
This chapter also showcases a possible application of the highly scalable open-source software implementation of the semiparametric estimation methods described in Chapter 1. In particular, a large-scale social network simulation study is described, and the performance of three dependent-data estimators from Chapter 1 is examined. This simulation study also examines the problem of inference for network-dependent data, specifically by comparing the performance of the dependent-data TMLE variance estimator from Chapter 1 to the true TMLE variance derived from simulations. Finally, Chapter 3 concludes with a simulation study of an HIV epidemic described in terms of a longitudinal process that evolves over a static network in discrete time steps among several highly interconnected communities. The abstracts of the three works that make up this dissertation are reproduced below.
Chapter 1: This chapter describes a robust semiparametric approach to estimation and inference for the sample average treatment-specific mean in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many of the currently used statistical methods rely on the assumption of a specific parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in observational network data settings, resulting in invalid and anti-conservative statistical inference. In this chapter, we rely on recent methodological advances in targeted maximum likelihood estimation (TMLE) for data collected on a single population of causally connected units to describe an estimation approach that permits more realistic classes of data-generative models and provides valid statistical inference in the context of such network-dependent data. The approach is applied to an observational setting with a single-time-point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semiparametric statistical models. We then impose some restrictions on the possible set of data-generative distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical IID data distribution, where the latter distribution is a function of the observed network data-generating distribution. 
With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain IID data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study. We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure.
Chapter 2: This chapter introduces the simcausal R package, an open-source software tool for specification and simulation of complex longitudinal data structures that are based on nonparametric structural equation models. The package aims to provide a flexible tool for simplifying the conduct of transparent and reproducible simulation studies, with a particular emphasis on the types of data and interventions frequently encountered in real-world causal inference problems, such as observational data with time-dependent confounding, selection bias, and random monitoring processes. The package interface allows for concise expression of complex functional dependencies between a large number of nodes, where each node may represent a measurement at a specific time point. The package allows for specification and simulation of counterfactual data under various user-specified interventions (e.g., static, dynamic, deterministic, or stochastic). In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events. Finally, the package enables the computation of a selected set of user-specified features of the distribution of the counterfactual data that represent common causal quantities of interest, such as treatment-specific means, average treatment effects, and coefficients from working marginal structural models. The applicability of simcausal is demonstrated by replicating the results of two published simulation studies.
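The idea of simulating counterfactual data from a structural equation model can be illustrated with a minimal sketch. The code below is plain Python and does not use or reproduce the simcausal interface; the three-node model (W, A, Y), its coefficients, and the function names `simulate` and `mean_outcome` are all hypothetical choices made for illustration. An intervention replaces the structural equation for the exposure A, while the exogenous errors are held fixed through a shared seed.

```python
import random


def simulate(n, intervention=None, seed=0):
    """Draw n units from a toy structural equation model W -> A -> Y, W -> Y.

    If `intervention` is given, it replaces the equation for A with a
    user-supplied function of W (static or dynamic regimen).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Exogenous errors are always drawn, so the same seed yields the
        # same W and the same outcome noise under every intervention.
        u_w, u_a, u_y = rng.random(), rng.random(), rng.random()
        w = 1 if u_w < 0.5 else 0                      # baseline confounder
        if intervention is None:
            a = 1 if u_a < (0.7 if w else 0.3) else 0  # observed exposure
        else:
            a = intervention(w)                        # intervened exposure
        y = 1 if u_y < 0.2 + 0.3 * a + 0.4 * w else 0  # binary outcome
        rows.append((w, a, y))
    return rows


def mean_outcome(rows):
    """Mean of the outcome Y over simulated units."""
    return sum(y for _, _, y in rows) / len(rows)


N = 50_000
observed = simulate(N, seed=1)

# Treatment-specific means under two static interventions, and their contrast.
y1 = mean_outcome(simulate(N, intervention=lambda w: 1, seed=1))
y0 = mean_outcome(simulate(N, intervention=lambda w: 0, seed=1))
ate = y1 - y0  # the toy model implies a true value of 0.3

# Unadjusted contrast in the observed data, which is confounded by W.
treated = [y for _, a, y in observed if a == 1]
control = [y for _, a, y in observed if a == 0]
naive_diff = sum(treated) / len(treated) - sum(control) / len(control)

print(f"simulated ATE: {ate:.3f}, unadjusted contrast: {naive_diff:.3f}")
```

In this toy model the simulated counterfactual contrast serves as the model-free reference value against which an estimator applied to `observed` could be compared. A dynamic regimen can be passed in the same way, for example `intervention=lambda w: w`.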
Chapter 3: The past decade has seen an increasing body of literature devoted to the estimation of causal effects in network-dependent data. However, the validity of many classical statistical methods in such data is often questioned. There is an emerging need for objective and practical ways to assess which causal methodologies might be applicable and valid in such novel network-based datasets. In this chapter we describe a set of tools implemented as part of the simcausal R package that allow simulating data based on the nonparametric structural equation model for connected units. We also provide examples of how these simulations may be applied to the evaluation of different statistical methods for estimation of causal effects in such data. In particular, these simulation tools are targeted to the types of data and interventions frequently encountered in real-world causal inference research in social networks, such as observational studies with spillover or interference. We developed a novel R language interface that simplifies the specification of network-based functional relationships between connected units. Moreover, this network-based syntax can be combined with the syntax for specifying longitudinal data structures, allowing for simulations of network-based processes that evolve in time (e.g., contagion in epidemic modeling). We provide various examples of simulation studies that involve units connected by various network models. These simulations were designed to mimic the types of studies one might conduct in real life with the aim of answering specific causal public health questions. We also demonstrate one application of these new tools by conducting a simulation study that compares the performance of three estimators of the counterfactual mean outcome in a network-dependent data setting. Finally, we describe a simulation study with longitudinal data that mimics the spread of an HIV epidemic over time in highly interconnected communities.
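A network-based functional relationship with spillover can be sketched in a few lines. The code below is plain Python and is not the simcausal network syntax; the network model (each unit nominates a fixed number of random friends), the outcome coefficients, and the function name `simulate_network` are assumptions made only for illustration. Each unit's outcome depends on its own exposure and on a summary measure of its friends' exposures, here the fraction of treated friends.

```python
import random


def simulate_network(n, k, treat_prob, seed=0):
    """Simulate n connected units, each with k randomly chosen friends.

    `treat_prob` acts as a simple stochastic intervention: every unit is
    treated independently with this probability.
    """
    rng = random.Random(seed)
    # Sample k distinct friends for unit i from the other n - 1 units:
    # draw from 0..n-2 and shift indices at or above i to skip i itself.
    friends = [
        [j if j < i else j + 1 for j in rng.sample(range(n - 1), k)]
        for i in range(n)
    ]
    a = [1 if rng.random() < treat_prob else 0 for _ in range(n)]
    y = []
    for i in range(n):
        # Summary measure of the network: fraction of treated friends.
        frac_treated = sum(a[j] for j in friends[i]) / k
        # Outcome depends on own exposure (direct effect) and on the
        # friends' exposures (spillover).
        p = 0.1 + 0.2 * a[i] + 0.5 * frac_treated
        y.append(1 if rng.random() < p else 0)
    return friends, a, y


# Counterfactual mean outcomes under "treat everyone" and "treat no one".
_, _, y_all = simulate_network(2000, 5, treat_prob=1.0, seed=2)
_, _, y_none = simulate_network(2000, 5, treat_prob=0.0, seed=2)
print(f"treat all: {sum(y_all) / len(y_all):.3f}, "
      f"treat none: {sum(y_none) / len(y_none):.3f}")
```

Because units share friends, the simulated outcomes are not independent, which is the feature that makes such simulations useful for checking whether an estimator and its variance estimate remain valid under network dependence.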