## Semi-Parametric Estimation in Network Data and Tools for Conducting Complex Simulation Studies in Causal Inference

- Author(s): Sofrygin, Oleg
- Advisor(s): van der Laan, Mark
- et al.

## Abstract

This dissertation is concerned with the application of robust semi-parametric methods to problems of estimation in network-dependent data and with the conduct of large-scale simulation studies for causal inference research in epidemiological and medical data.

Chapter 1 presents a modern semi-parametric approach to estimation of causal effects in a population connected by a single social network. The connectivity of the population units typically implies that the observed data on these units are no longer independent and identically distributed. Moreover, such social settings typically result in high-dimensional data. This chapter contributes to current statistical methodology by presenting an approach that allows valid estimation and inference and addresses the statistical issues specific to such networked population datasets. The semi-parametric estimation framework known as targeted maximum likelihood estimation (TMLE) is presented. This framework improves upon existing methods by offering robustness, reduced sensitivity to near-violations of positivity, and the ability to handle the high dimensionality of social network data. In particular, this approach relies on an accurate reflection of the background knowledge available for a given scientific problem, allowing estimation and inference without unrealistic assumptions about the structure of the data. In addition, this chapter generalizes previous work describing estimation of complex causal parameters, such as the direct treatment effects under interference and the causal effects of interventions on social network structure. Although the past decade has produced many contributions towards estimation of causal effects in social network settings, there has been considerably less research on variance estimation for such highly dependent data. This chapter presents an approach to constructing valid inference, providing a variance estimator that is scalable to very large datasets with highly connected observations. An efficient open-source software implementation of these methods accompanies this chapter.

Chapter 2 presents open-source software tools for the conduct of reproducible simulation studies for complex parameters that emerge from the application of causal inference methods in epidemiological and medical research. This simulation software is built on the framework of non-parametric structural equation modeling. This chapter also studies simulation-based testing of statistical methods in causal inference for longitudinal data with time-varying exposure and confounding. It contributes to the existing literature by presenting a unified syntax for non-parametrically defining complex causal parameters, which can be used as a model-free and agnostic gold standard for comparing different statistical methods for causal inference. For instance, this chapter provides various examples of specification and evaluation of causal parameters that arise naturally in longitudinal causal effect analyses when using marginal structural models (MSMs). The application of these newly developed software tools to the replication of several previously published simulation studies in causal inference is also described.

Chapter 3 builds on the work described in Chapter 2 and addresses the issue of dependent-data simulation for causal inference research in social network data. In particular, it provides a model-free approach to testing the validity of various estimation procedures in simulated network settings. This chapter first outlines a non-parametric causal model for units connected by a network and provides various applied examples of simulations with social network data. This chapter also showcases a possible application of the highly scalable open-source software implementation of the semi-parametric estimation methods described in Chapter 1. In particular, a large-scale social network simulation study is described, and the performance of three dependent-data estimators from Chapter 1 is examined. This simulation study also examines the problem of inference for network-dependent data, specifically by comparing the performance of the dependent-data TMLE variance estimator from Chapter 1 to the true TMLE variance derived from simulations. Finally, Chapter 3 concludes with a simulation study of an HIV epidemic, described in terms of a longitudinal process that evolves over a static network in discrete time steps among several highly inter-connected communities.

The abstracts of the three works that make up this dissertation are reproduced below.

Chapter 1: This chapter describes a robust semi-parametric approach to estimation and inference for the sample average treatment-specific mean in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many of the currently used statistical methods rely on the assumption of a specific parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in observational network data settings, resulting in invalid and anti-conservative statistical inference. In this chapter, we rely on recent methodological advances in targeted maximum likelihood estimation (TMLE) for data collected on a single population of causally connected units to describe an estimation approach that permits more realistic classes of data-generating models and provides valid statistical inference in the context of such network-dependent data. The approach is applied to an observational setting with a single time point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semi-parametric statistical models. We then impose some restrictions on the set of data-generating distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical IID data distribution, where the latter distribution is a function of the observed network data-generating distribution. With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain IID data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study. We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure.
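
The summary-measure assumption described in this chapter can be made concrete with a small sketch. The Python snippet below is illustrative only and is not the software that accompanies the dissertation: the function name `summary_measures`, the field names, and the particular summaries chosen (number of friends, sum of friends' treatments, mean of friends' baseline covariate) are assumptions made for this example. It shows how a known network lets each unit's dependence on other units be reduced to a fixed-dimension row of data.

```python
def summary_measures(friends, W, A):
    """Reduce each unit's neighbourhood to fixed-dimension summaries.

    friends[i] is the list of units that unit i is connected to (the
    known network); W and A are baseline covariates and treatments.
    The summaries below are hypothetical choices for illustration.
    """
    rows = []
    for i, F in enumerate(friends):
        n_friends = len(F)
        rows.append({
            "W": W[i],
            "A": A[i],
            "n_friends": n_friends,
            # Summary of the friends' treatments (spillover exposure).
            "sum_friend_A": sum(A[j] for j in F),
            # Summary of the friends' baseline covariates; 0.0 if isolated.
            "mean_friend_W": sum(W[j] for j in F) / n_friends if n_friends else 0.0,
        })
    return rows


# Three units: unit 0 is connected to units 1 and 2, unit 1 to unit 0,
# and unit 2 has no connections.
friends = [[1, 2], [0], []]
W = [1, 0, 1]
A = [0, 1, 1]
rows = summary_measures(friends, W, A)
```

Each row has the same length regardless of how many friends a unit has, which is what allows the network data to be handed to an estimator written for ordinary rectangular data.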

Chapter 2: This chapter introduces the simcausal R package, an open-source software tool for specification and simulation of complex longitudinal data structures that are based on non-parametric structural equation models. The package aims to provide a flexible tool for simplifying the conduct of transparent and reproducible simulation studies, with a particular emphasis on the types of data and interventions frequently encountered in real-world causal inference problems, such as observational data with time-dependent confounding, selection bias, and random monitoring processes. The package interface allows for concise expression of complex functional dependencies between a large number of nodes, where each node may represent a measurement at a specific time point. The package allows for specification and simulation of counterfactual data under various user-specified interventions (e.g., static, dynamic, deterministic, or stochastic). In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events. Finally, the package enables the computation of a selected set of user-specified features of the distribution of the counterfactual data that represent common causal quantities of interest, such as treatment-specific means, average treatment effects, and coefficients from working marginal structural models. The applicability of simcausal is demonstrated by replicating the results of two published simulation studies.
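
The idea of simulating observed and counterfactual data from a structural equation model can be sketched in a few lines. The Python snippet below is a toy illustration and does not use or reproduce the simcausal interface: the three-node model (W, A, Y), its coefficients, and the function names `simulate`, `mean_outcome`, and `true_psi` are all assumptions made for this example. It draws data under the observed treatment mechanism, then under a static intervention that sets the treatment to a fixed value, and compares the Monte Carlo treatment-specific mean with its exact value.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def simulate(n, seed, set_A=None):
    """Draw n units from a toy structural equation model W -> A -> Y.

    If set_A is given, the treatment equation is replaced by that
    constant (a static intervention); otherwise A depends on W.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        W = int(rng.random() < 0.5)
        u_A = rng.random()  # exogenous error for A, drawn even if unused
        u_Y = rng.random()  # exogenous error for Y
        A = int(u_A < expit(-0.5 + W)) if set_A is None else set_A
        Y = int(u_Y < expit(-1.0 + 0.8 * A + 0.6 * W))
        data.append((W, A, Y))
    return data


def mean_outcome(data):
    return sum(y for _, _, y in data) / len(data)


def true_psi(a):
    """Exact E[Y_a] for this toy model, averaging over P(W=1) = 0.5."""
    return 0.5 * expit(-1.0 + 0.8 * a) + 0.5 * expit(-1.0 + 0.8 * a + 0.6)


observed = simulate(50000, seed=1)
counterfactual_1 = simulate(50000, seed=1, set_A=1)
counterfactual_0 = simulate(50000, seed=1, set_A=0)
psi_1 = mean_outcome(counterfactual_1)
psi_0 = mean_outcome(counterfactual_0)
ate = psi_1 - psi_0
```

Because the counterfactual means are computed directly from the data-generating equations rather than from any fitted model, they can serve as a model-free benchmark against which estimators applied to `observed` are compared.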

Chapter 3: The past decade has seen an increasing body of literature devoted to the estimation of causal effects in network-dependent data. However, the validity of many classical statistical methods in such data is often questioned. There is an emerging need for objective and practical ways to assess which causal methodologies might be applicable and valid in such novel network-based datasets. In this chapter we describe a set of tools implemented as part of the simcausal R package that allow simulating data based on the non-parametric structural equation model for connected units. We also provide examples of how these simulations may be applied to the evaluation of different statistical methods for estimation of causal effects in such data. In particular, these simulation tools are targeted to the types of data and interventions frequently encountered in real-world causal inference research in social networks, such as observational studies with spillover or interference. We developed a novel R language interface that simplifies the specification of network-based functional relationships between connected units. Moreover, this network-based syntax can be combined with the syntax for specifying longitudinal data structures, allowing for simulations of network-based processes that evolve in time (e.g., contagion in epidemic modeling). We provide various examples of simulation studies that involve units connected by various network models. These simulations were designed to mimic the types of studies one might conduct in real life with the aim of answering specific causal public health questions. We also demonstrate one application of these new tools by conducting a simulation study that compares the performance of three estimators of the counterfactual mean outcome in a network-dependent data setting. Finally, we describe a simulation study with longitudinal data that mimics the spread of an HIV epidemic over time across highly inter-connected communities.
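
A network-based process that evolves in discrete time can be sketched as follows. The Python snippet below is a generic toy contagion model, not the HIV simulation from the dissertation and not the simcausal syntax: the ring-shaped network, the transmission rule, and the names `ring_network`, `simulate_epidemic`, `p_transmit`, and `effect` are assumptions made for this example. Each susceptible unit's infection probability at a given time step depends on how many of its network neighbours were infected at the previous step, and a hypothetical treatment multiplies the per-contact transmission probability.

```python
import random


def ring_network(n, k):
    """Static network: each unit is linked to its k nearest units on each side."""
    return [[(i + d) % n for d in range(-k, k + 1) if d != 0] for i in range(n)]


def simulate_epidemic(friends, t_max, p_transmit, treated, effect, seed):
    """Simulate infection status over t_max discrete time steps.

    p_transmit is the per-contact transmission probability for an
    untreated unit; for treated units it is multiplied by effect.
    Returns the prevalence at times 0, 1, ..., t_max.
    """
    rng = random.Random(seed)
    n = len(friends)
    infected = [i == 0 for i in range(n)]  # a single initial case
    prevalence = [sum(infected) / n]
    for _ in range(t_max):
        new_status = list(infected)
        for i in range(n):
            if infected[i]:
                continue
            # Spillover: exposure is the number of infected neighbours.
            n_infected_friends = sum(infected[j] for j in friends[i])
            p = p_transmit * (effect if treated[i] else 1.0)
            p_infection = 1.0 - (1.0 - p) ** n_infected_friends
            if rng.random() < p_infection:
                new_status[i] = True
        infected = new_status
        prevalence.append(sum(infected) / n)
    return prevalence


network = ring_network(20, k=1)
nobody_treated = [False] * 20
everybody_treated = [True] * 20

# Certain transmission: the infection advances one unit per side per step.
certain = simulate_epidemic(network, 3, 1.0, nobody_treated, 1.0, seed=0)
# A fully protective treatment given to everyone stops the spread.
protected = simulate_epidemic(network, 3, 1.0, everybody_treated, 0.0, seed=0)
# A stochastic run with partial transmission.
partial = simulate_epidemic(network, 10, 0.3, nobody_treated, 1.0, seed=0)
```

Running the same process under different treatment assignments, as in `certain` and `protected`, gives counterfactual prevalence curves whose contrast defines a causal effect of the intervention on the simulated epidemic.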