This dissertation encompasses the development and application of the experiment-selector cross-validated targeted maximum likelihood estimator (ES-CVTMLE) for analyzing hybrid randomized-external data studies. The goal of these hybrid designs is to augment a small randomized controlled trial (RCT) with external data – in the form of the control arm(s) of previous trials or real-world healthcare data (RWD) – in order to increase power. Including external data, however, may also increase the causal gap, defined as the difference between the causal effect of interest and the statistical parameter that we will estimate from the data. The primary statistical challenges are 1) excluding external data that would introduce bias large enough to worsen coverage for the causal effect, while still including unbiased external data frequently enough to improve power, and 2) constructing confidence intervals that appropriately reflect that the causal gap may not be zero when external data are integrated.
In Chapter 1, we describe the development of the ES-CVTMLE methodology, focusing on the case where only external controls are available. We consider two methods of estimating the causal gap: 1) a function of the difference in conditional mean outcome under control between the RCT and combined experiments and 2) the estimated average treatment effect on a negative control outcome. We then define criteria for selecting the experiment (RCT alone or RCT combined with external data) that optimizes the estimated bias-variance tradeoff. To separate the data used for experiment selection from the data used for effect estimation, we develop an experiment-selector cross-validated targeted maximum likelihood estimator. We define the asymptotic distribution of the ES-CVTMLE under varying magnitudes of bias and construct confidence intervals by Monte Carlo simulation. We demonstrate the performance of the ES-CVTMLE compared to three other estimators for hybrid randomized-external data designs using simulations and a re-analysis of the LEADER trial of the effect of liraglutide versus placebo on cardiovascular outcomes.
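As a rough illustration of the experiment-selection step described above, the following Python sketch picks the candidate experiment with the smallest estimated squared bias plus variance. It is a minimal sketch under stated assumptions, not the EScvtmle package API: the `Candidate` class, the `select_experiment` function, the mean-squared-error form of the criterion, and all numeric values are hypothetical and chosen only for illustration. The cross-validation and targeted maximum likelihood steps are omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A candidate experiment with hypothetical, pre-computed estimates."""
    name: str
    bias_estimate: float      # estimated causal gap (illustrative value)
    variance_estimate: float  # estimated variance of the effect estimator


def select_experiment(candidates):
    """Return the candidate minimizing estimated squared bias plus variance.

    This is one assumed way to operationalize a bias-variance tradeoff;
    it is not taken from the package implementation.
    """
    return min(
        candidates,
        key=lambda c: c.bias_estimate ** 2 + c.variance_estimate,
    )


# Illustrative inputs only: the RCT alone is treated as unbiased but noisy,
# while the combined experiment has lower variance and some estimated bias.
candidates = [
    Candidate("RCT only", bias_estimate=0.0, variance_estimate=0.040),
    Candidate("RCT + external controls", bias_estimate=0.10, variance_estimate=0.015),
]
selected = select_experiment(candidates)
print(selected.name)
```

With these made-up numbers the combined experiment is selected, because its variance reduction outweighs its estimated squared bias; raising the estimated bias to 0.20 would tip the choice back to the RCT alone.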
In Chapter 2, we describe the development of the EScvtmle R software package, which implements the method described in Chapter 1. The package also extends this methodology to allow external participants to be integrated with both the active treatment and control arms of the trial. We include vignettes demonstrating use of the EScvtmle package with the publicly available WASH Benefits Bangladesh cluster RCT dataset.
The real data examples in Chapters 1 and 2 follow the Roadmap for Causal and Statistical Inference, a structured process that guides the design, analysis, and interpretation of studies anywhere on the spectrum from a traditional RCT to a fully observational study. In Chapter 3, we describe this Causal Roadmap for an audience of clinical and translational researchers. We also extend the Roadmap framework to consider how outcome-blind simulations may be used to compare quantitatively the characteristics of different potential study designs.
Chapter 4 represents the culmination of the previous work; we use a case study of semaglutide and cardiovascular outcomes to demonstrate application of this extended version of the Causal Roadmap to compare traditional RCT designs with a hybrid randomized-external data design. We demonstrate how following the Causal Roadmap can help to define an external control arm in a way that improves the plausibility of causal identification assumptions. We then use simulations to demonstrate the tradeoffs among these potential designs. Finally, we present a real data analysis using the ES-CVTMLE to estimate the effect of oral semaglutide versus standard-of-care on major adverse cardiovascular events based on the PIONEER 6 RCT and considering augmentation with RWD from Optum’s de-identified Clinformatics Data Mart Database (CDM) (2007-2022).