Statistical Inference and Ensemble Machine Learning for Dependent Data
- Author(s): Davies, Molly Margaret
- Advisor(s): van der Laan, Mark J
- et al.
The focus of this dissertation is on extending targeted learning to settings with complex unknown dependence structure, with an emphasis on applications in environmental science and environmental health.
The bulk of the work in targeted learning and semiparametric inference in general has been with respect to data generated by independent units.
Truly independent, randomized experiments in the environmental sciences and environmental health are rare, and data indexed by time and/or space are quite common.
These scientific disciplines need flexible algorithms for model selection and model combining that can accommodate components such as physical process models and Bayesian hierarchical approaches. They also need inference that honestly and realistically handles limited knowledge about dependence in the data.
The goal of the research program reflected in this dissertation is to formalize results and build tools to address these needs.
Chapter 1 provides a brief introduction to the context and spirit of the work contained in this dissertation.
Chapter 2 focuses on Super Learner for spatial prediction. Spatial prediction is an important problem in many scientific disciplines, and plays an especially important role in the environmental sciences. We review the optimality properties of Super Learner in general and discuss the assumptions required in order for them to hold when using Super Learner for spatial prediction.
We present results of a simulation study confirming Super Learner works well in practice under a variety of sample sizes, sampling designs, and data-generating functions.
We also apply Super Learner to a real-world benchmark dataset for spatial prediction methods.
Appendix A contains a theorem extending an existing oracle inequality to the case of fixed design regression.
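To make the cross-validation idea behind Super Learner in Chapter 2 concrete, the following is a minimal sketch of a discrete Super Learner, that is, a cross-validated selector over a small library of candidate predictors. It is an illustration only and not the dissertation's implementation: the candidate learners (`fit_mean`, `fit_linear`, `fit_knn`), the simulated surface, and the use of random folds are all assumptions made for this example. In particular, random folds ignore spatial dependence, which is exactly the kind of assumption Chapter 2 discusses.

```python
import numpy as np

# Hypothetical candidate learners: each takes training data and
# returns a prediction function. None of these come from the dissertation.
def fit_mean(X, y):
    mu = y.mean()
    return lambda X_new: np.full(len(X_new), mu)

def fit_linear(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta

def fit_knn(X, y, k=5):
    def predict(X_new):
        # Euclidean distance from every new point to every training point
        d = np.linalg.norm(X_new[:, None, :] - X[None, :, :], axis=2)
        idx = np.argsort(d, axis=1)[:, :k]
        return y[idx].mean(axis=1)
    return predict

def cv_risks(X, y, learners, n_folds=5, seed=0):
    """Cross-validated mean squared error for each candidate learner."""
    rng = np.random.default_rng(seed)
    # Assumption: folds are drawn at random, ignoring spatial structure.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    sq_err = {name: 0.0 for name in learners}
    for val in folds:
        train = np.setdiff1d(np.arange(len(y)), val)
        for name, fit in learners.items():
            pred = fit(X[train], y[train])(X[val])
            sq_err[name] += np.sum((y[val] - pred) ** 2)
    return {name: s / len(y) for name, s in sq_err.items()}

def discrete_super_learner(X, y, learners, **kwargs):
    """Pick the candidate with the smallest CV risk and refit it on all data."""
    risks = cv_risks(X, y, learners, **kwargs)
    best = min(risks, key=risks.get)
    return best, learners[best](X, y), risks

# Simulated spatial surface on the unit square (illustrative only).
rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 2))
y = (np.sin(2 * np.pi * X[:, 0]) + np.cos(2 * np.pi * X[:, 1])
     + rng.normal(scale=0.1, size=300))

learners = {"mean": fit_mean, "linear": fit_linear, "knn": fit_knn}
best, predict, risks = discrete_super_learner(X, y, learners)
print(best, risks)
```

The full Super Learner goes one step further and estimates a weighted combination of the candidates rather than selecting a single one. The selector above is the simplest member of that family.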
Chapter 3 describes a new approach to standard error estimation called Sieve Plateau (SP) variance estimation, an approach that allows us to learn from sequences of influence function based variance estimators, even when the true dependence structure is poorly understood. SP variance estimation can be prohibitively computationally expensive if not implemented with care. Appendix D contains completely general, highly optimized, heavily commented code as a reference for future users.
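As a rough illustration of what a sequence of influence function based variance estimators can look like, the sketch below computes variance estimates for a sample mean of a time series, each assuming that observations further apart than a given lag are uncorrelated. This is not the SP procedure itself and is not taken from Appendix D. The function name `lag_truncated_variances`, the AR(1) simulation, and the choice of the sample mean as target parameter are assumptions made for this example. It only shows the kind of sequence that such a method could learn from: the estimates rise as the assumed lag grows and then level off once the assumed dependence covers the real dependence.

```python
import numpy as np

def lag_truncated_variances(x, max_lag):
    """Variance estimates for the sample mean of x, one per assumed lag.

    Entry t sums products of estimated influence function values for all
    pairs of observations at most t time steps apart, divided by n^2.
    """
    n = len(x)
    ic = x - x.mean()            # estimated influence function of the mean
    total = np.sum(ic * ic)      # lag 0: treats observations as independent
    estimates = [total / n**2]
    for lag in range(1, max_lag + 1):
        # add both (i, j) and (j, i) for pairs exactly `lag` apart
        total += 2.0 * np.sum(ic[:-lag] * ic[lag:])
        estimates.append(total / n**2)
    return np.array(estimates)

# Simulated AR(1) series with positive autocorrelation (illustrative only).
rng = np.random.default_rng(0)
n, phi = 2000, 0.6
noise = rng.normal(size=n)
x = np.empty(n)
x[0] = noise[0]
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]

variances = lag_truncated_variances(x, max_lag=30)
print(np.sqrt(variances[[0, 5, 10, 30]]))  # standard errors at selected lags
```

With positively correlated data, the lag 0 estimate, which treats the observations as independent, understates the variance. Each added lag costs one pass over the data, which hints at why careful implementation matters when the set of candidate dependence structures is large.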
Chapter 4 uses targeted learning techniques to examine the relationship between ventilation rate and illness absence in a California school district observed over a period of two years. There is much that is unknown about the relationship between ventilation rates and human health outcomes, and there is a particular need to know more with respect to the school environment. It would be helpful for policy makers and indoor environment scientists to have estimates of average classroom illness absence rates when the average ventilation rate in the recent past failed to achieve a variety of different thresholds. The aim of this work is to provide these estimates. These data are challenging to work with, as they constitute a clustered, discontinuous time series with unknown dependence structure at both the classroom and school level. We use Super Learner to estimate the relevant parts of the likelihood; targeted maximum likelihood to estimate the target parameters; and SP variance estimation to obtain standard error estimates.