On Model Determination, Prediction and Statistical Learning: The Case of Space-Time Data
Problems of model determination, prediction and statistical learning for space-time data arise in many fields. Evaluation metrics developed for independent data are overly optimistic in the dependent context of space-time data and are therefore unreliable for model selection, tuning and averaging. This dissertation makes three contributions to space-time prediction, with applications in air pollution exposure modeling. First, it formalizes the prediction error associated with the spatial interpolation of space-time data and investigates a variety of cross-validation (CV) procedures for estimating that error. Consistent with recent best practice, location-based CV is shown to be appropriate for estimating spatial interpolation error, as illustrated in our analysis of California wildfire data. Interestingly, commonly held notions of a bias-variance trade-off with CV fold size do not carry over trivially to dependent data, and we recommend leave-one-location-out (LOLO) CV as the preferred prediction error metric for spatial interpolation.
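The leave-one-location-out idea can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the data are simulated, the model is an ordinary least-squares fit, and all variable names (`locations`, `fold_mse`, `lolo_rmse`, etc.) are hypothetical. The key point is that each CV fold holds out every record from one location, so the error estimate mimics interpolation to an unmonitored site.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical space-time data: 20 monitoring locations, 30 daily records each.
n_loc, n_day = 20, 30
locations = np.repeat(np.arange(n_loc), n_day)   # location id for each record
X = rng.normal(size=(n_loc * n_day, 3))          # covariates
y = X @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=n_loc * n_day)

# Leave-one-location-out CV: each fold withholds all records from one
# location, so test data share no location with the training data.
fold_mse = []
for loc in range(n_loc):
    test = locations == loc
    beta, *_ = np.linalg.lstsq(X[~test], y[~test], rcond=None)
    resid = y[test] - X[test] @ beta
    fold_mse.append(np.mean(resid ** 2))

lolo_rmse = float(np.sqrt(np.mean(fold_mse)))
print(f"LOLO CV RMSE: {lolo_rmse:.3f}")
```

By contrast, a record-wise random split would place some days from the held-out location in the training set, letting spatial dependence leak into the evaluation and producing the overly optimistic error estimates noted above.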
Second, it applies this evaluation framework to compare the predictive accuracy of ten machine learning algorithms on the spatial interpolation of maximum daily 8-hour average ozone during the 2008 California wildfires, the first such analysis of ozone during a wildfire event. Gradient boosting and random forest, both ensembles of tree-based models, performed best, achieving the lowest LOLO CV estimates of prediction error.
Third, it introduces treeging, a new machine learning algorithm for space-time prediction that combines the flexible mean structure of regression trees with a kriging covariance model in the base learner of an ensemble algorithm. Coupling a flexible mean structure with the dependence structure of a traditional spatial statistics model yields a prediction tool that performs well across a variety of simulation scenarios and several case studies.
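A treeging-style base learner can be sketched in miniature, under assumptions that are purely illustrative: a depth-1 regression tree (a stump) supplies the mean structure, and simple kriging of the tree's residuals supplies the covariance structure, with an exponential covariance whose range and nugget are picked by hand. None of the names or parameter values below come from the dissertation's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1-D spatial data with a step-like mean plus smooth variation.
s = np.sort(rng.uniform(0, 10, size=80))        # spatial coordinates
y = np.where(s < 5, 1.0, 3.0) + 0.3 * np.sin(s) + rng.normal(scale=0.1, size=80)

def stump_fit(s, y):
    """Depth-1 regression tree: the single split minimizing total SSE."""
    best = (np.inf, s[1], y.mean(), y.mean())
    for c in s[1:-1]:
        left, right = y[s < c], y[s >= c]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if sse < best[0]:
            best = (sse, c, left.mean(), right.mean())
    return best[1:]

def stump_predict(params, s):
    c, mu_left, mu_right = params
    return np.where(s < c, mu_left, mu_right)

def krige(s_train, r_train, s_new, rho=1.5, nugget=0.01):
    """Simple kriging of residuals under an assumed exponential covariance."""
    cov = lambda a, b: np.exp(-np.abs(a[:, None] - b[None, :]) / rho)
    K = cov(s_train, s_train) + nugget * np.eye(len(s_train))
    return cov(s_new, s_train) @ np.linalg.solve(K, r_train)

# Base learner: tree mean, then kriging of what the tree leaves behind.
params = stump_fit(s, y)
resid = y - stump_predict(params, s)
s_new = np.linspace(0, 10, 50)
pred = stump_predict(params, s_new) + krige(s, resid, s_new)
```

An ensemble would fit many such base learners (e.g. to bootstrap resamples) and average their predictions; a single base learner is shown here to keep the division of labor visible: the tree captures abrupt mean shifts, and the kriging step recovers the spatially correlated residual signal.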