Many scientific fields have been transformed by rapid technological progress in data collection, storage, and processing, which has greatly expanded the role of statistics in scientific research. The three chapters of this thesis examine core challenges faced by statisticians engaged in scientific collaborations, where the complexity of the data requires high-dimensional or nonparametric methods, and statistical procedures must leverage the lower-dimensional structure present in the data.
The first chapter concerns the promise and challenge of using large datasets to uncover causal mechanisms. Randomized trials remain the gold standard for inferring causal effects of treatment a century after their introduction by Fisher and Neyman. In this chapter, we examine whether large numbers of auxiliary covariates in a randomized experiment can be leveraged to improve estimates of the treatment effect and increase power. In particular, we investigate Lasso-based adjustments of treatment effect estimates through theory, simulation, and a case study of a randomized trial of the pulmonary artery catheter. Throughout, we avoid imposing a linear model and examine the robustness of the Lasso to violations of traditional assumptions.
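To make the idea concrete, the sketch below shows one common form of Lasso-based covariate adjustment: fit a separate cross-validated Lasso outcome model in each arm, impute both potential outcomes for every unit, and correct the averaged imputed contrast with in-arm residuals. This is a minimal illustration using scikit-learn, under the assumption of a completely randomized design, and is not necessarily the exact estimator analyzed in the chapter; the function name and the boolean treated indicator are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def lasso_adjusted_ate(X, y, treated):
    """Lasso-adjusted ATE estimate for a completely randomized experiment.

    Fits a separate Lasso outcome model in each arm, imputes both
    potential outcomes for every unit, and adds in-arm residual
    corrections. A hypothetical sketch, not the chapter's estimator.
    """
    X_t, y_t = X[treated], y[treated]
    X_c, y_c = X[~treated], y[~treated]

    mu_t = LassoCV(cv=5).fit(X_t, y_t)  # outcome model, treated arm
    mu_c = LassoCV(cv=5).fit(X_c, y_c)  # outcome model, control arm

    # Average imputed treatment contrast over all units ...
    tau = np.mean(mu_t.predict(X) - mu_c.predict(X))
    # ... plus residual corrections within each arm.
    tau += np.mean(y_t - mu_t.predict(X_t)) - np.mean(y_c - mu_c.predict(X_c))
    return tau
```

Because assignment is randomized, the residual corrections keep the estimator consistent even when the Lasso outcome models are misspecified; the covariates serve only to reduce variance.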
The second chapter examines the use of predictive models to elucidate functional properties of the mammalian visual cortex. We investigate the activity of single neurons in area MT when stimulated with natural video. One way to do so is to build encoding models that predict spike rate given an arbitrary natural stimulus. In this work, we develop encoding models that combine a nonlinear feature extraction step with a linear model. The feature extraction step is unsupervised and is based on the principle of sparse coding. We compare this model to one that applies relatively simple, fixed nonlinearities to the outputs of V1-like spatiotemporal filters. We find evidence that some MT cells may be tuned to more complex video features than previously thought.
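The sketch below illustrates the two-stage structure of such an encoding model in scikit-learn: an unsupervised sparse-coding step learns a dictionary whose coefficients act as nonlinear stimulus features, and a linear model then maps those features to spike rate. The data shapes, the patch construction, and the choice of a Poisson link are illustrative assumptions, not the chapter's exact pipeline.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import PoissonRegressor

# Hypothetical data: each row of `patches` is a flattened spatiotemporal
# video patch; `spikes[i]` is the spike count recorded for that patch.
rng = np.random.default_rng(0)
patches = rng.standard_normal((2000, 16 * 16 * 4))
spikes = rng.poisson(1.0, size=2000)

# Unsupervised step: learn a dictionary whose sparse coefficients serve
# as nonlinear features of the stimulus (the sparse-coding principle).
coder = MiniBatchDictionaryLearning(n_components=128, alpha=1.0,
                                    transform_algorithm="lasso_lars",
                                    random_state=0)
codes = coder.fit_transform(patches)

# Supervised step: a linear model (here a Poisson GLM, a natural choice
# for spike counts) maps the sparse codes to the neuron's firing rate.
glm = PoissonRegressor(alpha=1e-3).fit(codes, spikes)
predicted_rate = glm.predict(codes[:5])
```

The appeal of this design is that the expensive, data-hungry nonlinearity is learned without any spike data, so the supervised stage remains a simple, interpretable linear fit.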
The third chapter examines a computational challenge inherent in nonparametric modeling of large datasets. Large datasets are often stored across many machines in a computer cluster, where communication between machines is slow. Hence, nonparametric regression methods should avoid communication of data as much as possible. Random forests, among the most popular nonparametric methods for regression, are not well-suited to distributed architectures. We develop a modification of random forests that leverages ideas from nonparametric regression by local modeling. Our method allows random forests to be trained completely in parallel, without synchronization between machines, with communication of sufficient statistics at test time only. We show that this method can improve the predictive performance of standard random forests even in the single-machine case, and that performance remains strong when the data are distributed.
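The sketch below illustrates the test-time communication pattern under our assumptions: each machine trains a forest on its own shard in parallel, and for a query point ships only the small weighted Gram matrix and cross-moment vector of a forest-weighted local linear fit, which a coordinator aggregates and solves. The function names and the ridge stabilizer are hypothetical; this is a sketch of the idea, not the chapter's exact algorithm.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def local_stats(forest, X_shard, y_shard, x0):
    """Per-machine sufficient statistics for a forest-weighted local
    linear fit at test point x0 (a hypothetical sketch)."""
    # Forest weight of each training point: the fraction of trees in
    # which it lands in the same leaf as the test point.
    leaves = forest.apply(X_shard)               # shape (n, n_trees)
    leaves_x0 = forest.apply(x0.reshape(1, -1))  # shape (1, n_trees)
    w = (leaves == leaves_x0).mean(axis=1)

    # Local linear design centered at the test point, so that the
    # fitted intercept is the prediction at x0.
    Z = np.hstack([np.ones((len(X_shard), 1)), X_shard - x0])
    return Z.T @ (w[:, None] * Z), Z.T @ (w * y_shard)

def predict_from_stats(stats, d, ridge=1e-6):
    """Aggregate the (A, b) pairs shipped from all machines and solve
    the weighted least-squares problem; only (d+1)**2 + (d+1) numbers
    per machine ever cross the network."""
    A = sum(a for a, _ in stats) + ridge * np.eye(d + 1)
    b = sum(b_ for _, b_ in stats)
    return np.linalg.solve(A, b)[0]  # intercept = prediction at x0
```

Each machine would fit its own RandomForestRegressor on its shard with no synchronization during training; the raw data never leave the machine, only the fixed-size statistics returned by local_stats.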