## Scholarly Works (55 results)

Rapidly advancing technologies are transforming the rate at which researchers accumulate information. Large, rich datasets hold the promise of new insights into complex natural phenomena that will help advance the frontier of science. Here we aim to develop new statistics and data science principles, and scalable algorithms, for extracting reliable and reproducible information from these data.

Chapter 1 provides an overview of the work contained in this thesis. It discusses the growing availability of genomic data and the statistical machine learning tools that are being used to provide a systems-level understanding of genomic phenomena.

Chapter 2 introduces the predictability, computability, and stability (PCS) framework. The PCS framework builds on key ideas in machine learning, using predictability as a reality check and evaluating computational considerations in data collection, data storage, and algorithm design. It augments predictability and computability with an overarching stability principle, which expands statistical uncertainty considerations to assess how results vary with respect to choices (or perturbations) made across the data science life cycle. In this chapter, we develop PCS inference through perturbation intervals and PCS hypothesis testing to investigate the reliability of data results. We compare PCS inference with existing methods in high-dimensional sparse linear model simulations, demonstrating that our approach compares favorably to others, in terms of ROC curves, over a wide range of simulation settings. Finally, we propose documentation based on R Markdown, IPython, or Jupyter Notebook, with publicly available, reproducible code and narratives to justify the human choices made throughout an analysis.
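As an illustration of the idea (my own sketch, not the thesis's code), a perturbation interval can be written in a few lines of Python. Here the only perturbation considered is bootstrap resampling of the data, a minimal stand-in for the richer data and model perturbations the PCS framework contemplates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a linear model y = 2x + noise.
n = 200
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(scale=0.5, size=n)

def fit_slope(x, y):
    """Least-squares slope of y on x (no intercept, for simplicity)."""
    return float(np.dot(x, y) / np.dot(x, x))

# Data perturbations: bootstrap resamples of the (x, y) pairs.
slopes = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)
    slopes.append(fit_slope(x[idx], y[idx]))

# A perturbation interval: the central 95% of results across perturbations.
lo, hi = np.quantile(slopes, [0.025, 0.975])
print(f"perturbation interval for the slope: [{lo:.2f}, {hi:.2f}]")
```

A fuller perturbation set would also vary preprocessing choices and model specifications, collecting the fitted statistic under each choice before taking quantiles.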

As an example of the PCS framework in practice, Chapter 3 develops the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as Random Forests (RF). We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, 80% of the pairwise transcription factor interactions that iRF identified as stable have previously been reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th- and 6th-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
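The outer loop of iRF, which reweights the features a forest may split on using the previous fit's importances, can be caricatured in Python. In this hypothetical simplification, a squared-correlation score stands in for RF's Gini importance and each "tree" is reduced to one importance evaluation on a weighted feature sample; it illustrates only how iterative reweighting concentrates on informative features:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: only features 0 and 1 carry signal.
n, p = 500, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=n)

def importance(X_sub, y):
    """Placeholder for Random Forest feature importance: squared
    correlation of each selected feature with the response."""
    Xc = X_sub - X_sub.mean(axis=0)
    yc = y - y.mean()
    corr = Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return corr ** 2

# iRF outer loop: sample features in proportion to the previous
# iteration's importances, then re-estimate importance.
w = np.ones(p) / p
for _ in range(5):
    imp = np.zeros(p)
    for _ in range(50):                      # 50 "trees" per iteration
        feats = rng.choice(p, size=3, p=w)   # feature-weighted sampling
        np.add.at(imp, feats, importance(X[:, feats], y))
    w = imp / imp.sum()

print("final feature weights:", np.round(w, 2))
```

In the real algorithm the reweighted forests are then mined for stable high-order interactions on their decision paths; the sketch stops at the reweighting.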

Chapter 4 refines iRF to explicitly map responses as a function of interacting features. Our proposed method, signed iRF (siRF), describes "subsets" of rules that frequently occur on RF decision paths; we refer to these rule subsets as signed interactions. RF decision paths containing the same signed interaction not only share a set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We formulate stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate siRF in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of spatial gene expression patterns, siRF recovered all 11 reported links in the gap gene network. In the case of enhancer activity, siRF discovered rules that identify enhancer elements in Drosophila embryos with high precision, suggesting candidate biological mechanisms for experimental studies. By refining the process of interaction discovery, siRF has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. In statistics, sampling methods are useful for both estimation and inference, in problems such as estimating expectations of desired quantities, computing probabilities of rare events, gauging volumes of particular sets, exploring posterior distributions, and obtaining credible intervals.

In the face of massive, high-dimensional data, both computational efficiency and strong statistical guarantees are increasingly important in modern statistical and machine learning applications. This thesis, centered on sampling algorithms, considers fundamental questions about their computational and statistical guarantees: How do we design a fast sampling algorithm, and how long should it be run? What statistical learning guarantees do these algorithms have? Are there trade-offs between computation and learning?

To answer these questions, we first establish non-asymptotic convergence guarantees for popular MCMC sampling algorithms in the Bayesian literature: the Metropolized Random Walk, the Metropolis-adjusted Langevin algorithm, and Hamiltonian Monte Carlo. To address a number of technical challenges that arise en route, we develop results based on the conductance profile in order to prove quantitative convergence guarantees for general continuous-state-space Markov chains. Second, to handle a large class of constrained sampling problems, we introduce two new algorithms, the Vaidya and John walks, to sample from polytope-constrained distributions with convergence guarantees. Third, we prove fundamental trade-offs between the statistical learning performance and the convergence rate of any iterative learning algorithm, including sampling algorithms. These trade-off results show that an algorithm that is too stable cannot converge too quickly, and vice versa. Finally, to help neuroscientists analyze massive amounts of brain data, we develop DeepTune, a stability-driven visualization and interpretation framework, based on optimization and sampling, for neural-network-based models of neurons in the visual cortex.
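Of the samplers analyzed, the Metropolized Random Walk is the simplest to write down. The sketch below (an illustration, not the thesis's implementation) targets a standard two-dimensional Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)

def log_density(x):
    """Log-density of the target, here a standard 2-D Gaussian."""
    return -0.5 * np.dot(x, x)

def metropolized_random_walk(n_steps, step_size, x0):
    """Metropolized Random Walk: Gaussian proposals with an
    accept/reject correction so the chain targets log_density."""
    x = np.array(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        proposal = x + step_size * rng.normal(size=x.shape)
        # Accept with probability min(1, pi(proposal) / pi(x)).
        if np.log(rng.uniform()) < log_density(proposal) - log_density(x):
            x = proposal
        samples.append(x.copy())
    return np.array(samples)

samples = metropolized_random_walk(5000, step_size=0.8, x0=[3.0, -3.0])
burned = samples[1000:]  # discard burn-in before estimating moments
print("sample mean:", np.round(burned.mean(axis=0), 2))
```

The non-asymptotic theory in the thesis answers precisely the question this toy dodges with a fixed burn-in: how many steps such a chain needs before its samples are close to the target.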

Many scientific fields have been changed by rapid technological progress in data collection, storage, and processing, which has greatly expanded the role of statistics in scientific research. The three chapters of this thesis examine core challenges faced by statisticians engaged in scientific collaborations, where the complexity of the data requires high-dimensional or nonparametric methods, and statistical methods need to leverage the lower-dimensional structure that exists in the data.

The first chapter concerns the promise and challenge of using large datasets to uncover causal mechanisms. Randomized trials remain the gold standard for inferring causal effects of treatment a century after their introduction by Fisher and Neyman. In this chapter, we examine whether large numbers of auxiliary covariates in a randomized experiment can be leveraged to improve estimates of the treatment effect and increase power. In particular, we investigate Lasso-based adjustments of treatment effects through theory, simulation, and a case study of a randomized trial of the pulmonary artery catheter. In our investigation, we avoid imposing a linear model and examine the robustness of the Lasso to violations of traditional assumptions.
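A minimal version of Lasso-based adjustment can be sketched as follows. This is a toy under simplifying assumptions (a hand-rolled coordinate-descent Lasso, a single pooled fit on the covariates, and a known constant treatment effect), not the estimator studied in the chapter:

```python
import numpy as np

rng = np.random.default_rng(3)

def lasso(X, y, lam, n_iter=200):
    """Plain coordinate-descent Lasso with soft-thresholding."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual excluding feature j's current contribution.
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / (X[:, j] ** 2).mean()
    return beta

# Toy randomized experiment: true effect tau = 1, and two of the
# twenty covariates explain most of the outcome variation.
n, p = 400, 20
X = rng.normal(size=(n, p))
T = rng.integers(0, 2, size=n)                  # random assignment
y = 1.0 * T + X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=n)

# Unadjusted estimate: difference in means between arms.
unadjusted = y[T == 1].mean() - y[T == 0].mean()

# Lasso adjustment: residualize the outcome on the covariates,
# then difference the residual means between arms.
beta = lasso(X, y - y.mean(), lam=0.1)
resid = y - X @ beta
adjusted = resid[T == 1].mean() - resid[T == 0].mean()
print(f"unadjusted: {unadjusted:.2f}  lasso-adjusted: {adjusted:.2f}")
```

Because assignment is randomized, both estimators are unbiased; the adjusted one has smaller variance when the covariates explain outcome noise, which is the gain the chapter quantifies.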

The second chapter examines the use of predictive models to elucidate functional properties of the mammalian visual cortex. We investigate the activity of single neurons in area MT when stimulated with natural video. One way to investigate single-neuron activity is to build encoding models that predict spike rate given an arbitrary natural stimulus. In this work, we develop encoding models that combine a nonlinear feature extraction step with a linear model. The feature extraction step is unsupervised, and is based on the principle of sparse coding. We compare this model to one that applies relatively simple, fixed nonlinearities to the outputs of V1-like spatiotemporal filters. We find evidence that some MT cells may be tuned to more complex video features than previously thought.

The third chapter examines a computational challenge inherent in nonparametric modeling of large datasets. Large datasets are often stored across many machines in a computer cluster, where communication between machines is slow; nonparametric regression methods should therefore avoid communicating data as much as possible. Random forests, among the most popular nonparametric methods for regression, are not well suited to distributed architectures. We develop a modification of random forests that leverages ideas from local modeling in nonparametric regression. Our method allows random forests to be trained completely in parallel, without synchronization between machines, with communication of sufficient statistics at test time only. We show that this method can improve the predictive performance of standard random forests even in the single-machine case, and that performance remains strong when the data are distributed.
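The flavor of the approach, local models trained independently per machine with only per-query sufficient statistics crossing the network, can be caricatured with kernel local averaging in place of trees (a hypothetical simplification; the chapter's method is tree-based):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: y = sin(x) + noise, scattered across 4 "machines".
def make_shard(n):
    x = rng.uniform(-3, 3, size=n)
    return x, np.sin(x) + 0.1 * rng.normal(size=n)

shards = [make_shard(250) for _ in range(4)]

def local_stats(x_shard, y_shard, x_query, bandwidth=0.3):
    """Sufficient statistics a machine sends for one query point:
    a kernel-weighted sum of responses and the total kernel weight."""
    w = np.exp(-0.5 * ((x_shard - x_query) / bandwidth) ** 2)
    return w @ y_shard, w.sum()

def distributed_predict(x_query):
    # Each machine computes its statistics independently (in parallel);
    # only these two numbers per machine cross the network.
    stats = [local_stats(xs, ys, x_query) for xs, ys in shards]
    num = sum(s for s, _ in stats)
    den = sum(w for _, w in stats)
    return num / den

print(f"prediction at x=1.0: {distributed_predict(1.0):.2f} "
      f"(truth sin(1.0)={np.sin(1.0):.2f})")
```

The key property mirrored here is that the combined prediction equals what a single machine holding all the data would compute, even though no raw data ever moves.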

This dissertation discusses how predictive models are being used for scientific inquiry. Statistical and computational advances have given rise to high-dimensional models that can be fit on relatively small samples yet still predict the behavior of complex systems well. Scientists try to use such models to learn about complex biological systems, but it is not always clear how prediction accuracy translates into understanding of the underlying system. In the chapters below, I present different approaches to learning from predictive models in bioinformatics and neuroscience. In each of these collaborative works, we tailor models that both fit well and are interpretable in the context of the scientific questions.

In the first chapter, we fit and compare predictive models for the GC-content bias, an important confounder in DNA sequencing. We develop a high-resolution model that treats each base pair in the genome as a separate example; this allows us to compare many representations of GC-content and identify which representation best predicts the variation in coverage. To deal with the huge volumes of data, we develop a new conditional dependence measure that efficiently compares different models. Selecting the model that maximizes this dependence reveals a recurring association with an experimental parameter: the selected model in each sample corresponds to a window size almost identical to the average size of DNA fragments in that sample. This recurring result can be used both for correcting the bias and for learning about its causes.

In the next chapter, we propose a new estimator for interpreting the prediction-accuracy results of models for neural activity in the visual cortex. Our shuffle estimator targets the explainable variance, the proportion of signal in the measured response, while accounting for auto-correlation in the noise. Re-analyzing models of functional MRI voxels within visual area V1, we observe a strong linear correlation between the signal-to-noise ratio and prediction accuracy.
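The bias that such an estimator must correct can be seen in a toy with independent trial noise (the actual shuffle estimator also handles auto-correlated noise, which this sketch does not attempt):

```python
import numpy as np

rng = np.random.default_rng(5)

# Repeated-trial responses: R repeats of T timepoints,
# each response = fixed signal + independent noise.
T, R = 200, 10
signal = rng.normal(scale=1.0, size=T)
noise = rng.normal(scale=2.0, size=(R, T))
responses = signal + noise                     # shape (R, T)

# Naive plug-in: the variance of the trial-averaged response
# overshoots the signal variance by noise_var / R.
mean_resp = responses.mean(axis=0)
naive = mean_resp.var()

# Bias-corrected signal variance: subtract the estimated noise
# variance divided by the number of repeats.
noise_var = responses.var(axis=0, ddof=1).mean()
corrected = naive - noise_var / R

# Explainable variance: share of single-trial variance that is signal.
explainable = corrected / (corrected + noise_var)
print(f"naive: {naive:.2f}  corrected: {corrected:.2f}  "
      f"explainable: {explainable:.2f}")
```

The explainable variance gives a ceiling for prediction accuracy: no encoding model can explain the noise share, so accuracy should be judged against this quantity rather than against 1.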

In the final chapter, we analyze neurophysiology data recorded from visual area V4 and present a full cycle of scientific investigation using prediction models in neuroscience. Whereas the previous chapters developed metrics for evaluating feature sets and prediction models, this chapter takes an extra leap: we use optimization algorithms together with prior scientific knowledge to propose a new feature set. We then fit regularized linear models based on this representation that generalize well to a validation dataset. Finally, novel visualization and model-summary techniques help interpret the resulting prediction models, revealing rich patterns of activity in the different neurons and unexpected categories of neurons.