## Scholarly Works (57 results)

This dissertation discusses how predictive models are used for scientific inquiry. Statistical and computational advances have given rise to high-dimensional models that can be fit on relatively small samples and still predict the behavior of complex systems well. Scientists try to use such models to learn about complex biological systems, but it is not always clear how prediction accuracy translates into understanding of the underlying system. In the chapters below, I present different approaches to learning from predictive models in bioinformatics and neuroscience. In each of these collaborative works, we tailor models that both fit well and remain interpretable in the context of the scientific questions.

In the first chapter, we fit and compare predictive models for the GC-content bias, an important confounder in DNA sequencing. We develop a high-resolution model that treats each base pair in the genome as a separate example; this allows us to compare many representations of GC-content and identify which representation best predicts the variation in coverage. To deal with the huge volumes of data, we develop a new conditional dependence measure that efficiently compares different models. Selecting the model that maximizes this dependence reveals a recurring association with an experimental parameter: the selected model in each sample corresponds to a window size almost identical to the average size of the DNA fragments in that sample. This recurring result can be used both to correct the bias and to learn about its causes.
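To make the representation-comparison step concrete, here is a minimal, illustrative sketch (not the dissertation's code) of windowed GC-content features: each candidate window size yields one column of per-base GC fractions, which can then be compared as predictors of coverage.

```python
# Illustrative sketch: GC-content features at several candidate window
# sizes, one column per window size and one row per base. The toy sequence
# and all names are made up for this example.

def gc_fraction(seq, center, half_window):
    """Fraction of G/C bases in a window around position `center`."""
    lo = max(0, center - half_window)
    hi = min(len(seq), center + half_window + 1)
    window = seq[lo:hi]
    return sum(base in "GC" for base in window) / len(window)

def gc_features(seq, window_sizes):
    """One GC-content column per candidate window size."""
    return {w: [gc_fraction(seq, i, w // 2) for i in range(len(seq))]
            for w in window_sizes}

features = gc_features("ACGTGGCCATAT", window_sizes=[3, 5])
```

Regressing per-base coverage on each column separately, and comparing the fits, is the kind of window-size comparison described above.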

In the next chapter, we propose a new estimator for interpreting the prediction accuracy of models of neural activity in the visual cortex. Our shuffle estimator targets the explainable variance, the proportion of signal in the measured response, while accounting for auto-correlation in the noise. Re-analyzing models of functional MRI voxels within visual area V1, we observe a strong linear correlation between the signal-to-noise ratio and prediction accuracy.
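As a point of reference for what the shuffle estimator targets, the following is a naive, repeat-based sketch of explainable variance. Unlike the shuffle estimator described above, it assumes the noise is independent across trials; all names are illustrative.

```python
# Naive sketch of explainable variance: signal power as a fraction of total
# power, estimated from repeated presentations of the same stimulus.
# Assumes independent noise across repeats (the shuffle estimator does not).
import numpy as np

def explainable_variance(responses):
    """responses: (n_repeats, n_timepoints) responses to a repeated stimulus."""
    n_rep = responses.shape[0]
    noise_var = responses.var(axis=0, ddof=1).mean()       # trial-to-trial noise
    mean_resp = responses.mean(axis=0)
    # variance of the trial-averaged response, corrected for residual noise
    signal_var = max(mean_resp.var(ddof=1) - noise_var / n_rep, 0.0)
    total = signal_var + noise_var
    return signal_var / total if total > 0 else 0.0
```

A perfectly repeatable response yields an explainable variance of 1, and pure noise yields 0.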

In the final chapter we analyze neurophysiology data recorded from visual area V4 and present a full cycle of scientific investigation using prediction models in neuroscience. Whereas the previous chapters developed metrics for evaluating feature sets and prediction models, this chapter takes an extra leap: we use optimization algorithms together with prior scientific knowledge to propose a new feature set. We then fit regularized linear models based on this representation that generalize well to a validation data set. Finally, novel visualization and model-summary techniques help interpret the resulting prediction models, revealing rich patterns of activity in the different neurons and unexpected categories of neurons.

Rapidly evolving technologies are transforming the rate at which researchers accumulate information. Large, rich datasets hold the promise of new insights into complex natural phenomena that will help advance the frontier of science. Here we aim to develop new statistical and data science principles, along with scalable algorithms, for extracting reliable and reproducible information from these data.

Chapter 1 provides an overview of the work contained in this thesis. It discusses the growing availability of genomic data and the statistical machine learning tools that are being used to provide a systems-level understanding of genomic phenomena.

Chapter 2 introduces the predictability, computability, and stability (PCS) framework. The PCS framework builds on key ideas in machine learning, using predictability as a reality check and evaluating computational considerations in data collection, data storage, and algorithm design. It augments predictability and computability with an overarching stability principle, which expands statistical uncertainty considerations to assess how results vary with respect to choices (or perturbations) made across the data science life cycle. In this chapter, we develop PCS inference through perturbation intervals and PCS hypothesis testing to investigate the reliability of data results. We compare PCS inference with existing methods in high-dimensional sparse linear model simulations, demonstrating that our approach compares favorably to others, in terms of ROC curves, over a wide range of simulation settings. Finally, we propose documentation based on R Markdown, IPython, or Jupyter notebooks, with publicly available, reproducible code and narratives to justify the human choices made throughout an analysis.
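A minimal sketch of the perturbation-interval idea, using bootstrap resampling as the data perturbation. PCS considers a broader set of perturbations across the data science life cycle; this toy version is only meant to make the mechanics concrete, and all names are illustrative.

```python
# Toy PCS-style perturbation interval: refit an estimator under many data
# perturbations (here, bootstrap resamples) and report the spread of the
# resulting estimates.
import random

def perturbation_interval(data, estimator, n_perturb=200, alpha=0.1, seed=0):
    rng = random.Random(seed)
    estimates = sorted(
        estimator([rng.choice(data) for _ in data])  # one data perturbation
        for _ in range(n_perturb)
    )
    lo = estimates[int(alpha / 2 * n_perturb)]
    hi = estimates[int((1 - alpha / 2) * n_perturb) - 1]
    return lo, hi

# e.g. a 90% perturbation interval for a sample mean
interval = perturbation_interval(list(range(100)), lambda d: sum(d) / len(d))
```

Swapping the resampling step for other perturbations (alternative cleaning choices, model variants) gives the more general PCS flavor.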

As an example of the PCS framework in practice, Chapter 3 develops the iterative random forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as random forests (RF). We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, 80% of the pairwise transcription factor interactions iRF identified as stable had been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation and identified novel fifth- and sixth-order interactions, indicative of multivalent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
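The iterative reweighting at the heart of iRF can be caricatured as follows. This sketch subsamples whole feature columns with probability proportional to the current weights and refits an off-the-shelf random forest, whereas real iRF weights the feature sampling inside each tree split; all names are illustrative, not the published algorithm.

```python
# Much-simplified caricature of iRF's iterative reweighting: fit a forest,
# take its feature importances as sampling weights, draw a feature subset
# with probability proportional to those weights, and refit.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def iterative_forest_weights(X, y, n_iter=3, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    weights = np.full(n_feat, 1.0 / n_feat)
    for it in range(n_iter):
        # draw a feature subset, favoring features with high current weight
        keep = rng.choice(n_feat, size=max(2, n_feat // 2),
                          replace=False, p=weights)
        rf = RandomForestClassifier(n_estimators=100, random_state=it)
        rf.fit(X[:, keep], y)
        new_w = np.full(n_feat, 1e-6)   # small floor keeps every feature selectable
        new_w[keep] += rf.feature_importances_
        weights = new_w / new_w.sum()
    return weights
```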

Chapter 4 refines iRF to explicitly map responses as a function of interacting features. Our proposed method, signed iRF (siRF), describes subsets of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. RF decision paths containing the same signed interaction share not only a set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We formulate stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate siRF in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of spatial gene expression patterns, siRF recovered all 11 reported links in the gap gene network. In the case of enhancer activity, siRF discovered rules that identify enhancer elements in Drosophila embryos with high precision, suggesting candidate biological mechanisms for experimental studies. By refining the process of interaction discovery, siRF has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.

Atmospheric aerosols are solid particles and liquid droplets, usually smaller than the diameter of a human hair, that can be found drifting in the air in every ecosystem on Earth, with significant impacts on human health and our climate. Understanding the spatial and temporal distribution of different atmospheric aerosols is therefore an important first step toward decoding the complex system of aerosols and, further, their effects on public health and climate.

The development of remote-sensing radiometers provides a powerful tool for monitoring the amount of atmospheric aerosols as well as their composition. Radiometers aboard satellites measure the amount of electromagnetic solar radiation. The amount of atmospheric aerosols is quantified by aerosol optical depth (AOD), defined as the amount of solar radiation that aerosols scatter and absorb in the atmosphere and thereby prevent from reaching the Earth's surface. Despite efforts to improve remote-sensing instruments and a great demand for detailed profiles of aerosol spatial distribution, methods that provide AOD estimates at reasonably fine resolution are lacking. The quantitative uncertainties in the amount of aerosols, and especially in aerosol composition, limit the utility of traditional methods for aerosol retrieval at fine resolution.

In Chapters 2 and 3 of this thesis, we use statistical methods to estimate aerosol optical depth from remotely sensed radiation. A Bayesian hierarchy proves useful for modeling the complicated interactions among aerosols of different amounts and compositions over a large spatial area. Based on the hierarchical model, Chapter 2 estimates and validates aerosol optical depth using Markov chain Monte Carlo methods, while Chapter 3 resorts to an optimization-based approach for faster computation. In Chapter 4, we extend our focus from aerosol amount to aerosol composition.

Chapter 1 briefly reviews the characteristics of atmospheric aerosols, including the different types of aerosols and their major impacts on human health. We also introduce a major remote-sensing instrument, NASA's Multi-angle Imaging SpectroRadiometer (MISR), which collects the observations on which our studies are based. Currently, the MISR operational aerosol retrieval algorithm provides estimates of aerosol optical depth at a spatial resolution of 17.6 km.

In Chapter 2, we embed MISR's operational weighted least squares criterion and its forward calculations for aerosol optical depth retrievals in a likelihood framework. We further expand it into a hierarchical Bayesian model to adapt to a finer spatial resolution of 4.4 km. To take advantage of the spatial smoothness of aerosol optical depth, our method borrows strength from data in neighboring areas by postulating a Gaussian Markov Random Field prior for aerosol optical depth. Our model treats aerosol optical depth and the mixing vectors of different aerosol types as continuous variables. Inference is then carried out using Metropolis-within-Gibbs sampling, and retrieval uncertainties are quantified by posterior variability. We also develop a parallel Markov chain Monte Carlo algorithm to improve computational efficiency. We assess retrieval performance using ground-based measurements from the AErosol RObotic NETwork (AERONET) and satellite images from Google Earth. Based on case studies in the greater Beijing area, China, we show that 4.4 km resolution can improve both the accuracy and the coverage of remotely sensed aerosol retrievals, as well as our understanding of the spatial and seasonal behavior of aerosols. This is particularly important during high-AOD events, which often indicate severe air pollution.
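A minimal sketch of the smoothness-inducing GMRF prior, written for a 1-D chain of grid cells rather than the 2-D 4.4 km grid used in the thesis: the precision matrix couples neighboring cells, so constant fields incur zero penalty and rough fields are penalized. The scalar `tau` and all names are illustrative.

```python
# Sketch: first-order Gaussian Markov Random Field prior over AOD values on
# a 1-D chain of grid cells. The log-prior rewards spatially smooth fields.
import numpy as np

def chain_precision(n, tau=1.0):
    """Precision matrix Q: Q[i,i] = number of neighbors, Q[i,j] = -1 for neighbors."""
    Q = np.zeros((n, n))
    for i in range(n - 1):
        Q[i, i] += 1.0
        Q[i + 1, i + 1] += 1.0
        Q[i, i + 1] = Q[i + 1, i] = -1.0
    return tau * Q

def gmrf_log_prior(x, Q):
    """Unnormalized log-density: -0.5 * x' Q x = -0.5 * tau * sum of squared jumps."""
    return -0.5 * x @ Q @ x
```

For a constant field the quadratic form vanishes, while every squared jump between neighboring cells adds `tau` worth of penalty.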

Chapter 3 of this thesis continues to improve our statistical aerosol retrievals, seeking better accuracy and more efficient computation by switching to an optimization-based approach. We first establish objective functions for aerosol optical depth and aerosol composition based on the MISR operational weighted least squares criterion and its forward calculations. Our method again borrows strength from aerosol spatial smoothness by adding penalty terms to the objective functions. The penalties correspond to a Gaussian Markov Random Field prior for aerosol optical depth and a Dirichlet prior for aerosol mixing vectors under our hierarchical Bayesian scheme; the optimization-based approach then corresponds to Bayesian maximum a posteriori (MAP) estimation. Our MAP retrieval algorithm runs almost 60 times faster than the Bayesian retrieval algorithm presented in Chapter 2. To represent the increasing heterogeneity of urban aerosol sources, our model continues to expand the pre-fixed aerosol mixtures used in the MISR operational algorithm by treating aerosol mixing vectors as continuous variables. Our retrievals are again validated using ground-based AERONET measurements. Case studies in the greater Beijing and Zhengzhou areas of China confirm that 4.4 km resolution can improve the accuracy and spatial coverage of remotely sensed retrievals and enhance our understanding of the spatial behavior of aerosols.
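The MAP idea can be sketched as a penalized objective: a weighted least squares data-fit term from the forward model plus a smoothness penalty corresponding to the GMRF prior. Here `forward`, `weights`, and `lambda_s` are illustrative stand-ins for the MISR forward calculations and tuning constants, and the penalty is written for a 1-D chain of cells.

```python
# Sketch of the MAP objective: weighted least squares data fit plus a
# GMRF-style smoothness penalty. Minimizing this over `aod` is the
# optimization-based counterpart of posterior sampling.
import numpy as np

def map_objective(aod, radiances, forward, weights, lambda_s):
    """Negative log-posterior up to constants, for a 1-D chain of cells."""
    residual = radiances - forward(aod)
    data_fit = np.sum(weights * residual ** 2)       # WLS criterion
    smoothness = lambda_s * np.sum(np.diff(aod) ** 2)  # GMRF penalty
    return data_fit + smoothness
```

The minimizer coincides with the posterior mode of the Bayesian model, which is why the optimization route can trade posterior uncertainty for speed.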

When comparing our aerosol retrievals to the extensive ground-based measurements collected in Baltimore, Maryland, we encountered greater uncertainty in aerosol composition, a result of both Baltimore's complex terrain and its varied aerosol emission sources. Chapter 4, as a result, extends the flexibility of our previous aerosol retrievals by incorporating a complete set of the eight commonly observed aerosol types. The consequent rise in model complexity is met by a warm-start Markov chain Monte Carlo sampling scheme. We first design two Markov sub-chains, each representing an aerosol mixture containing only four of the commonly observed aerosol types. Combining the samples generated by these two sub-chains, we propose an initialization for the Markov chain that contains all eight types. Partial information on the interactions of different aerosol types in the sub-chain samples proves useful for choosing a more efficient initial point for the complete Markov chain. Faster computation is achieved without compromising retrieval accuracy or the spatial resolution of the estimated aerosol optical depth. Finally, through case studies of aerosol retrievals for the Baltimore area, we explore the potential of remotely sensed retrievals to improve our understanding of aerosol composition.

Spatial gene expression data enable the detection of local covariability and are extremely useful for identifying local gene interactions during normal development. The abundance of spatial expression data in recent years has led to the modeling and analysis of regulatory networks. The inherent complexity of such data makes it a challenge to extract biological information. In the first part of the thesis, we developed staNMF, a method that combines a dictionary learning algorithm, nonnegative matrix factorization (NMF), with a new stability-driven criterion for selecting the number of dictionary atoms. When applied to a set of *Drosophila* early embryonic spatial gene expression images, one of the largest datasets of its kind, staNMF identified a dictionary with 21 atoms, which we call *principal patterns* (PP). Providing a compact yet biologically interpretable representation of *Drosophila* expression patterns, the PP are comparable to a fate map generated experimentally by laser ablation and show exceptional promise as a data-driven alternative to manual annotations. Our analysis mapped genes to cell-fate programs and assigned putative biological roles to uncharacterized genes. Furthermore, we used the PP to generate local transcription factor (TF) regulatory networks. Spatially local correlation networks (SLCN) were constructed for six PP that span the embryonic anterior-posterior axis. Using a two-tailed 5% cut-off on correlation, we reproduced 10 of the 11 links in the well-studied gap gene network. The performance of the PP on the *Drosophila* data suggests that staNMF provides informative decompositions and constitutes a useful computational lens through which to extract biological insight from complex and often noisy gene expression data.
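The stability-driven selection of the number of atoms can be sketched as follows: fit NMF from several random initializations and score how reproducible the learned dictionaries are, here with a crude best-match cosine similarity. staNMF's actual dissimilarity score differs in detail; this is only an illustration, and all names are assumptions.

```python
# Sketch of stability-driven model selection: for a candidate number of
# atoms k, fit NMF from several random starts and score how well the
# learned atoms match across runs (higher = more stable choice of k).
import numpy as np
from sklearn.decomposition import NMF

def dictionary_stability(X, k, n_runs=3, seed=0):
    dicts = []
    for r in range(n_runs):
        model = NMF(n_components=k, init="random",
                    random_state=seed + r, max_iter=500)
        W = model.fit_transform(X)               # columns of W are the atoms
        W = W / (np.linalg.norm(W, axis=0) + 1e-12)
        dicts.append(W)
    scores = []
    for a in range(n_runs):
        for b in range(a + 1, n_runs):
            C = np.abs(dicts[a].T @ dicts[b])    # cosine similarity of atoms
            scores.append(C.max(axis=1).mean())  # best match per atom
    return float(np.mean(scores))
```

Scanning `k` and choosing the value with the most reproducible dictionaries mirrors the stability criterion that selected 21 principal patterns.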

The biological interpretability of the NMF-derived dictionary motivated us to understand analytically why dictionary learning works. In particular, if the observed data are generated from a ground-truth dictionary, under what conditions can dictionary learning recover the true dictionary? In the second part of the thesis, we studied the local correctness, or *local identifiability*, of a particular dictionary learning formulation with the $l_1$-norm objective function. Suppose we observe $N$ data points $x_i \in \mathbb{R}^K$ for $i = 1, \ldots, N$, where the $x_i$ are i.i.d. random linear combinations of the $K$ columns of a square and invertible dictionary $D_0 \in \mathbb{R}^{K \times K}$. We assumed that the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. For the population case, we established a sufficient and almost necessary condition for $D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves on the sufficient condition of Gribonval and Schnass (2010). Moreover, we demonstrated that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-products at most $\mu \in [0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Finally, we showed that our local identifiability results translate to the finite-sample case with high probability provided $N = O(K \log K)$.
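For concreteness, one standard way to write this $l_1$ criterion (the thesis's normalization may differ in detail) uses the fact that a square invertible dictionary $D$ assigns each data point the coefficients $\alpha_i = D^{-1} x_i$:

$$
F(D) \;=\; \frac{1}{N} \sum_{i=1}^{N} \left\| D^{-1} x_i \right\|_1 ,
$$

so local identifiability asks whether the reference dictionary $D_0$ is a local minimum of $F$, or, in the population case, of $\mathbb{E}\,\| D^{-1} x \|_1$, over suitably normalized dictionaries.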

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. In statistics, sampling methods are useful for both estimation and inference, including estimating expectations of quantities of interest, computing probabilities of rare events, gauging volumes of particular sets, exploring posterior distributions, and obtaining credible intervals.

In the face of massive, high-dimensional data, both computational efficiency and good statistical guarantees are increasingly important in modern statistical and machine learning applications. In this thesis, centered around sampling algorithms, we consider fundamental questions about their computational and statistical guarantees: How can a fast sampling algorithm be designed, and how long should it be run? What statistical learning guarantees do these algorithms have? Are there trade-offs between computation and learning?

To answer these questions, we first establish non-asymptotic convergence guarantees for popular MCMC sampling algorithms from the Bayesian literature: the Metropolized random walk, the Metropolis-adjusted Langevin algorithm, and Hamiltonian Monte Carlo. To address a number of technical challenges that arise en route, we develop results based on the conductance profile that yield quantitative convergence guarantees for general continuous-state-space Markov chains. Second, to confront a large class of constrained sampling problems, we introduce two new algorithms, the Vaidya and John walks, for sampling from polytope-constrained distributions with convergence guarantees. Third, we prove fundamental trade-offs between the statistical learning performance and the convergence rate of any iterative learning algorithm, including sampling algorithms. These trade-off results show that an algorithm that is too stable cannot converge too quickly, and vice versa. Finally, to help neuroscientists analyze their massive amounts of brain data, we develop DeepTune, a stability-driven visualization and interpretation framework, based on optimization and sampling, for neural-network-based models of neurons in the visual cortex.
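For illustration, here is a minimal implementation of one of the samplers analyzed in this setting, the Metropolis-adjusted Langevin algorithm: a gradient-informed Gaussian proposal followed by a Metropolis-Hastings correction. The step size `h` and the standard-Gaussian target are illustrative choices, not the thesis's experimental settings.

```python
# Minimal MALA sketch: propose with a Langevin (gradient) step plus Gaussian
# noise, then accept or reject with a Metropolis-Hastings correction.
import numpy as np

def mala(grad_log_p, log_p, x0, h=0.1, n_steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples, accepts = [], 0
    for _ in range(n_steps):
        prop = x + h * grad_log_p(x) + np.sqrt(2 * h) * rng.normal(size=x.shape)
        # log-densities of the (asymmetric) forward and backward proposals
        fwd = -np.sum((prop - x - h * grad_log_p(x)) ** 2) / (4 * h)
        bwd = -np.sum((x - prop - h * grad_log_p(prop)) ** 2) / (4 * h)
        if np.log(rng.uniform()) < log_p(prop) - log_p(x) + bwd - fwd:
            x, accepts = prop, accepts + 1
        samples.append(x.copy())
    return np.array(samples), accepts / n_steps

# sample from a 2-D standard Gaussian
samples, accept_rate = mala(lambda x: -x, lambda x: -0.5 * np.sum(x ** 2),
                            np.zeros(2))
```

The non-asymptotic guarantees discussed above bound how many such steps are needed before the chain's distribution is close to the target.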

Many scientific fields have been changed by rapid technological progress in data collection, storage, and processing. This has greatly expanded the role of statistics in scientific research. The three chapters of this thesis examine core challenges faced by statisticians engaged in scientific collaborations, where the complexity of the data requires high-dimensional or nonparametric methods, and statistical methods need to leverage the lower-dimensional structure that exists in the data.

The first chapter concerns the promise and challenge of using large datasets to uncover causal mechanisms. Randomized trials remain the gold standard for inferring causal effects of treatment a century after their introduction by Fisher and Neyman. In this chapter, we examine whether large numbers of auxiliary covariates in a randomized experiment can be leveraged to improve estimates of the treatment effect and increase power. In particular, we investigate Lasso-based adjustments of treatment effects through theory, simulation, and a case study of a randomized trial of the pulmonary artery catheter. In our investigation, we avoid imposing a linear model and examine the robustness of the Lasso to violations of traditional assumptions.
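One simple flavor of Lasso-based adjustment can be sketched as follows: regress the outcome on covariates with the Lasso, then difference the mean residuals between arms. This is an illustrative simplification, not the chapter's exact estimator, and all names are assumptions.

```python
# Illustrative sketch of Lasso-based covariate adjustment in a randomized
# trial: fit a pooled Lasso of outcome on covariates, then compare the mean
# residuals of treated and control units.
import numpy as np
from sklearn.linear_model import Lasso

def lasso_adjusted_ate(X, y, treated, alpha=0.1):
    """treated: boolean array marking units assigned to treatment."""
    model = Lasso(alpha=alpha).fit(X, y)
    resid = y - model.predict(X)
    return resid[treated].mean() - resid[~treated].mean()
```

Because treatment is randomized, subtracting a covariate-based prediction removes chance imbalance and reduces variance without requiring the linear model to be correct.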

The second chapter examines the use of predictive models to elucidate functional properties of the mammalian visual cortex. We investigate the activity of single neurons in area MT when stimulated with natural video. One way to investigate single-neuron activity is to build encoding models that predict spike rate given an arbitrary natural stimulus. In this work, we develop encoding models that combine a nonlinear feature extraction step with a linear model. The feature extraction step is unsupervised, and is based on the principle of sparse coding. We compare this model to one that applies relatively simple, fixed nonlinearities to the outputs of V1-like spatiotemporal filters. We find evidence that some MT cells may be tuned to more complex video features than previously thought.

The third chapter examines a computational challenge inherent in nonparametric modeling of large datasets. Large datasets are often stored across many machines in a computer cluster, where communication between machines is slow. Hence, nonparametric regression methods should avoid communicating data as much as possible. Random forests, among the most popular nonparametric methods for regression, are not well suited to distributed architectures. We develop a modification of random forests that leverages ideas from nonparametric regression by local modeling. Our method allows random forests to be trained completely in parallel, without synchronization between machines, with communication of sufficient statistics at test time only. We show that this method can improve the predictive performance of standard random forests even in the single-machine case, and that performance remains strong when the data are distributed.