Scholarly Works (90 results)

Article
Peer Reviewed

Three principles of data science: predictability, computability, and stability (PCS)

Yu, Bin

UC Berkeley Previously Published Works (2018)

Article
Peer Reviewed

Stability

Yu, Bin

UC Berkeley Previously Published Works (2013)

Reproducibility is imperative for any scientific discovery. More often than not, modern scientific findings rely on statistical analysis of high-dimensional data. At a minimum, reproducibility manifests itself in stability of statistical results relative to "reasonable" perturbations to data and to the model used. Jackknife, bootstrap, and cross-validation are based on perturbations to data, while robust statistics methods deal with perturbations to models. In this article, a case is made for the importance of stability in statistics. Firstly, we motivate the necessity of stability for interpretable and reliable encoding models from brain fMRI signals. Secondly, we find strong evidence in the literature to demonstrate the central role of stability in statistical inference, such as sensitivity analysis and effect detection. Thirdly, a smoothing parameter selector based on estimation stability (ES), ES-CV, is proposed for Lasso, in order to bring stability to bear on cross-validation (CV). ES-CV is then utilized in the encoding models to reduce the number of predictors by 60% with almost no loss (1.3%) of prediction performance across over 2,000 voxels. Lastly, a novel "stability" argument is seen to drive new results that shed light on the intriguing interactions between sample-to-sample variability and heavier-tailed error distributions (e.g., double-exponential) in high-dimensional regression models with p predictors and n independent samples. In particular, when p/n → κ ∈ (0.3, 1) and the error distribution is double-exponential, the Ordinary Least Squares (OLS) estimator is a better estimator than the Least Absolute Deviation (LAD) estimator. © 2013 ISI/BS.
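
As a rough illustration of the data-perturbation idea behind the jackknife mentioned in this abstract, the sketch below recomputes a statistic with each observation left out in turn and summarizes how much the result moves. The function name `jackknife_se` and the toy data are assumptions made for illustration; they are not taken from the article, and this is not the ES-CV procedure it proposes.

```python
import math
import statistics


def jackknife_se(data, estimator):
    """Jackknife standard error of `estimator` under leave-one-out perturbations.

    `data` is a list of numbers and `estimator` maps a list to a single number.
    Both names are hypothetical and used only for this sketch.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    # Recompute the statistic n times, each time dropping one observation.
    leave_one_out = [estimator(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    # Standard jackknife variance formula: (n - 1) / n * sum of squared deviations.
    variance = (n - 1) / n * sum((v - center) ** 2 for v in leave_one_out)
    return math.sqrt(variance)


if __name__ == "__main__":
    toy = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]  # made-up data
    print(jackknife_se(toy, statistics.mean))
    print(jackknife_se(toy, statistics.median))
```

For the sample mean, this quantity coincides with the usual standard error (sample standard deviation divided by the square root of n), which gives a simple sanity check. A statistic whose value swings widely under such small perturbations would be considered unstable in the sense the abstract describes.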

Article
Peer Reviewed

Three Principles of Data Science

Yu, Bin

UC Berkeley Previously Published Works (2017)

Article
Peer Reviewed

Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

UC Berkeley Previously Published Works (2013)

We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called "path coding" penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs. © 2013 Julien Mairal and Bin Yu.

Thesis
Peer Reviewed

Statistical Machine Learning for Reliable Hypothesis Generation in Biomedical Problems

Tang, Tiffany
Advisor(s): Yu, Bin

UC Berkeley Electronic Theses and Dissertations (2023)

Given the ever-growing volume and variety of biomedical data, principled analyses of these rich datasets offer an exciting opportunity to accelerate the scientific discovery process. Here, we advance our goal of extracting reliable scientific hypotheses from such data through (I) the in-context development of interpretable statistical machine learning methods, (II) the demonstration of responsible data science in practice, and (III) the dissemination of open-source software and data for reliable data science.

Throughout this dissertation, we build heavily upon the Predictability, Computability, and Stability (PCS) framework and documentation for veridical (trustworthy) data science (Yu and Kumbier, 2020) to improve the reliability of our scientific conclusions. This framework advocates for the use of predictability as a reality check, computability as an important consideration in algorithmic design and data collection, and stability as a minimum requirement for reproducibility and interpretability in knowledge-seeking and decision-making. Moreover, it calls for transparent documentation of decisions made throughout the data science pipeline.

In Part I, we highlight two statistical machine learning methods, developed within the context of grounded biomedical problems and guided by the PCS framework. First, in Chapter 2, we investigate genetic and epistatic drivers of cardiac hypertrophy in hope of obtaining a more complete understanding of the disease architecture. To this end, we develop a data-driven recommendation system, named the low-signal signed iterative random forest (lo-siRF), to identify candidate genes and gene-gene interactions that are both predictive and stable across various model and data perturbations. We then phenotypically validate these genes and gene-gene interactions via gene-silencing experiments and investigate potential mechanistic explanations for the demonstrated epistases. This leads to a hypothesis in which the identified genes interact through mediating the variable binding of transcription factors that are essential for cardiac contractile function and metabolism. Second, the practical utility of random forests and interpretability tools, not only in the search for epistasis but in a wide range of scientific problems, motivates the need for reliable tree-based feature importance measures. In Chapter 3, we demonstrate that the mean decrease in impurity (MDI), arguably the most popular random forest feature importance measure, suffers from well-known biases including against highly-correlated and low-entropy features. To overcome these drawbacks, we develop a novel feature importance framework, MDI+, which leverages a connection between MDI and the R-squared value from linear regression. We show that MDI+ improves the reliability and stability of feature importance rankings across an extensive range of data-inspired simulations and two real-data case studies on drug response prediction and breast cancer subtype prediction.

In Part II, we further expand on the theme of reliable data science and demonstrate it in practice through two collaborative projects in cancer -omics. In Chapters 4 and 5, we incorporate principles from the PCS framework while working in close collaboration with scientists and clinicians to identify stable and predictive biomarkers in drug response prediction and the early detection of pancreatic cancer, respectively.

Finally, in Part III, we introduce open-source software and data to promote and facilitate the broader adoption of reliable, transparent data science for statisticians and substantive researchers. In particular, we highlight three tools that support our goals: (1) simChef, an R package to simplify the creation of tidy, high-quality simulation studies (Chapter 6); (2) vdocs, an interactive virtual lab notebook in R to seamlessly implement, document, and justify human judgment calls throughout the data science pipeline in accordance with the PCS framework (Chapter 7); and (3) a COVID-19 data repository that aided community-wide data science efforts during the height of the pandemic (Chapter 8).

Thesis
Peer Reviewed

Aerosol Retrieval Using Remote-sensed Observations

Wang, Yueqing
Advisor(s): Yu, Bin

UC Berkeley Electronic Theses and Dissertations (2012)

Atmospheric aerosols are solid particles and liquid droplets that are usually smaller than the diameter of a human hair. They can be found drifting in the air in every ecosystem on Earth, leaving significant impacts on human health and our climate. Understanding the spatial and temporal distribution of different atmospheric aerosols, therefore, is an important first step to decode the complex system of aerosols and further, their effects on public health and climate.

The development of remote-sensing radiometers provides a powerful tool to monitor the amount of atmospheric aerosols, as well as their compositions. Radiometers aboard satellites measure the amount of electromagnetic solar radiation. The amount of atmospheric aerosols is further quantified by aerosol optical depth (AOD), defined as the amount of solar radiation that aerosols scatter and absorb in the atmosphere and generally prevent from reaching the Earth's surface. Despite efforts to improve remote-sensing instruments and a great demand for a detailed profile of aerosol spatial distribution, methods that provide AOD estimates at a reasonably fine resolution are lacking. The quantitative uncertainties in the amount of aerosols, and especially aerosol compositions, limit the utility of traditional methods for aerosol retrieval at a fine resolution.

In Chapters 2 and 3 of this thesis, we use statistical methods to estimate aerosol optical depth from remote-sensed radiation. A Bayesian hierarchy proves to be useful for modeling the complicated interactions among aerosols of different amounts and compositions over a large spatial area. Based on the hierarchical model, Chapter 2 estimates and validates aerosol optical depth using Markov chain Monte Carlo methods, while Chapter 3 resorts to an optimization-based approach for faster computation. We extend the focus of our study from the aerosol amount to the aerosol compositions in Chapter 4.

Chapter 1 briefly reviews the characteristics of atmospheric aerosols, including the different types of aerosols and their major impacts on human health. We also introduce a major remote-sensing instrument, NASA's Multi-angle Imaging SpectroRadiometer (MISR), which collects the observations on which our studies are based. Currently, the MISR operational aerosol retrieval algorithm provides estimates of aerosol optical depth at a spatial resolution of 17.6 km.

In Chapter 2, we embed MISR's operational weighted least squares criterion and its forward calculations for aerosol optical depth retrievals in a likelihood framework. We further expand it into a hierarchical Bayesian model to adapt to finer spatial resolution of 4.4 km. To take advantage of the spatial smoothness of aerosol optical depth, our method borrows strength from data at neighboring areas by postulating a Gaussian Markov Random Field prior for aerosol optical depth. Our model considers aerosol optical depth and mixing vectors of different types of aerosols as continuous variables. The inference is then carried out using Metropolis-within-Gibbs sampling methods. Retrieval uncertainties are quantified by posterior variabilities. We also develop a parallel Markov chain Monte Carlo algorithm to improve computational efficiency. We assess our retrieval performance using ground-based measurements from the AErosol RObotic NETwork (AERONET) and satellite images from Google Earth. Based on case studies in the greater Beijing area, China, we show that 4.4 km resolution can improve both the accuracy and coverage of remote-sensed aerosol retrievals, as well as our understanding of the spatial and seasonal behaviors of aerosols. This is particularly important during high-AOD events, which often indicate severe air pollution.

Chapter 3 of this thesis continues to improve our statistical aerosol retrievals for better accuracy and more efficient computation by switching to an optimization-based approach. We first establish objective functions for aerosol optical depth and aerosol compositions, based upon MISR's operational weighted least squares criterion and its forward calculations. Our method also borrows strength from aerosol spatial smoothness by constructing penalty terms in the objective functions. The penalties correspond to a Gaussian Markov Random Field prior for aerosol optical depth and a Dirichlet prior for aerosol mixing vectors under our hierarchical Bayesian scheme; the optimization-based approach corresponds to Bayesian Maximum a Posteriori (MAP) estimation. Our MAP retrieval algorithm is almost 60 times more computationally efficient than the Bayesian retrieval algorithm presented in Chapter 2. To represent the increasing heterogeneity of urban aerosol sources, our model continues to expand the pre-fixed aerosol mixtures used in the MISR operational algorithm by considering aerosol mixing vectors as continuous variables. Our retrievals are again validated using ground-based AERONET measurements. Case studies in the greater Beijing and Zhengzhou areas of China confirm that 4.4 km resolution can improve the accuracy and spatial coverage of remotely-sensed retrievals and enhance our understanding of the spatial behaviors of aerosols.
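
The correspondence between a smoothness penalty and a Gaussian Markov Random Field prior can be illustrated with a deliberately simplified toy problem. The sketch below is an assumption-laden stand-in, not the thesis's retrieval algorithm: it replaces the MISR forward calculations with a direct noisy observation of each value, uses a one-dimensional chain of neighbors instead of a spatial grid, and omits the Dirichlet prior on mixing vectors. The function name `map_smooth` and its parameters are hypothetical.

```python
def map_smooth(observations, smoothness, n_sweeps=500):
    """Toy MAP estimate on a 1-D chain (hypothetical simplification).

    Minimizes  sum_i (y_i - x_i)^2 + smoothness * sum_i (x_i - x_{i+1})^2,
    i.e. a least-squares data term plus a quadratic penalty on differences
    between neighbors, which plays the role of a Gaussian Markov random
    field prior. Solved by repeated coordinate-wise (Gauss-Seidel) updates.
    """
    x = list(observations)
    n = len(x)
    for _ in range(n_sweeps):
        for i in range(n):
            neighbors = []
            if i > 0:
                neighbors.append(x[i - 1])
            if i < n - 1:
                neighbors.append(x[i + 1])
            # Closed-form minimizer of the objective in x[i], others held fixed.
            x[i] = (observations[i] + smoothness * sum(neighbors)) / (
                1.0 + smoothness * len(neighbors)
            )
    return x


def roughness(values):
    """Sum of squared differences between adjacent values."""
    return sum((a - b) ** 2 for a, b in zip(values, values[1:]))


if __name__ == "__main__":
    noisy = [0.1, 0.9, 0.2, 0.8, 0.3]  # made-up observations
    smoothed = map_smooth(noisy, smoothness=2.0)
    print(smoothed, roughness(noisy), roughness(smoothed))
```

With `smoothness` set to zero the estimate reproduces the observations; larger values pull neighboring estimates toward each other, which is the sense in which the penalty "borrows strength" from nearby locations.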

When comparing our aerosol retrievals to the extensive ground-based measurements collected in Baltimore, Maryland, we encountered greater uncertainties in aerosol compositions. This results from both the complex terrain structures of Baltimore and its various aerosol emission sources. Chapter 4, as a result, extends the flexibility of our previous aerosol retrievals by incorporating a complete set of the eight commonly observed types of aerosols. The consequential rise in model complexity is met by a warm-start Markov chain Monte Carlo sampling scheme. We first design two Markov sub-chains, each representing an aerosol mixture containing only four types of the commonly observed aerosols. Combining the samples generated by these two sub-chains, we propose an initialization for the Markov chain that contains all eight types of commonly observed aerosols. Partial information on the interactions of different types of aerosols from the samples generated by the sub-chains proves to be useful in choosing a more efficient initial point for the complete Markov chain. Faster computation is achieved without compromising either the retrieval accuracy or the spatial resolution of the estimated aerosol optical depth. In the end, through case studies of aerosol retrievals for the Baltimore area, we explore the potential of remote-sensed retrievals for improving our understanding of aerosol compositions.

Article
Peer Reviewed

Impact of Regularization on Spectral Clustering

UC Berkeley Previously Published Works (2014)

Clustering in networks/graphs is an important problem with applications in the analysis of gene-gene interactions, social networks, and text mining, to name a few. Spectral clustering is one of the more popular techniques for such purposes, chiefly due to its computational advantage and generality of application. The algorithm's generality arises from the fact that it is not tied to any modeling assumptions on the data, but is rooted in intuitive measures of community structure such as sparsest-cut-based measures (Hagen and Kahng (1992), Shi and Malik (2000), Ng et al. (2002)). © 2014 IEEE.

Thesis
Peer Reviewed

Fast MCMC algorithms, Stability and DeepTune

Chen, Yuansi
Advisor(s): Yu, Bin

UC Berkeley Electronic Theses and Dissertations (2019)

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. In statistics, sampling methods are useful for both estimation and inference, including problems such as estimating expectations of desired quantities, computing probabilities of rare events, gauging volumes of particular sets, exploring posterior distributions, and obtaining credible intervals.

Facing massive high-dimensional data, both computational efficiency and good statistical guarantees are more and more important in modern statistical and machine learning applications. In this thesis, centered around sampling algorithms, we consider the fundamental questions on their computational and statistical guarantees: How to design a fast sampling algorithm and how long should it be run? What are the statistical learning guarantees of these algorithms? Are there any trade-offs between computation and learning?

To answer these questions, we first establish non-asymptotic convergence guarantees for popular MCMC sampling algorithms in the Bayesian literature: Metropolized Random Walk, Metropolis-adjusted Langevin algorithm, and Hamiltonian Monte Carlo. To address a number of technical challenges that arise en route, we develop results based on the conductance profile in order to prove quantitative convergence guarantees for general continuous state space Markov chains. Second, to confront a large class of constrained sampling problems, we introduce two new algorithms, Vaidya and John walks, to sample from polytope-constrained distributions with convergence guarantees. Third, we prove fundamental trade-off results between statistical learning performance and convergence rate of any iterative learning algorithm, including sampling algorithms. The trade-off results allow us to show that a too stable algorithm cannot converge too fast, and vice versa. Finally, to help neuroscientists analyze their massive amount of brain data, we develop DeepTune, a stability-driven visualization and interpretation framework via optimization and sampling for the neural-network-based models of neurons in visual cortex.
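
For readers unfamiliar with the first algorithm named above, the following is a minimal sketch of a Metropolized random walk targeting a one-dimensional density known up to a constant. It shows only the basic sampler; it does not reproduce the thesis's convergence analysis, and the function name, step size, and standard normal target are assumptions chosen for illustration.

```python
import math
import random


def metropolis_random_walk(log_density, x0, n_steps, step_size, rng):
    """Random-walk Metropolis sampler in one dimension (illustrative sketch).

    `log_density` returns the log of the target density up to an additive
    constant. All names and defaults here are hypothetical.
    """
    samples = []
    x = x0
    log_p = log_density(x)
    for _ in range(n_steps):
        # Symmetric Gaussian proposal centered at the current state.
        proposal = x + step_size * rng.gauss(0.0, 1.0)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        accept_prob = math.exp(min(0.0, log_p_new - log_p))
        if rng.random() < accept_prob:
            x, log_p = proposal, log_p_new
        samples.append(x)
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    # Target: standard normal, log density -x^2 / 2 up to a constant.
    draws = metropolis_random_walk(lambda x: -0.5 * x * x, 0.0, 20000, 2.0, rng)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)
```

The question of how many steps such a chain needs before its draws are close to the target is exactly the kind of "how long should it be run" question the abstract raises; the sketch simply runs a fixed number of steps and does not answer it.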

Article
Peer Reviewed

Comments on: High-dimensional simultaneous inference with the bootstrap

UC Berkeley Previously Published Works (2017)

Article
Peer Reviewed

Local Identifiability of ℓ1-minimization Dictionary Learning: a Sufficient and Almost Necessary Condition

UC Berkeley Previously Published Works (2018)