## Scholarly Works (56 results)

Crowdsourcing has become an effective and popular tool for human-powered computation to label large datasets. Since the workers can be unreliable, it is common in crowdsourcing to assign multiple workers to one task and to aggregate their labels in order to obtain results of high quality. This thesis analyzes the process of crowdsourced labeling and proposes algorithms to improve its efficiency.

In Chapter 1, we give an overview of crowdsourcing and its applications. We also discuss the challenges of labeling by crowdsourcing and present the organization of the thesis.

In Chapter 2, we provide finite-sample exponential bounds on the error rate (in probability and in expectation) of general aggregation rules under the Dawid-Skene crowdsourcing model. The bounds are derived for multi-class labeling and can be used to analyze many aggregation methods, including majority voting, weighted majority voting, and the oracle Maximum A Posteriori (MAP) rule. We show that the oracle MAP rule approximately optimizes the upper bound on the mean error rate of weighted majority voting in certain settings.
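To make the aggregation rules concrete, the sketch below implements majority voting and weighted majority voting for binary labels; for the binary one-coin Dawid-Skene model with known worker accuracies, log-odds weights yield the oracle MAP rule. The worker accuracies and labels here are toy values, not data from the thesis.

```python
import numpy as np

def majority_vote(labels):
    """Unweighted majority vote over binary labels in {-1, +1}.
    labels: (n_workers, n_items); ties (column sum 0) are returned as 0."""
    return np.sign(labels.sum(axis=0))

def weighted_majority_vote(labels, accuracies):
    """Weighted majority vote with log-odds weights w_i = log(p_i / (1 - p_i)).
    Under the binary one-coin Dawid-Skene model with known accuracies p_i,
    this coincides with the oracle MAP rule."""
    w = np.log(accuracies / (1.0 - accuracies))
    return np.sign(w @ labels)

# toy example: 5 workers of varying (assumed known) accuracy label 4 items
rng = np.random.default_rng(0)
truth = np.array([1, -1, 1, 1])
p = np.array([0.9, 0.85, 0.6, 0.55, 0.52])
labels = np.where(rng.random((5, 4)) < p[:, None], truth, -truth)
print(majority_vote(labels), weighted_majority_vote(labels, p))
```

Note that a highly reliable worker can override several mediocre ones under the weighted rule, which is exactly the behavior the error-rate bounds quantify.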

In Chapter 3, we propose an iterative weighted majority voting (IWMV) method that optimizes the error rate bound and approximates the oracle MAP rule. Its one-step version has a provable theoretical guarantee on the error rate. The IWMV method is intuitive and computationally simple. Compared to the traditional EM approach, IWMV is more robust to noise in the estimates of workers' reliabilities and to model misspecification. Experimental results on both simulated and real data show that IWMV performs at least on par with state-of-the-art methods while running around one hundred times faster.
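The IWMV idea can be sketched as an alternation between aggregating labels with the current weights and re-estimating worker reliabilities from agreement with the current aggregate. The log-odds weight formula below is one natural choice consistent with the oracle MAP rule discussed above; the thesis derives the exact weights, so treat this as an illustrative approximation.

```python
import numpy as np

def iwmv(labels, n_iter=20):
    """Iterative weighted majority voting (sketch).
    labels: (n_workers, n_items) array with entries in {-1, +1}.
    Alternates between (1) aggregating with the current weights and
    (2) re-estimating each worker's reliability as the fraction of
    agreement with the current aggregate."""
    y = np.sign(labels.sum(axis=0))           # initialize with majority vote
    y[y == 0] = 1                             # break ties arbitrarily
    for _ in range(n_iter):
        p = np.clip((labels == y).mean(axis=1), 1e-3, 1 - 1e-3)
        w = np.log(p / (1.0 - p))             # log-odds weights (illustrative)
        y_new = np.sign(w @ labels)
        y_new[y_new == 0] = 1
        if np.array_equal(y_new, y):          # converged
            break
        y = y_new
    return y
```

Each iteration is one matrix product plus a mean, which is the source of the method's low computational cost compared with EM-style inference.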

In Chapter 4, we study the problem of selecting subsets of workers from a given worker pool to maximize accuracy under a budget constraint. One natural question is whether we should hire as many workers as the budget allows, or restrict ourselves to a small number of top-quality workers. By theoretically analyzing the error rate of a typical crowdsourcing setting, we frame worker selection as a combinatorial optimization problem and propose an algorithm to solve it efficiently. Empirical results on both simulated and real-world datasets show that our algorithm is able to select a small number of high-quality workers, and performs as well as, and sometimes better than, the much larger crowds that the budget allows.

In practice, worker quality (for some tasks) can be associated with explicit characteristics of the worker, such as education level, major, and age, although for other tasks none of these attributes may matter. In Chapter 5, we propose a general crowd targeting framework that can automatically discover, for a given task, whether any group of workers defined by their attributes has higher quality on average, and target these groups, if they exist, for future work on the same task. This framework is complementary to traditional worker quality estimation approaches and is more budget-efficient because we are able to target potentially good workers before they actually do the task. Experiments on real datasets show that the accuracy of the final prediction can be improved significantly for the same budget, or in some cases even a smaller one. Our framework is very practical and can be easily integrated into current crowdsourcing platforms.

Spatial gene expression data enable the detection of local covariability and are extremely useful for identifying local gene interactions during normal development. The abundance of spatial expression data in recent years has led to the modeling and analysis of regulatory networks. The inherent complexity of such data makes it a challenge to extract biological information. In the first part of the thesis, we developed staNMF, a method that combines a dictionary learning algorithm called nonnegative matrix factorization (NMF) with a new stability-driven criterion for selecting the number of dictionary atoms. When applied to a set of *Drosophila* early embryonic spatial gene expression images, one of the largest datasets of its kind, staNMF identified a dictionary with 21 atoms, which we call *principal patterns* (PP). Providing a compact yet biologically interpretable representation of *Drosophila* expression patterns, PP are comparable to a fate map generated experimentally by laser ablation and show exceptional promise as a data-driven alternative to manual annotations. Our analysis mapped genes to cell-fate programs and assigned putative biological roles to uncharacterized genes. Furthermore, we used the PP to generate local transcription factor (TF) regulatory networks. Spatially local correlation networks (SLCN) were constructed for six PP that span the embryonic anterior-posterior axis. Using a two-tailed 5% cut-off on correlation, we reproduced 10 of the 11 links in the well-studied gap gene network. The performance of PP on the *Drosophila* data suggests that staNMF provides informative decompositions and constitutes a useful computational lens through which to extract biological insight from complex and often noisy gene expression data.
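A minimal, simplified stand-in for the staNMF stability criterion can be written with off-the-shelf NMF: run the factorization from several random restarts and score how well the resulting dictionaries match under the best one-to-one alignment of atoms. The matching-by-cosine-similarity scheme and the demo data are illustrative assumptions, not the exact criterion of the thesis.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.decomposition import NMF

def dictionary_stability(X, k, n_runs=5, seed=0):
    """Stability score for an NMF dictionary with k atoms: average
    best-match cosine similarity between dictionaries learned from
    different random restarts (a simplified stand-in for staNMF)."""
    dicts = []
    for r in range(n_runs):
        W = NMF(n_components=k, init='random', random_state=seed + r,
                max_iter=500).fit_transform(X)
        dicts.append(W / (np.linalg.norm(W, axis=0, keepdims=True) + 1e-12))
    scores = []
    for a in range(n_runs):
        for b in range(a + 1, n_runs):
            C = dicts[a].T @ dicts[b]             # atom-by-atom similarity
            row, col = linear_sum_assignment(-C)  # best one-to-one matching
            scores.append(C[row, col].mean())
    return float(np.mean(scores))

# demo: rank-3 nonnegative data should be highly stable at k = 3;
# staNMF-style selection would scan k and pick the most stable value, e.g.
#   best_k = max(range(2, 30), key=lambda k: dictionary_stability(X, k))
rng = np.random.default_rng(0)
X = rng.random((30, 3)) @ rng.random((3, 40))
print(dictionary_stability(X, 3, n_runs=3))
```

The design choice is that stability, not reconstruction error, drives model selection: reconstruction error always improves with more atoms, whereas reproducibility across restarts peaks near an interpretable decomposition.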

The biological interpretability of the NMF-derived dictionary motivated us to understand analytically why dictionary learning works. In particular, if the observed data are generated from a ground truth dictionary, under what conditions can dictionary learning recover the true dictionary? In the second part of the thesis, we studied the local correctness, or *local identifiability*, of a particular dictionary learning formulation with the $l_1$-norm objective function. Suppose we observe $N$ data points $x_i \in \mathbb{R}^K$ for $i = 1, \ldots, N$, where the $x_i$'s are i.i.d. random linear combinations of the $K$ columns of a square and invertible dictionary $D_0 \in \mathbb{R}^{K \times K}$. We assumed that the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. For the population case, we established a sufficient and almost necessary condition for $D_0$ to be locally identifiable, i.e., a local minimum of the expected $l_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition of Gribonval and Schnass (2010). Moreover, we demonstrated that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-products at most $\mu \in [0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzeros on average. Finally, we showed that our local identifiability results translate to the finite-sample case with high probability provided $N = O(K \log K)$.

Rapidly moving technologies are transforming the rate at which researchers accumulate information. Large, rich datasets hold the promise of new insights into complex natural phenomena that will help advance the frontier of science. Here we aim to develop new statistics/data science principles and scalable algorithms for extracting reliable and reproducible information from these data.

Chapter 1 provides an overview of the work contained in this thesis. It discusses the growing availability of genomic data and the statistical machine learning tools that are being used to provide a systems-level understanding of genomic phenomena.

Chapter 2 introduces the predictability, computability, and stability (PCS) framework. The PCS framework builds on key ideas in machine learning, using predictability as a reality check and evaluating computational considerations in data collection, data storage, and algorithm design. It augments predictability and computability with an overarching stability principle, which expands statistical uncertainty considerations to assess how results vary with respect to choices (or perturbations) made across the data science life cycle. In this chapter, we develop PCS inference through perturbation intervals and PCS hypothesis testing to investigate the reliability of data results. We compare PCS inference with existing methods in high-dimensional sparse linear model simulations and demonstrate that our approach compares favorably to others, in terms of ROC curves, over a wide range of simulation settings. Finally, we propose documentation based on R Markdown, iPython, or Jupyter Notebook, with publicly available, reproducible code and narratives to justify the human choices made throughout an analysis.
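A perturbation interval in the PCS spirit can be sketched as follows, using bootstrap resampling as the data perturbation; the choice of perturbation and the 5-95 percentile summary are illustrative assumptions rather than the framework's prescribed defaults.

```python
import numpy as np

def perturbation_interval(x, estimator, n_boot=200, q=(5, 95), seed=0):
    """PCS-style perturbation interval (sketch): re-apply an estimator
    under data perturbations (here, bootstrap resampling) and report
    the spread of the resulting estimates as a stability interval."""
    rng = np.random.default_rng(seed)
    n = len(x)
    stats = [estimator(x[rng.integers(0, n, n)]) for _ in range(n_boot)]
    lo, hi = np.percentile(stats, q)
    return lo, hi

# the interval reflects how stable the estimate is to resampling the data;
# other perturbations (model choices, preprocessing) can be swapped in
x = np.random.default_rng(2).normal(loc=1.0, size=100)
print(perturbation_interval(x, np.mean))
```

The same wrapper applies to any estimator and any perturbation of the analysis pipeline, which is what lets the stability principle span the full data science life cycle rather than a single model.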

As an example of the PCS framework in practice, Chapter 3 develops the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as Random Forests (RF). We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, 80% of the pairwise transcription factor interactions iRF identified as stable have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th- and 6th-order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
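The iRF reweighting loop can be caricatured as follows. The published algorithm reweights the per-split feature sampling inside Random Forests; since that hook is not exposed by common libraries, this sketch emulates it by sampling a feature subset per tree in proportion to the previous iteration's importances.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def irf_feature_weights(X, y, n_iter=3, n_trees=50, seed=0):
    """Sketch of the iRF reweighting loop: each iteration grows trees on
    feature subsets drawn with probability proportional to the previous
    iteration's importances, so informative features are examined more
    often (a per-tree emulation of iRF's per-split feature reweighting)."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    w = np.full(n_feat, 1.0 / n_feat)             # start from uniform weights
    for _ in range(n_iter):
        imp = np.zeros(n_feat)
        for _ in range(n_trees):
            feats = rng.choice(n_feat, size=max(1, n_feat // 2),
                               replace=False, p=w)
            tree = DecisionTreeClassifier(
                max_depth=4, random_state=int(rng.integers(10**9)))
            tree.fit(X[:, feats], y)
            imp[feats] += tree.feature_importances_
        w = (imp + 1e-12) / (imp + 1e-12).sum()   # renormalize for next round
    return w
```

In the full algorithm, stable interactions are then read off the decision paths of the final, reweighted forest; here we return only the feature weights that drive that search.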

Chapter 4 refines iRF to explicitly map responses as a function of interacting features. Our proposed method, signed iRF (siRF), describes "subsets" of rules that frequently occur on RF decision paths. We refer to these rule subsets as signed interactions. RF decision paths containing the same signed interaction not only share a set of interacting features but also exhibit similar thresholding behavior, and thus describe a consistent functional relationship between interacting features and responses. We formulate stable and predictive importance metrics (SPIMs) to rank signed interactions in terms of their stability, predictive accuracy, and strength of interaction. For each SPIM, we define null importance metrics that characterize its expected behavior under known structure. We evaluate siRF in biologically inspired simulations and two case studies: predicting enhancer activity and spatial gene expression patterns. In the case of spatial gene expression patterns, siRF recovered all 11 reported links in the gap gene network. In the case of enhancer activity, siRF discovered rules that identify enhancer elements in Drosophila embryos with high precision, suggesting candidate biological mechanisms for experimental studies. By refining the process of interaction discovery, siRF has the potential to guide mechanistic inquiry into systems whose scale and complexity are beyond human comprehension.

Drawing samples from a known distribution is a core computational challenge common to many disciplines, with applications in statistics, probability, operations research, and other areas involving stochastic models. In statistics, sampling methods are useful for both estimation and inference, including problems such as estimating expectations of desired quantities, computing probabilities of rare events, gauging volumes of particular sets, exploring posterior distributions, and obtaining credible intervals.

In the face of massive high-dimensional data, both computational efficiency and good statistical guarantees are increasingly important in modern statistical and machine learning applications. In this thesis, centered on sampling algorithms, we consider fundamental questions about their computational and statistical guarantees: How do we design a fast sampling algorithm, and how long should it be run? What statistical learning guarantees do these algorithms have? Are there trade-offs between computation and learning?

To answer these questions, we first establish non-asymptotic convergence guarantees for popular MCMC sampling algorithms in the Bayesian literature: the Metropolized Random Walk, the Metropolis-adjusted Langevin algorithm, and Hamiltonian Monte Carlo. To address a number of technical challenges that arise en route, we develop results based on the conductance profile in order to prove quantitative convergence guarantees for general continuous-state-space Markov chains. Second, to confront a large class of constrained sampling problems, we introduce two new algorithms, the Vaidya and John walks, to sample from polytope-constrained distributions with convergence guarantees. Third, we prove fundamental trade-offs between the statistical learning performance and the convergence rate of any iterative learning algorithm, including sampling algorithms. The trade-off results allow us to show that an algorithm that is too stable cannot converge too fast, and vice versa. Finally, to help neuroscientists analyze their massive amounts of brain data, we develop DeepTune, a stability-driven visualization and interpretation framework, based on optimization and sampling, for neural-network-based models of neurons in the visual cortex.
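The first of these samplers, the Metropolized Random Walk, is short enough to sketch in full: propose a Gaussian step and accept or reject it so that the chain leaves the target distribution invariant. The step size and the standard Gaussian target below are arbitrary illustrative choices, not the settings analyzed in the thesis.

```python
import numpy as np

def metropolized_random_walk(log_density, x0, step, n_steps, seed=0):
    """Metropolized Random Walk: propose a Gaussian step, then accept or
    reject so that the chain's stationary distribution has the given
    (unnormalized) log-density."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        prop = x + step * rng.standard_normal(x.size)
        # accept with probability min(1, pi(prop) / pi(x))
        if np.log(rng.random()) < log_density(prop) - log_density(x):
            x = prop
        samples[t] = x
    return samples

# target: standard 2-d Gaussian; after burn-in the samples cover the target
chain = metropolized_random_walk(lambda z: -0.5 * z @ z, np.zeros(2), 0.8, 5000)
print(chain[1000:].mean(axis=0), chain[1000:].var(axis=0))
```

The non-asymptotic question studied in the thesis is precisely how many such steps are needed, as a function of dimension and target regularity, before the chain's distribution is close to the target.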

This dissertation is on high-dimensional data and their regularization through dimension reduction and penalization.

We start with two real-world problems to illustrate the practical difficulties, and remedies, in analyzing high-dimensional data. In Chapter 1, we are tasked with modeling and predicting the U.S. stock market, where the number of stocks far exceeds the number of days relevant to the current market. Through an existing statistical arbitrage framework, we reduce the dimension of our problem with the use of correspondence analysis. We develop a data-driven regression model and highlight some common statistical methods that improve our predictions. In Chapter 2, we attempt to detect and predict system anomalies in large enterprise telephony systems. We do this by processing large amounts of unstructured log files, again with dimension reduction methods, allowing effective visualization and automatic filtering of results.

We then move on to more general methodology and analysis in high dimensions.

In Chapter 3, we consider regularization methods, often used in dealing with high dimensional data, and tackle the problem of selecting the associated regularization parameter. We introduce *SSCV*, a selection criterion based on statistical stability, but also incorporating model fit, and show that it can often outperform the popular cross validation. Finally, we explore robust methods in the high dimensional setting in Chapter 4. We focus on the relative performance and distributional robustness of the estimators optimizing *L1* and *L2* loss functions respectively. We verify some expected results and also highlight cases where results from classical asymptotics fail, setting the stage for future theoretical work.

Atmospheric aerosols are solid particles and liquid droplets that are usually smaller than the diameter of a human hair. They can be found drifting in the air in every ecosystem on Earth, leaving significant impacts on human health and our climate. Understanding the spatial and temporal distribution of different atmospheric aerosols, therefore, is an important first step to decode the complex system of aerosols and further, their effects on public health and climate.

The development of remote-sensing radiometers provides a powerful tool to monitor the amount of atmospheric aerosols, as well as their compositions. Radiometers aboard satellites measure the amount of electromagnetic solar radiation. The amount of atmospheric aerosols is further quantified by aerosol optical depth (AOD), defined as the amount of solar radiation that aerosols scatter and absorb in the atmosphere and generally prevent from reaching the Earth's surface. Despite efforts to improve remote-sensing instruments and a great demand for a detailed profile of aerosol spatial distribution, methods that provide AOD estimates at a reasonably fine resolution are lacking. The quantitative uncertainties in the amount of aerosols, and especially in aerosol compositions, limit the utility of traditional methods for aerosol retrieval at fine resolution.

In Chapters 2 and 3 of this thesis, we apply statistical methods to estimate aerosol optical depth from remote-sensed radiation. A Bayesian hierarchy proves useful for modeling the complicated interactions among aerosols of different amounts and compositions over a large spatial area. Based on the hierarchical model, Chapter 2 estimates and validates aerosol optical depth using Markov chain Monte Carlo methods, while Chapter 3 resorts to an optimization-based approach for faster computation. We extend our focus from aerosol amounts to aerosol compositions in Chapter 4.

Chapter 1 briefly reviews the characteristics of atmospheric aerosols, including the different types of aerosols and their major impacts on human health. We also introduce a major remote-sensing instrument, NASA's Multi-angle Imaging SpectroRadiometer (MISR), which collects the observations on which our studies are based. Currently, the MISR operational aerosol retrieval algorithm provides estimates of aerosol optical depth at a spatial resolution of 17.6 km.

In Chapter 2, we embed MISR's operational weighted least squares criterion and its forward calculations for aerosol optical depth retrievals in a likelihood framework. We further expand it into a hierarchical Bayesian model to adapt to a finer spatial resolution of 4.4 km. To take advantage of the spatial smoothness of aerosol optical depth, our method borrows strength from data at neighboring areas by postulating a Gaussian Markov Random Field prior for aerosol optical depth. Our model treats aerosol optical depth and the mixing vectors of different types of aerosols as continuous variables. Inference is then carried out using Metropolis-within-Gibbs sampling, and retrieval uncertainties are quantified by posterior variabilities. We also develop a parallel Markov chain Monte Carlo algorithm to improve computational efficiency. We assess our retrieval performance using ground-based measurements from the AErosol RObotic NETwork (AERONET) and satellite images from Google Earth. Based on case studies in the greater Beijing area of China, we show that 4.4 km resolution can improve both the accuracy and coverage of remote-sensed aerosol retrievals, as well as our understanding of the spatial and seasonal behaviors of aerosols. This is particularly important during high-AOD events, which often indicate severe air pollution.
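The pixel-wise sampling scheme can be illustrated on a toy version of the problem: a Gaussian likelihood around each observation plus an intrinsic GMRF smoothness prior, updated by Metropolis-within-Gibbs. The toy likelihood stands in for MISR's forward calculations, and all parameters below are illustrative assumptions.

```python
import numpy as np

def gmrf_log_prior(tau, field, i, j, value):
    """Intrinsic GMRF prior term at pixel (i, j): quadratic penalty on
    differences from the pixel's grid neighbors."""
    log_p = 0.0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ni, nj = i + di, j + dj
        if 0 <= ni < field.shape[0] and 0 <= nj < field.shape[1]:
            log_p -= 0.5 * tau * (value - field[ni, nj]) ** 2
    return log_p

def mwg_smooth(obs, sigma, tau, n_sweeps=200, step=0.1, seed=0):
    """Metropolis-within-Gibbs sketch: sweep over pixels, updating each
    with a random-walk proposal against a Gaussian likelihood around the
    observation plus the GMRF smoothness prior (a toy stand-in for the
    AOD retrieval model; no MISR forward calculations here)."""
    rng = np.random.default_rng(seed)
    field = obs.astype(float).copy()
    for _ in range(n_sweeps):
        for i in range(field.shape[0]):
            for j in range(field.shape[1]):
                cur = field[i, j]
                prop = cur + step * rng.standard_normal()
                def logp(v):
                    return (-0.5 * (obs[i, j] - v) ** 2 / sigma ** 2
                            + gmrf_log_prior(tau, field, i, j, v))
                if np.log(rng.random()) < logp(prop) - logp(cur):
                    field[i, j] = prop
    return field
```

Because each pixel's conditional distribution depends only on its observation and its grid neighbors, the prior lets noisy fine-resolution retrievals borrow strength from adjacent areas, which is the mechanism that makes 4.4 km estimation feasible.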

Chapter 3 of this thesis continues to improve our statistical aerosol retrievals, for better accuracy and more efficient computation, by switching to an optimization-based approach. We first establish objective functions for aerosol optical depth and aerosol compositions based on the MISR operational weighted least squares criterion and its forward calculations. Our method again borrows strength from aerosol spatial smoothness by constructing penalty terms in the objective functions. The penalties correspond to a Gaussian Markov Random Field prior for aerosol optical depth and a Dirichlet prior for aerosol mixing vectors under our hierarchical Bayesian scheme; the optimization-based approach thus corresponds to Bayesian Maximum a Posteriori (MAP) estimation. Our MAP retrieval algorithm runs almost 60 times faster than the Bayesian retrieval algorithm presented in Chapter 2. To represent the increasing heterogeneity of urban aerosol sources, our model continues to expand the pre-fixed aerosol mixtures used in the MISR operational algorithm by treating aerosol mixing vectors as continuous variables. Our retrievals are again validated using ground-based AERONET measurements. Case studies in the greater Beijing and Zhengzhou areas of China confirm that 4.4 km resolution can improve the accuracy and spatial coverage of remotely-sensed retrievals and enhance our understanding of the spatial behaviors of aerosols.

When comparing our aerosol retrievals to the extensive ground-based measurements collected in Baltimore, Maryland, we encountered greater uncertainties in aerosol compositions. This results from both the complex terrain of Baltimore and its varied aerosol emission sources. Chapter 4, as a result, extends the flexibility of our previous aerosol retrievals by incorporating a complete set of the eight commonly observed types of aerosols. The consequent rise in model complexity is met by a warm-start Markov chain Monte Carlo sampling scheme. We first design two Markov sub-chains, each representing an aerosol mixture containing only four of the commonly observed aerosol types. Combining the samples generated by these two sub-chains, we propose an initialization for the Markov chain that contains all eight types of commonly observed aerosols. Partial information on the interactions of different types of aerosols, obtained from the sub-chain samples, proves useful in choosing a more efficient initial point for the complete Markov chain. Faster computation is achieved without compromising the retrieval accuracy or the spatial resolution of the estimated aerosol optical depth. Finally, through case studies of aerosol retrievals for the Baltimore area, we explore the potential of remote-sensed retrievals to improve our understanding of aerosol compositions.