## Scholarly Works (79 results)

Modern technological advances have prompted massive-scale data collection in fields ranging from artificial intelligence to the traditional sciences. This has led to an increasing need for scalable machine learning algorithms and statistical methods to draw conclusions about the world. In all data-driven procedures, the data scientist faces the following fundamental questions: How should I design the learning algorithm, and how long should I run it? Which samples should I collect for training, and how many are sufficient to generalize conclusions to unseen data? These questions relate to statistical and computational properties of both the data and the algorithm. This thesis explores their role in the areas of non-convex optimization, non-parametric estimation, active learning, and multiple testing.

In the first part, we provide insights of several flavors concerning the interplay between statistical and computational properties of first-order methods applied to common estimation procedures. The expectation-maximization (EM) algorithm estimates the parameters of a latent variable model by running a first-order method on a non-convex landscape. We identify and characterize a general class of hidden Markov models for which linear convergence of EM to a statistically optimal point is provable for a large initialization radius. For non-parametric estimation problems, functional gradient descent (also called boosting) algorithms are used to estimate the best fit in infinite-dimensional function spaces. We develop a new proof technique showing that stopping the algorithm early may also yield an optimal estimator without explicit regularization. In fact, the same key quantities (localized complexities) underlie both traditional penalty-based and algorithmic regularization.
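
The regularizing effect of early stopping can be illustrated on a toy least-squares problem (a hypothetical simulation, not the thesis's boosting analysis): gradient descent iterates trace a path from heavy shrinkage toward the unregularized fit, so the stopping time itself plays the role of a regularization parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression: n samples, d features, noisy linear response.
n, d = 50, 20
X = rng.standard_normal((n, d))
beta_true = np.zeros(d)
beta_true[:3] = 1.0
y = X @ beta_true + 0.5 * rng.standard_normal(n)

def gd_path(X, y, step=1e-3, iters=4000):
    """Run gradient descent on least squares, recording every iterate."""
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(iters):
        beta -= step * X.T @ (X @ beta - y)
        path.append(beta.copy())
    return path

path = gd_path(X, y)

# The stopping time acts as a tuning parameter: early iterates are
# heavily shrunk, late iterates approach the unregularized least-squares
# fit, and the estimation error is typically minimized in between.
errors = [np.linalg.norm(b - beta_true) for b in path]
best_t = int(np.argmin(errors))
```

Sweeping over the stopping time here is the algorithmic analog of sweeping over a ridge penalty; the thesis's point is that the optimal stopping time is governed by the same localized complexities that govern penalty-based regularization.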

In the second part of the thesis, we explore how data collected adaptively, using constantly updated estimates, can lead to significant reductions in sample complexity for multiple hypothesis testing problems. In particular, we show how adaptive strategies can be used to simultaneously control the false discovery rate over multiple tests and return the best alternative (among many) for each test with optimal sample complexity in an online manner.
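
To give a flavor of why adaptive sampling reduces sample complexity, here is a minimal successive-elimination sketch for best-arm identification (a standard textbook procedure, not the thesis's FDR-controlling algorithm): sampling effort concentrates on arms that are still plausibly best, so clearly suboptimal alternatives stop consuming samples early.

```python
import numpy as np

rng = np.random.default_rng(1)

def successive_elimination(means, delta=0.05, batch=100, max_rounds=200):
    """Best-arm identification by successive elimination: sample every
    surviving arm in batches, then drop any arm whose confidence interval
    falls below the leader's lower bound (Hoeffding-style radius with a
    crude union bound over arms and rounds)."""
    k = len(means)
    active = list(range(k))
    counts = np.zeros(k)
    sums = np.zeros(k)
    for _ in range(max_rounds):
        for a in active:
            sums[a] += rng.normal(means[a], 1.0, batch).sum()
            counts[a] += batch
        idx = np.array(active)
        mu = sums[idx] / counts[idx]
        rad = np.sqrt(2 * np.log(2 * k * max_rounds / delta) / counts[idx])
        leader_lcb = (mu - rad).max()
        active = [a for a, m, r in zip(active, mu, rad) if m + r >= leader_lcb]
        if len(active) == 1:
            break
    idx = np.array(active)
    best = active[int(np.argmax(sums[idx] / counts[idx]))]
    return best, int(counts.sum())

# Arm 0 has the highest mean; the suboptimal arms are eliminated early,
# so far fewer samples are spent than uniform sampling to the budget.
best_arm, total_samples = successive_elimination([0.5, 0.3, 0.0])
```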

Central to many statistical inference problems is the computation of quantities defined over variables that can be fruitfully modeled in terms of graphs. Examples of such quantities include marginal distributions over graphical models and empirical averages of observations over sensor networks. For practical purposes, distributed message-passing algorithms are well suited to such problems: the computation is broken into pieces and distributed among different nodes. Following some local computations, the intermediate results are shared among neighboring nodes via so-called messages, and the process is repeated until the desired quantity is obtained. These distributed inference algorithms have two primary aspects: statistical properties, which characterize how mathematically sound an algorithm is, and computational complexity, which describes the efficiency of a particular algorithm. In this thesis, we propose low-complexity (efficient) message-passing algorithms for some well-known inference problems, together with rigorous mathematical analysis of their performance. These problems include the computation of marginal distributions via belief propagation for discrete as well as continuous random variables, and the computation of the average of distributed observations in a noisy sensor network via gossip-type algorithms.
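
A minimal sketch of gossip-style averaging (illustrative only; the thesis analyzes such algorithms under channel noise): nodes on a ring repeatedly average their value with a random neighbour, and every node's value converges to the network-wide mean using purely local exchanges.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sensor network on a ring: each node holds one noisy observation,
# and the goal is for every node to learn the global average using only
# exchanges with its immediate neighbours.
n = 20
x = rng.normal(5.0, 2.0, n)
target = x.mean()
neighbors = {i: ((i - 1) % n, (i + 1) % n) for i in range(n)}

vals = x.copy()
for _ in range(20000):
    i = int(rng.integers(n))
    j = neighbors[i][int(rng.integers(2))]
    # Pairwise averaging preserves the global sum, so the common limit
    # of all nodes can only be the true average.
    vals[i] = vals[j] = 0.5 * (vals[i] + vals[j])
```

The number of exchanges needed for a given accuracy depends on the network topology (its spectral gap); a ring mixes slowly, while better-connected graphs need far fewer rounds.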

Advances in data acquisition and the emergence of new sources of data have, in recent years, led to the generation of massive datasets in many fields of science and engineering. These datasets are usually characterized by high dimension and a comparatively small number of samples. Without appropriate modifications, classical tools of statistical analysis are not applicable in these "high-dimensional" settings. Much of the effort of contemporary research in statistics and related fields is to extend inference procedures, methodologies, and theories to these new datasets. One widely used assumption which can mitigate the effects of dimensionality is sparsity of the underlying parameters.

In the first half of this thesis we consider principal component analysis (PCA), a classical dimension-reduction procedure, in the high-dimensional setting with "hard" sparsity constraints. We analyze the statistical performance of two modified procedures for PCA: a simple diagonal cut-off method and a more elaborate semidefinite programming (SDP) relaxation. Our results characterize the statistical complexity of the two methods in terms of the number of samples required for asymptotic recovery, and they reveal a trade-off between statistical and computational complexity.

In the second half of the thesis, we consider PCA in function spaces (fPCA), an infinite-dimensional analog of PCA also known as the Karhunen-Loève transform. We introduce a function-theoretic framework to study the effects of sampling in fPCA under smoothness constraints on the functions. The framework generates high-dimensional models with a different type of structural assumption, an "ellipsoid" condition, which can be thought of as a soft sparsity constraint. We propose an M-estimator of principal component subspaces which takes the form of a regularized eigenvalue problem, provide rates of convergence for the estimator, and show minimax optimality. Along the way, some problems in approximation theory are also discussed.
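
A minimal sketch of the diagonal cut-off idea in a spiked covariance model (a hypothetical simulation, not the thesis's analysis): coordinates carrying the sparse component have inflated variance, so thresholding the diagonal of the sample covariance recovers the support before ordinary PCA is applied.

```python
import numpy as np

rng = np.random.default_rng(3)

# Spiked covariance model with one sparse principal component:
# cov = I + signal * v v^T, where v is supported on k coordinates.
n, d, k, signal = 200, 300, 5, 10.0
v = np.zeros(d)
v[:k] = 1.0 / np.sqrt(k)
z = rng.standard_normal((n, 1))
X = np.sqrt(signal) * z @ v[None, :] + rng.standard_normal((n, d))

def diagonal_cutoff_pca(X, k):
    """Diagonal cut-off: keep the k coordinates with the largest sample
    variances, then run ordinary PCA restricted to those coordinates."""
    support = np.argsort(X.var(axis=0))[-k:]
    sub = X[:, support]
    cov = sub.T @ sub / len(sub)
    w, V = np.linalg.eigh(cov)          # eigenvalues in ascending order
    v_hat = np.zeros(X.shape[1])
    v_hat[support] = V[:, -1]           # leading eigenvector on the support
    return v_hat

v_hat = diagonal_cutoff_pca(X, k)
overlap = abs(v_hat @ v)                # near 1 when recovery succeeds
```

The cut-off step reduces the eigenvalue problem from d dimensions to k, which is what makes the method cheap; the SDP relaxation studied in the thesis trades more computation for recovery at smaller sample sizes.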

The Neoproterozoic (~750-635 Ma) Kingston Peak Formation, southeastern California, is a coarse-grained siliciclastic interval, with laterally extensive carbonate marker horizons, deposited in extensional basins between two regionally extensive carbonate intervals.

Thirty measured sections and two new geologic maps show a wedge-shaped geometry unique to extensional settings and clarify the conformable relationship between the coarse-grained deposits and the overlying Noonday Dolomite. Carbonate intervals were sampled extensively to determine the value of chemostratigraphic correlation in this interval. A newly mapped regional unconformity near the base of the formation separates the overlying tectonic sequence of the Kingston Peak Formation from the underlying deposits related to the platformal Beck Spring Dolomite. A glacigenic influence is inferred from the presence of striated clasts in one of several basins, facilitating global correlation with similar coarse-grained deposits thought to record the Earth's most severe ice age.

The Kingston Peak Formation provides a rare example of an ancient glacial succession in which the relationship between the sedimentary packaging in vertical and lateral dimensions is apparent in outcrop. This allows the influence of the series of tectonic and climatic events on stratigraphic development to be reconstructed without relying on regional or global correlation. These relations show the progressive development of extensional basins from northwest to southeast in the Death Valley region. The exceptional exposure in this region reveals bounding synsedimentary faults, allowing the tectonic and climatic influence on coarse-grained facies to be resolved, as well as the lateral persistence and stacking of coarse-grained units. Through-going carbonate marker beds recording regional sea-level rise provide timelines that allow the reconstruction and relative timing of climate and rifting events. These relations indicate that the Kingston Peak Formation records a complicated regional history in which the records of rifting and climate are intimately related through fault subsidence and the creation of accommodation space.

The availability of accommodation space from tectonism biases the sedimentary record of climate change. Glacial deposits are not necessarily uniquely timed with glacial conditions, but with preservational conditions. This interplay between tectonism and related coarse-grained deposits obscures both the timing and extent of similar coarse-grained deposits related to glaciation.

- 4 supplemental PDFs