
## Scholarly Works (141 results)

ID3's information gain heuristic is well known to be biased towards multi-valued attributes, and this bias is only partially compensated by the gain ratio used in C4.5. Several other alternatives have been proposed and are examined here (distance, orthogonality, a Beta function, and two chi-squared tests). Gain ratio and orthogonality are strongly correlated, and all of these metrics are biased towards splits with one or more small expected values, under circumstances where the split likely occurred by chance. Both classical and Bayesian statistics lead to the multiple hypergeometric distribution as the exact posterior probability of the null hypothesis. Both gain and the chi-squared tests are shown to arise in asymptotic approximations to the hypergeometric, revealing similar criteria for admissibility and showing the nature of their biases. Previous failures to find admissible stopping rules in CART and ID3 are traced to coupling these biased approximations with one another or with arbitrary thresholds, problems which are overcome by the hypergeometric. Empirical results show that hypergeometric pre-pruning should be done, as trees pruned in this way are more practical: simpler, more efficient, and generally no less accurate than unpruned or post-pruned trees.
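As a quick illustration of the quantities this abstract compares, here is a minimal sketch of information gain and C4.5's gain ratio for a candidate split. The function names and the toy data are illustrative, not from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_and_ratio(labels, partition):
    """Information gain and C4.5 gain ratio for a candidate split.

    `partition` is a list of label lists, one per branch of the split.
    """
    n = len(labels)
    remainder = sum(len(p) / n * entropy(p) for p in partition)
    gain = entropy(labels) - remainder
    # Split information is the entropy of the branch sizes themselves;
    # dividing by it is C4.5's (partial) correction for the bias towards
    # many-valued splits that the abstract discusses.
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partition)
    return gain, gain / split_info if split_info > 0 else 0.0

# A balanced two-way split of 4 positive and 4 negative examples:
labels = ["+"] * 4 + ["-"] * 4
two_way = [["+", "+", "+", "-"], ["+", "-", "-", "-"]]
g, r = gain_and_ratio(labels, two_way)
```

For a balanced two-way split the split information is exactly 1 bit, so gain and gain ratio coincide; the bias appears only once branches become numerous or unbalanced.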

Modern technological advances have prompted massive-scale data collection in fields as varied as artificial intelligence and the traditional sciences. This has led to an increasing need for scalable machine learning algorithms and statistical methods to draw conclusions about the world. In all data-driven procedures, the data scientist faces the following fundamental questions: How should I design the learning algorithm, and how long should I run it? Which samples should I collect for training, and how many are sufficient to generalize conclusions to unseen data? These questions relate to statistical and computational properties of both the data and the algorithm. This thesis explores their role in the areas of non-convex optimization, non-parametric estimation, active learning, and multiple testing.

In the first part, we provide insights of different flavors concerning the interplay between statistical and computational properties of first-order methods applied to common estimation procedures. The expectation-maximization (EM) algorithm estimates the parameters of a latent variable model by running a first-order method on a non-convex landscape. We identify and characterize a general class of hidden Markov models for which linear convergence of EM to a statistically optimal point is provable for a large initialization radius. For non-parametric estimation problems, functional gradient descent algorithms (also called boosting) are used to estimate the best fit in infinite-dimensional function spaces. We develop a new proof technique showing that stopping the algorithm early may also yield an optimal estimator without explicit regularization. In fact, the same key quantities (localized complexities) underlie both traditional penalty-based and algorithmic regularization.
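The early-stopping idea can be sketched in a few lines: run functional gradient descent over an RBF kernel class and stop when held-out error stops improving. Everything here (the kernel bandwidth, step size, and the simple stop-on-first-rise rule) is an illustrative assumption, not the thesis's actual data-dependent stopping rule:

```python
import math

def rbf(a, b, h=0.5):
    """Gaussian (RBF) kernel on the real line; the bandwidth h is an assumption."""
    return math.exp(-((a - b) ** 2) / (2 * h * h))

def kernel_boost(x, y, x_val, y_val, step=0.5, max_iters=200):
    """Functional gradient descent ("boosting") with validation-based early stopping."""
    # Represent the current fit as f(.) = sum_j alpha[j] * rbf(., x[j]).
    alpha = [0.0] * len(x)
    f = lambda t: sum(a * rbf(t, xj) for a, xj in zip(alpha, x))
    best_err, best_alpha = float("inf"), list(alpha)
    for _ in range(max_iters):
        # Functional gradient step: move each coefficient along its residual.
        resid = [yi - f(xi) for xi, yi in zip(x, y)]
        for j, r in enumerate(resid):
            alpha[j] += step * r / len(x)
        val_err = sum((yv - f(xv)) ** 2 for xv, yv in zip(x_val, y_val)) / len(y_val)
        if val_err < best_err:
            best_err, best_alpha = val_err, list(alpha)
        else:
            break  # held-out error rose: stop early, keep the best fit so far
    return best_alpha, best_err

# Toy usage: fit sin(3x) from 10 training points, validate on shifted points.
xs = [i / 10 for i in range(10)]
ys = [math.sin(3 * xi) for xi in xs]
xv = [i / 10 + 0.05 for i in range(10)]
yv = [math.sin(3 * xi) for xi in xv]
alpha, err = kernel_boost(xs, ys, xv, yv)
```

The point of the thesis's result is that the *optimal* stopping time can be characterized through localized complexities of the function class, rather than chosen by held-out data as in this sketch.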

In the second part of the thesis, we explore how data collected adaptively, with constantly updated estimates, can lead to significant reductions in sample complexity for multiple hypothesis testing problems. In particular, we show how adaptive strategies can be used to simultaneously control the false discovery rate over multiple tests and return the best alternative (among many) for each test with optimal sample complexity, in an online manner.
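For intuition about what "controlling the false discovery rate over multiple tests" means, here is the classical offline Benjamini-Hochberg procedure. The thesis concerns online and adaptive variants; this offline rule is shown only as the baseline concept:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Classical (offline) Benjamini-Hochberg FDR procedure.

    Sort the m p-values; find the largest rank k with p_(k) <= q*k/m;
    reject the hypotheses with the k smallest p-values. Returns the set
    of rejected indices (into the original `pvals` list).
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            k = rank
    return {order[i] for i in range(k)}

# Only the two smallest p-values survive the step-up thresholds here:
rejected = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.74])
```

Note the step-up character: a p-value of 0.039 is not rejected on its own, because its rank-3 threshold is 0.05 * 3/7 ≈ 0.021.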

Various factors affecting decision tree learning time and accuracy are explored. The factors which consistently affect accuracy are those which directly or indirectly (as in the handling of continuous attributes) allow a greater number and variety of potential trees to be explored. Other factors, such as pruning and the choice of heuristic, generally have little effect on accuracy but significantly affect learning time. We prove that the time complexity of induction and post-processing is exponential in tree height in the worst case and, under fairly general conditions, in the average case. This puts a premium on designs which tend to produce shallower trees (e.g., multi-way rather than binary splits, and heuristics which prefer more balanced splits). Simple pruning is linear in tree height, in contrast to the exponential growth of more complex operations. The key factor influencing whether simple pruning will suffice is that the split selection and pruning heuristics should be the same and unbiased. The information gain and χ² heuristics are biased towards unbalanced splits, and neither is an admissible test for pruning. Empirical results show that the hypergeometric distribution can be used for both split selection and pruning, and that the resulting trees are simpler, more quickly learned, and no less accurate than trees resulting from other heuristics and more complex post-processing.
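The hypergeometric test advocated in these abstracts can be written down directly for a binary split of a two-class sample, where it is the familiar one-sided Fisher exact test. This is a sketch of that special case only; the multiple hypergeometric distribution generalizes it to multi-way, multi-class splits:

```python
from math import comb

def fisher_exact_split(a, b, c, d):
    """One-sided Fisher exact test for a binary split of a two-class sample.

    The 2x2 table is [[a, b], [c, d]] (rows = branches, columns = classes).
    Under the null hypothesis of an uninformative split, with the margins
    fixed, the table follows the hypergeometric distribution; the p-value
    sums the probabilities of tables at least as extreme as the observed one.
    """
    n = a + b + c + d
    row1, col1 = a + b, a + c
    def prob(x):
        # P(exactly x class-1 examples land in branch 1)
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)
    return sum(prob(x) for x in range(a, min(row1, col1) + 1))

# A split sending 8 of 10 positives one way and 8 of 10 negatives the other:
p = fisher_exact_split(8, 2, 2, 8)
```

A small p-value says the observed class separation is unlikely to have arisen by chance, so the split is admissible; using the same quantity for pruning is exactly the "same and unbiased heuristic" recommendation above.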

Advances in data acquisition and the emergence of new sources of data have, in recent years, led to the generation of massive datasets in many fields of science and engineering. These datasets are usually characterized by high dimension and a comparatively small number of samples. Without appropriate modifications, classical tools of statistical analysis are not directly applicable in these "high-dimensional" settings, and much of the effort of contemporary research in statistics and related fields is to extend inference procedures, methodologies, and theories to these new datasets. One widely used assumption which can mitigate the effects of dimensionality is sparsity of the underlying parameters. In the first half of this thesis we consider principal component analysis (PCA), a classical dimension reduction procedure, in the high-dimensional setting with "hard" sparsity constraints. We analyze the statistical performance of two modified procedures for PCA: a simple diagonal cut-off method and a more elaborate semidefinite programming (SDP) relaxation. Our results characterize the statistical complexity of the two methods in terms of the number of samples required for asymptotic recovery, and they reveal a trade-off between statistical and computational complexity.

In the second half of the thesis, we consider PCA in function spaces (fPCA), an infinite-dimensional analog of PCA also known as the Karhunen-Loève transform. We introduce a function-theoretic framework to study the effects of sampling in fPCA under smoothness constraints on the functions. The framework generates high-dimensional models with a different type of structural assumption, an "ellipsoid" condition, which can be thought of as a soft sparsity constraint. We propose an M-estimator for principal component subspaces which takes the form of a regularized eigenvalue problem, provide rates of convergence for it, and show minimax optimality. Along the way, some problems in approximation theory are also discussed.
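The "simple diagonal cut-off method" mentioned above is easy to sketch: keep only the coordinates whose sample variance (the diagonal of the covariance matrix) exceeds a threshold, then run ordinary PCA on the retained submatrix. The threshold and the power-iteration solver below are illustrative choices, not the thesis's analysis:

```python
import math

def power_iteration(mat, iters=500):
    """Leading eigenvector of a small symmetric PSD matrix by power iteration."""
    d = len(mat)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        v = [wi / norm for wi in w]
    return v

def diagonal_cutoff_pca(cov, threshold):
    """Diagonal cut-off sparse PCA sketch.

    Keep coordinates whose variance (diagonal entry of `cov`) exceeds
    `threshold`, take the leading eigenvector of the retained submatrix,
    and embed it back as a sparse direction in the full space.
    """
    keep = [i for i in range(len(cov)) if cov[i][i] > threshold]
    sub = [[cov[i][j] for j in keep] for i in keep]
    v_sub = power_iteration(sub)
    v = [0.0] * len(cov)
    for idx, i in enumerate(keep):
        v[i] = v_sub[idx]
    return v

# Spiked covariance with a 2-sparse leading direction on coordinates 0 and 1:
cov = [[2.0, 1.0, 0.0, 0.0],
       [1.0, 2.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 0.0],
       [0.0, 0.0, 0.0, 1.0]]
direction = diagonal_cutoff_pca(cov, threshold=1.5)
```

The SDP relaxation analyzed in the thesis succeeds with fewer samples than this diagonal method, at greater computational cost, which is precisely the statistical-computational trade-off the abstract refers to.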

Central to many statistical inference problems is the computation of quantities defined over variables that can be fruitfully modeled in terms of graphs. Examples of such quantities include marginal distributions over graphical models and empirical averages of observations over sensor networks. For practical purposes, distributed message-passing algorithms are well suited to such problems: the computation is broken into pieces and distributed among different nodes, and following some local computations, the intermediate results are shared among neighboring nodes via so-called messages. The process is repeated until the desired quantity is obtained. These distributed inference algorithms have two primary aspects: statistical properties, which characterize how mathematically sound an algorithm is, and computational complexity, which describes the efficiency of a particular algorithm. In this thesis, we propose low-complexity (efficient) message-passing algorithms as alternative solutions to some well-known inference problems, together with rigorous mathematical analysis of their performance. These problems include the computation of marginal distributions via belief propagation for discrete as well as continuous random variables, and the computation of the average of distributed observations in a noisy sensor network via gossip-type algorithms.
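The gossip-type averaging mentioned above has a very compact core: repeatedly pick a random pair of nodes and replace both of their values with the pair's average, which preserves the global sum while shrinking the spread. This sketch uses a complete network and noiseless exchanges; the thesis additionally treats restricted graph topologies and noisy links:

```python
import random

def gossip_average(values, rounds=2000, seed=0):
    """Randomized pairwise gossip averaging.

    At each step two randomly chosen nodes average their current values.
    The sum is invariant, so every node's value converges to the global mean.
    """
    rng = random.Random(seed)
    x = list(values)
    n = len(x)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)
        avg = (x[i] + x[j]) / 2
        x[i] = x[j] = avg
    return x

# Four sensors holding 1, 5, 9, 13 all converge to the mean, 7:
vals = gossip_average([1.0, 5.0, 9.0, 13.0])
```

No node ever needs to know the network size or communicate beyond one neighbor at a time, which is what makes the scheme attractive for sensor networks.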

Modern science and engineering often generate data sets with a large sample size and a comparably large dimension, which puts classic asymptotic theory into question in many ways. The main focus of this thesis is therefore to develop a fundamental understanding of statistical procedures for estimation and hypothesis testing from a non-asymptotic point of view, where the sample size and the problem dimension grow hand in hand. A range of problems is explored, including work on the geometry of hypothesis testing, adaptivity to local structure in estimation, effective methods for shape-constrained problems, and early stopping with boosting algorithms.

Our treatment of these different problems shares the common theme of emphasizing the underlying geometric structure.

To be more specific, in our hypothesis testing problem the null and alternative are specified by a pair of convex cones. This cone structure makes possible a sharp characterization of the behavior of the generalized likelihood ratio test (GLRT) and of its optimality properties. The problem of estimating a planar set from noisy measurements of its support function is non-parametric in nature; interestingly, estimators can be constructed which are more efficient when the underlying set has a simpler structure, even without knowing the set beforehand. Moreover, when we apply boosting algorithms to estimate a function in a reproducing kernel Hilbert space (RKHS), the optimal stopping rule and the resulting estimator turn out to be determined by the localized complexity of the space.
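For the simplest cone pair, testing θ = 0 against θ in the nonnegative orthant under Gaussian noise, the GLRT statistic reduces to the squared norm of the observation projected onto the cone, and projection onto the orthant is just coordinate-wise clipping. This special case is my illustrative choice, not an example taken from the thesis:

```python
def glrt_orthant(y):
    """GLRT statistic for H0: theta = 0 vs H1: theta in the nonnegative
    orthant, given y ~ N(theta, I).

    For a closed convex cone C, the GLRT statistic is
        ||y||^2 - min_{theta in C} ||y - theta||^2 = ||Pi_C(y)||^2,
    and for the orthant the projection Pi_C simply zeroes out the
    negative coordinates of y.
    """
    proj = [max(yi, 0.0) for yi in y]
    return sum(p * p for p in proj)
```

The geometry of the cone (here, the orthant) fully determines the statistic's behavior, which is the sense in which the thesis's sharp characterization is geometric.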

These results demonstrate that, on the one hand, one can benefit from respecting and making use of the underlying structure (optimal early stopping rules for different RKHSs); on the other hand, some procedures (such as the GLRT or local smoothing estimators) can achieve better performance when the underlying structure is simpler, without prior knowledge of the structure itself.

To evaluate the behavior of any statistical procedure, we follow the classic minimax framework and also discuss the more refined notion of local minimaxity.