Algorithmic Approaches to Statistical Questions
We live in a probabilistic world---a world full of distributions from which we sample. Learning, evolution, and much of science rely on samples furnished by nature. This prompts the basic question: Given a sample from some unknown distribution, what can one infer? In even the simplest settings, our understanding of this question is startlingly poor. While this question is traditionally viewed as lying within the province of statistics and information theory, at its core it is an algorithmic question. The increasing size of our datasets---and perhaps more importantly, the increasing complexity of the underlying distributions that we hope to understand---are exposing issues that seem to demand computational consideration.
In this dissertation, we apply the computational perspective to three basic statistical questions which underlie and abstract several of the challenges encountered in the analysis of today's large datasets.
Estimating Statistical Properties:
Given a sample drawn from an unknown distribution, and a specific statistical property of the distribution that we hope to estimate, how should one compute that estimate, and what sample size is necessary to guarantee that with high probability, the computed estimate is accurate? We focus on a large and natural class of properties, which includes the number of distinct elements, entropy, and distance metrics between pairs of distributions, including total variational distance (also known as statistical distance or L1 distance). Such properties are easy to estimate if the sample size is large in comparison to the size or complexity of the underlying distribution, but what can be done given relatively few samples? Our results can be interpreted via the following three concrete problems, each defined with respect to an arbitrarily small accuracy parameter d > 0:
Distinct Elements: Given access to n buckets, each of which contains one object that is not necessarily distinct from those in the other buckets, how many buckets must one inspect in order to estimate the total number of distinct objects to +/- dn, with high probability?
Entropy Estimation: Given access to a sample obtained by taking independent draws from a distribution, p, of support size at most n, how large does the sample need to be to estimate the entropy of the distribution, H(p), to within +/-d, with high probability?
Distance: Given access to two samples obtained by taking independent draws from two distributions, p, q, of support size at most n, how large do the samples need to be to estimate the total variational distance between the distributions (also referred to as L1 distance or "statistical distance"), to within +/- d, with high probability?
We show that sublinear sample estimation is possible: for any constant d > 0, the above estimation tasks can be accomplished using samples of size O(n/log n), with probability of success 1-o(1). Additionally, we prove some results about the algorithmic structure of optimal estimators, and provide experimental evidence suggesting that our estimators perform very well in practice. Complementing these positive results, we prove matching information theoretic lower bounds, establishing the sample complexity of these tasks up to constant factors. Previously, no explicit sublinear sample estimators had been described for any of these tasks.
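To make the sample-size barrier concrete, the following Python sketch implements the naive "plug-in" estimator for entropy---compute the empirical distribution of the sample and take its entropy. (This is an illustration of the baseline, not the estimator developed in this dissertation; the parameters are chosen for the example.) With far fewer samples than the support size, the plug-in estimate cannot come close to the true entropy: the empirical distribution is supported on at most as many symbols as there are samples.

```python
import math
import random
from collections import Counter

def plugin_entropy(sample):
    """Entropy (in bits) of the empirical distribution of the sample."""
    counts = Counter(sample)
    m = len(sample)
    return -sum((c / m) * math.log2(c / m) for c in counts.values())

# Uniform distribution over n = 10,000 symbols: true entropy is log2(n) ~ 13.3 bits.
random.seed(0)
n = 10_000
sample = [random.randrange(n) for _ in range(100)]  # only 100 draws

# The empirical distribution has support at most 100, so the plug-in
# estimate is capped at log2(100) ~ 6.6 bits -- less than half the true
# entropy -- no matter how the draws come out.
print(plugin_entropy(sample))
```

Beating this barrier---estimating the entropy accurately from a sample whose size is sublinear in the support size---is exactly what the O(n/log n)-sample estimators above accomplish.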
As a component of the lower bound machinery, we prove two new multivariate central limit theorems: one for sums of independent (though not necessarily identical) multivariate random variables in the Wasserstein metric, and the second for "generalized multinomial distributions" (a class of distributions generalizing binomial, multinomial, and sums of such distributions) in terms of the stringent L1 distance metric. We suspect these limit theorems may have broader applications beyond the property estimation setting.
Finding Correlations and Identifying Relevant Variables:
Perhaps the most basic type of structure that can be present in a dataset is correlation. How much computation is required to find correlated variables? One can certainly brute-force search through all pairs of variables, and for each pair the correlation can be estimated very efficiently. But is there a sub-quadratic time algorithm for finding correlated variables? More generally, suppose one has a dataset in which each sample carries a label given by some function of a small number of the variables: among the n variables, there may be a small number, k = 3, 4, 5, ..., of "relevant" variables that suffice to predict the labels. Such a function is termed a "k-junta". How quickly can one find this set of k relevant variables? As above, one could simply perform a brute-force search over all subsets of size k, taking time roughly O(n^k). Can one find the set of relevant variables significantly more efficiently?
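The quadratic-time baseline is simple to write down. The following Python sketch (an illustration of the brute-force approach, not the sub-quadratic algorithm of this dissertation; the instance parameters are chosen for the example) scans all pairs of Boolean variables and returns the pair whose values agree most often:

```python
import random
from itertools import combinations

def most_correlated_pair(data):
    """Brute-force scan over all O(n^2) pairs of columns (variables),
    returning the pair whose entries agree on the largest fraction of rows."""
    m, n = len(data), len(data[0])
    def agreement(i, j):
        return sum(row[i] == row[j] for row in data) / m
    return max(combinations(range(n), 2), key=lambda p: agreement(*p))

# Toy instance: 20 uniformly random Boolean variables over 300 samples,
# with variable 13 planted as an exact copy of variable 4.
random.seed(1)
m, n = 300, 20
data = [[random.randrange(2) for _ in range(n)] for _ in range(m)]
for row in data:
    row[13] = row[4]  # the planted correlated pair

print(most_correlated_pair(data))  # recovers the planted pair
```

The scan takes time proportional to m * n^2; the question addressed below is whether the n^2 factor is necessary.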
We show that a planted pair of r-correlated variables can be found in a set of n otherwise uniformly random Boolean variables in time n^1.6 poly(1/r). This improves upon the n^(2-O(r)) runtime given by locality sensitive hashing and related approaches. Extensions of this algorithm yield significantly improved algorithms for several important problems, including multiplying matrices whose product is guaranteed to be sparse, learning k-juntas, learning sparse parity with noise, and computing the approximate closest pair of points, in both Euclidean and Boolean settings.
Learning Mixtures of Gaussians:
A sample from a mixture model (with, for example, two components) is generated via the following process: for each data point, with some probability, w, the point is drawn from one distribution, p, and with the remaining probability, 1-w, the point is drawn from a second distribution q. Supposing one is given a large sample from such a mixture of distributions, can one efficiently deduce the components of the mixture? Can one accurately cluster the sample points according to the distribution from which they originated? In the special case in which each component is a Gaussian distribution, this is the problem of learning a "Gaussian mixture model", and is, perhaps, the most natural (and practically relevant) starting point for tackling the question of recovering mixtures of more general families of distributions. We obtain a basic handle on the sample and computational complexity of this problem, and describe an algorithm which, given a sample from a GMM with any constant number of components, provably returns accurate estimates of the components, with runtime and sample size polynomial in the relevant parameters---the dimension of the space, and the inverse of the desired accuracy of the recovered components. Previously, no such algorithm was known, even in the special case of univariate mixtures with just two components.
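The generative process just described is easy to state in code. This Python sketch draws from a univariate two-component Gaussian mixture (the parameter values are illustrative, not taken from the dissertation):

```python
import random

def sample_gmm(w, mu1, sigma1, mu2, sigma2, m, rng):
    """Draw m points from the mixture w*N(mu1, sigma1^2) + (1-w)*N(mu2, sigma2^2):
    for each point, flip a w-biased coin to choose a component, then sample it."""
    points = []
    for _ in range(m):
        if rng.random() < w:
            points.append(rng.gauss(mu1, sigma1))
        else:
            points.append(rng.gauss(mu2, sigma2))
    return points

rng = random.Random(0)
# Two well-separated, equally weighted components.
pts = sample_gmm(0.5, 0.0, 1.0, 10.0, 1.0, 10_000, rng)
mean = sum(pts) / len(pts)  # close to w*mu1 + (1-w)*mu2 = 5.0
```

The learning problem runs this process in reverse: given only the points (with no component labels), recover the weight, means, and variances.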
The questions considered in this dissertation are not new: the question of efficiently finding correlations was introduced to the computer science community over 25 years ago; the effort to describe accurate estimators for entropy and the other properties that we consider originated in the statistics community nearly 75 years ago and also received significant attention from the information theory community; the question of recovering Gaussian mixture models was originally posed by Karl Pearson in the 1890s. The progress on these questions that we describe in this dissertation stems from the observation that these statistical questions are intrinsically algorithmic, and hence may be amenable to the tools, techniques, and perspectives of theoretical computer science.