## Competitive tests and estimators for properties of distributions

- Author(s): Das, Hirakendu
- et al.

## Abstract

We derive competitive tests and estimators for several properties of discrete distributions, based on their i.i.d. sequences. We focus on symmetric properties that depend only on the multiset of probability values in the distributions and not on specific symbols of the alphabet that assume these values. Many applications of probability estimation, statistics and machine learning involve such properties. Our method of probability estimation, called profile maximum likelihood (PML), involves maximizing the likelihood of observing the profile of the given sequences, i.e., the multiset of symbol counts in the sequences. It has been used successfully for universal compression of large alphabet data sources, and has been shown empirically to perform well for other probability estimation problems like classification and distribution multiset estimation. We provide competitive estimation guarantees for the PML method for several such problems. For testing closeness of distributions, i.e., finding whether two given i.i.d. sequences of length n are generated by the same distribution or by two different ones, our schemes have an error probability of at most sqrt(delta) * exp(7n̂(2/3)) whenever the best possible error probability is delta <= exp(-14n̂(2/3)). The running time of our scheme is O(n). We do not make any assumptions on the distributions, including on their support size. In terms of sample complexity, this implies that if there is a closeness test which takes sequences of length n and has error probability at most delta, our tests have the same error guarantee on sequences of length n' = O(\max{n̂3/ loĝ3(1/4 delta),n}). Similar results are implied for the related problem of classification. For finding the probability multiset of a discrete distribution using a length-n i.i.d. sequence drawn from it, we show the following guarantee for the PML-based estimator. For any class of distributions and any distance metric on their probability multisets, if there is an estimator that approximates all distributions in this class to within a distance of epsilon > 0 with error probability at most delta <= exp(-6n̂(1/2)), then the PML estimator is within a distance of 2 * epsilon with error probability at most delta * exp(6n̂(1/2)). Equivalently, the PML estimator approximates distributions to within a distance of 2 * epsilon with error probability delta using sequences of length n' = O(\max{n̂2/loĝ2(1/4 delta),n}). Thus, this estimator is competitive with other estimators, including the one by Valiant et al. that approximates distributions of superlinear support size k = O(epsilon̂(2.1) * n * log(n)) to within a relative earthmover distance of epsilon and whose error probability can be shown to be at most exp(-n̂(0.9)). However, unlike the case of closeness testing, we do not yet have efficient schemes for computing the PML distribution. We extend the results for PML for distribution multiset estimation to two related problems of estimating the parameter multiset of multiple distributions or processes. These include the problems of estimating the multiset of success probabilities of Bernoulli processes, and the multiset of means of Poisson distributions