Motivated by diverse applications in ecology, genetics, and language modeling, researchers in machine learning, computer science, and information theory have recently studied several fundamental statistical questions in the large-domain regime, where the domain size is large relative to the number of samples.
We study three such basic problems, each with a rich history and wide applications. In the course of analyzing them, we also provide provable guarantees for several existing practical estimators and propose new estimators with better guarantees.
Competitive distribution estimation and classification:
Existing theory does not explain why absolute-discounting, Good-Turing, and related estimators outperform the asymptotically minimax-optimal estimators in practice. We explain their performance by showing that a variant of the Good-Turing estimator performs near-optimally for all distributions. Specifically, for distributions over $k$ symbols and $n$ samples, we show that a simple variant of the Good-Turing estimator is always within KL divergence $(3+o_n(1))/n^{1/3}$ of a genie-aided estimator that knows the underlying distribution up to a permutation, and that a more involved estimator is within $\tilde{\mathcal{O}}_n(\min(k/n, 1/\sqrt{n}))$. We extend these results to classification, where the goal is to classify a test sample based on two training samples of length $n$ each.
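As a concrete reference point, the following is a minimal sketch of the classical Good-Turing probability assignment, assuming the sample is given as a plain Python sequence. It is the textbook estimator rather than the smoothed variant analyzed in this thesis, and it leaves the usual renormalization (the per-symbol estimates need not sum to one) to the reader.
\begin{verbatim}
from collections import Counter

def good_turing(sample):
    """Classical Good-Turing probability assignment (a sketch)."""
    n = len(sample)
    counts = Counter(sample)                  # symbol -> multiplicity r
    freq_of_freqs = Counter(counts.values())  # r -> N_r, the number of
                                              # symbols seen exactly r times
    probs = {}
    for symbol, r in counts.items():
        n_r1 = freq_of_freqs.get(r + 1, 0)
        # Good-Turing: the total mass of symbols seen r times is roughly
        # (r+1) * N_{r+1} / n; split it evenly among those N_r symbols.
        # Fall back to the empirical estimate r/n when N_{r+1} = 0.
        if n_r1 > 0:
            probs[symbol] = (r + 1) * n_r1 / (n * freq_of_freqs[r])
        else:
            probs[symbol] = r / n
    missing_mass = freq_of_freqs.get(1, 0) / n  # mass of unseen symbols
    return probs, missing_mass
\end{verbatim}
For example, \texttt{good\_turing(list("abracadabra"))} reserves $N_1/n = 2/11$ of the probability mass for symbols never observed in the sample.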
Estimating the number of unseen species:
We study species estimation: given $n$ independent samples from an unknown species distribution, we would like to estimate the number of new species that will be observed among the next $t \cdot n$ samples. Existing algorithms provide guarantees only for the prediction range $t \leq 1$. We significantly extend the range of predictability and prove that a class of estimators, including the Efron-Thisted estimator, accurately predicts the number of unseen species for $t \propto \log n$. Conversely, we show that no estimator is accurate for $t = \Omega(\log n)$; thus, up to constant factors, $t \propto \log n$ is the furthest one can predict.
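For illustration, the following is a sketch of an Efron-Thisted style smoothed Good-Toulmin estimator. The raw Good-Toulmin series $-\sum_{i \geq 1} (-t)^i \Phi_i$, where $\Phi_i$ is the number of species observed exactly $i$ times, diverges for $t > 1$, so each term is damped by a binomial tail probability; the default truncation level and smoothing parameter below are heuristic stand-ins for the tuned choices analyzed here.
\begin{verbatim}
import math
from collections import Counter

def unseen_species(sample, t, trunc=None):
    """Smoothed Good-Toulmin estimate of the number of new species
    appearing among the next t*n samples (a sketch)."""
    n = len(sample)
    phi = Counter(Counter(sample).values())    # i -> Phi_i
    if trunc is None:
        trunc = max(1, round(math.log(n)))     # heuristic truncation level
    q = 1.0 / (1.0 + t)

    def tail(i):
        # P(Bin(trunc, q) >= i), the smoothing weight for term i
        return sum(math.comb(trunc, j) * q**j * (1 - q)**(trunc - j)
                   for j in range(i, trunc + 1))

    # Damped alternating series; terms with i > trunc get weight 0.
    return -sum((-t)**i * phi_i * tail(i) for i, phi_i in phi.items())
\end{verbatim}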
Learning Gaussian mixtures:
We derive the first sample-efficient polynomial-time estimator for high-dimensional spherical Gaussian mixtures. It estimates mixtures of $k$ spherical Gaussians in $d$ dimensions to within $\ell_1$ distance $\epsilon$ using $\mathcal{O}(dk^9 \log^2 d/\epsilon^4)$ samples and $\mathcal{O}_{k,\epsilon}(d^3\log^5 d)$ computation time. Conversely, we show that any estimator requires $\Omega(dk/\epsilon^2)$ samples; hence the algorithm's sample complexity is nearly optimal in the dimension. We also construct a simple estimator for one-dimensional Gaussian mixtures that uses $\tilde{\mathcal{O}}(k/\epsilon^2)$ samples and $\tilde{\mathcal{O}}((k/\epsilon)^{3k+1})$ computation time.
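For comparison, the following is a minimal sketch of the standard EM baseline for the model being estimated (mixing weights $w_j$, means $\mu_j$, and spherical covariances $\sigma_j^2 I$). It is not the moment-based polynomial-time estimator developed in this thesis, and the random initialization and fixed iteration count are arbitrary choices.
\begin{verbatim}
import numpy as np

def spherical_gmm_em(X, k, iters=100, seed=0):
    """Plain EM for a mixture of k spherical Gaussians (a sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, size=k, replace=False)]   # random init from data
    sigma2 = np.full(k, X.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibilities r[i, j] prop. to
        # w_j * N(x_i; mu_j, sigma_j^2 I), computed in log space.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        logp = (np.log(w) - 0.5 * d * np.log(2 * np.pi * sigma2)
                - 0.5 * sq / sigma2)
        logp -= logp.max(axis=1, keepdims=True)    # numerical stability
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: reweighted mixing weights, means, and variances.
        nk = r.sum(axis=0) + 1e-12
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=-1)
        sigma2 = (r * sq).sum(axis=0) / (d * nk)
    return w, mu, sigma2
\end{verbatim}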