UC San Diego
Estimation and Compression Over Large Alphabets /
- Author(s): Acharya, Jayadev
- et al.
Compression, estimation, and prediction are basic problems in Information theory, statistics and machine learning. These problems have been extensively studied in all these fields, though the primary focus in a large portion of the work has been on understanding and solving the problems in the asymptotic regime, i.e., the alphabet size is fixed and the length of observations grow. Compression of long i.i.d. sequences over a small alphabet has been studied extensively. Kieffer, Davisson, and others showed that there is no good scheme for compression of i.i.d. distributions over an infinite alphabet. With the advent of data with larger underlying alphabet size over the past decade, researchers have considered various methods/models for which efficient compression is possible. We use redundancy, the extra number of bits beyond the optimal (entropy) as the performance metric. We consider three general models to address compression of large alphabets. The first model considers sources with only a few modes. Most natural distributions over the integers consists of only a few modes. Moreover, mixture of a few simple distributions also satisfies this property. However, even the class M of all monotone distributions over N also has infinite redundancy. In spite of this, Foster, Stine and Wyner constructed encoding schemes that have diminishing redundancy for any monotone distribution with a finite entropy. We restrict our attention to Mk, the class of monotone distributions over k alphabets. The precise redundancy of this class of distributions is characterized by Shamir for the range k = O(n), i.e., for block-length at most linear in the alphabet size. We extend the characterization and in fact show that as long as the underlying alphabet size is sub-exponential in the block- length, it is possible to compress monotone distributions with diminishing per-symbol redundancy. We extend these results to distributions with a constant number of modes, whose locations are unknown. A second elegant approach proposed by Boucheron, Garivier and Gassiat considers distributions that are bounded by an envelope. They provide characterization of the redundancy of such classes and in particular, nd bounds on the redundancy of power- law and exponential envelopes. Bontemps and later Bontemps, Boucheron and Gassiat consider the class of sub- exponential envelopes and characterize its redundancy precisely. However, these methods do not work for distributions with a heavy tail, e.g., the power-law distributions. Poisson sampling is a widely used method to introduce independence among the number of symbol appearances, and thereby simplifying the analysis of many algorithms. We show that for discrete distributions, the redundancy of Poisson sampled sequences is sufficient to characterize the redundancy of fixed length sequences. Using this, we provide simple bounds on the redundancy of envelope classes. We then demonstrate the efficacy of these bounds by proving tight bounds on the redundancy of power-law classes, answering an open question of Boucheron et al. The third approach, proposed initially by Aberg, Shtarkov and Smeets, and studied extensively by Orlitsky, Santhanam, Zhang, and Shamir, consider compressing the structure (called pattern) and dictionary of the sequence separately. In particular, they show that patterns can be compressed with redundancy that grows as O(n¹/²) with the block-length n, independent of the underlying alphabet size. This problem can also be interpreted as studying the Good-Turing probability estimation problem under logarithmic loss. We develop several simple and useful tools to bound redundancy of distribution classes and use them with Poisson sampling to show that the redundancy of patterns grows as O(n¹/³)