Exploiting the structure of a probabilistic model can significantly
improve its compression efficiency and learning speed. We consider
these potential improvements under two structures that arise naturally
and pervasively in practice.
Power-Law: English words and many other natural phenomena are
well known to follow a power-law distribution. Yet this ubiquitous
structure has never been shown to help compress or predict these
phenomena. It is known that the class of unrestricted distributions
over an alphabet of size k and blocks of length n can never be
compressed with diminishing per-symbol redundancy when k > n. We
show that under power-law structure, we can compress with diminishing
per-symbol redundancy in expectation, for k as large as
sub-exponential in n.
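For concreteness, one standard (minimax, average-case) formalization of per-symbol redundancy reads as follows; the precise variant used above, in which the worst case over sources is replaced by an expectation under the power-law structure, may differ in its details:
\[
  \bar{R}_n(\mathcal{P})
    \;=\; \frac{1}{n}\,\inf_{q}\,\sup_{p \in \mathcal{P}}
      \mathbb{E}_{X^n \sim p}\!\left[\log \frac{p(X^n)}{q(X^n)}\right]
    \;=\; \frac{1}{n}\,\inf_{q}\,\sup_{p \in \mathcal{P}} D\!\left(p \,\|\, q\right),
\]
where $q$ ranges over coding distributions on blocks $X^n \in [k]^n$, and diminishing per-symbol redundancy means $\bar{R}_n(\mathcal{P}) \to 0$ as $n \to \infty$.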
For learning a power-law distribution, we rigorously explain the
efficacy of the absolute-discounting estimator using notions less
pessimistic than worst-case analysis. We show that (1) it is adaptive
to an effective dimension and (2) it is strongly
related to the Good--Turing estimator and inherits its
competitive properties.
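To fix ideas, a minimal Python sketch of the textbook forms of both estimators is given below; the discount d, the assumption of an alphabet {0, ..., k-1}, the uniform allocation of freed mass to unseen symbols, and the function names are illustrative choices, not the exact formulations analyzed here.
\begin{verbatim}
from collections import Counter

def absolute_discounting(samples, support_size, d=0.5):
    # Textbook absolute discounting: subtract a fixed discount d from every
    # observed count and spread the freed probability mass uniformly over
    # the unseen symbols.  (Illustrative sketch, not the analyzed variant.)
    n = len(samples)
    counts = Counter(samples)
    unseen = support_size - len(counts)
    freed_mass = d * len(counts) / n
    return {x: (counts[x] - d) / n if x in counts
               else (freed_mass / unseen if unseen else 0.0)
            for x in range(support_size)}

def good_turing_mass(samples):
    # Good--Turing estimate of the total probability of all symbols that
    # appeared exactly r times: M_r ~ (r + 1) * Phi_{r+1} / n, where Phi_r
    # is the number of distinct symbols appearing exactly r times.
    n = len(samples)
    phi = Counter(Counter(samples).values())
    return {r: (r + 1) * phi.get(r + 1, 0) / n
            for r in [0] + sorted(phi)}
\end{verbatim}
In particular, good_turing_mass(samples)[0] recovers the familiar missing-mass estimate Phi_1 / n for symbols never observed.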
Low-Rank: We study learning low-rank conditional probability
matrices under expected KL risk. This choice accentuates smoothing,
the careful handling of low-probability elements. We define a loss
function, derive a sample-complexity bound for its global minimizer,
and show that this bound is optimal up to logarithmic terms. We
propose an iterative algorithm that extends classical non-negative
matrix factorization to naturally incorporate additive smoothing, and
we prove that it converges to a stationary point of our loss function.
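As a rough point of reference, classical NMF under the generalized KL objective uses the Lee--Seung multiplicative updates; the sketch below combines them with naive add-alpha smoothing of the reconstructed rows. It is a stand-in under our own assumptions (random initialization, post-hoc smoothing), not the proposed algorithm, which instead builds additive smoothing into the loss.
\begin{verbatim}
import numpy as np

def kl_nmf_smoothed(V, rank, iters=200, alpha=0.01, eps=1e-12, seed=0):
    # Lee--Seung multiplicative updates for NMF under the generalized KL
    # divergence, followed by naive add-alpha smoothing and row
    # renormalization.  Illustrative stand-in, not the proposed algorithm.
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    P = W @ H + alpha                        # add-alpha (additive) smoothing
    return P / P.sum(axis=1, keepdims=True)  # rows are conditional distributions
\end{verbatim}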
Power-Law and Low-Rank: We consider learning distributions in
the presence of both low-rank and power-law structure. We study
Kneser-Ney smoothing, a successful estimator for N-gram language
models, through the lens of competitive distribution estimation. We
first establish some competitive properties for the contextual
probability estimation problem. This leads to Partial Low Rank,
a powerful generalization of Kneser-Ney that we conjecture to have
even stronger competitive properties. Empirically, it significantly
improves performance on language modeling, even matching
feed-forward neural models, and gives similar gains on the task of
predicting attack types in the Global Terrorism Database.
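For reference, interpolated Kneser-Ney for bigrams, the estimator that Partial Low Rank generalizes, can be sketched as below; the discount d, the handling of unseen contexts, and the function names are standard textbook choices rather than the exact variant evaluated here.
\begin{verbatim}
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    # Interpolated Kneser-Ney for bigrams (standard textbook form):
    # discount each bigram count by d and back off the freed mass to the
    # continuation probability of the predicted word.
    bigrams = list(zip(tokens, tokens[1:]))
    big_counts = Counter(bigrams)
    ctx_counts = Counter(u for u, _ in bigrams)
    followers = defaultdict(set)   # distinct words following each context
    histories = defaultdict(set)   # distinct contexts preceding each word
    for u, w in big_counts:
        followers[u].add(w)
        histories[w].add(u)
    num_types = max(len(big_counts), 1)   # bigram types; guard empty input

    def prob(w, u):
        c_u = ctx_counts[u]
        p_cont = len(histories[w]) / num_types
        if c_u == 0:               # unseen context: fall back to continuation
            return p_cont
        lam = d * len(followers[u]) / c_u
        return max(big_counts[(u, w)] - d, 0) / c_u + lam * p_cont

    return prob
\end{verbatim}
Calling kneser_ney_bigram(corpus_tokens) returns a function prob(w, u) estimating p(w | u) from the bigram counts of the corpus.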