Exploiting the structure of a probabilistic model can significantly
improve its compression efficiency and learning speed. We consider
these potential improvements under two structures that arise naturally
and pervasively in practice.
Power-Law: English words and many other natural phenomena are
well known to follow a power-law distribution. Yet this ubiquitous
structure has never been shown to help compress or predict these
phenomena. It is known that the class of unrestricted distributions
over an alphabet of size k and blocks of length n can never be
compressed with diminishing per-symbol redundancy when k > n. We
show that under power-law structure, we can compress with diminishing
per-symbol redundancy in expectation, for k as large as
sub-exponential in n.
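For concreteness, one standard (minimax, average-case) formalization of per-symbol redundancy reads as follows; the precise variant used above, in which the worst case over sources is replaced by an expectation under the power-law structure, may differ in its details:
\[
  \bar{R}_n(\mathcal{P})
    \;=\; \frac{1}{n}\,\inf_{q}\,\sup_{p \in \mathcal{P}}
      \mathbb{E}_{X^n \sim p}\!\left[\log \frac{p(X^n)}{q(X^n)}\right]
    \;=\; \frac{1}{n}\,\inf_{q}\,\sup_{p \in \mathcal{P}} D\!\left(p \,\|\, q\right),
\]
where $q$ ranges over coding distributions on blocks $X^n \in [k]^n$, and diminishing per-symbol redundancy means $\bar{R}_n(\mathcal{P}) \to 0$ as $n \to \infty$.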
For learning a power-law distribution, we rigorously explain the
efficacy of the absolute-discounting estimator using notions less
pessimistic than worst-case analysis. We show that (1) it is adaptive
to an effective dimension and (2) it is strongly
related to the Good--Turing estimator and inherits its
competitive properties.
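To fix ideas, a minimal Python sketch of the textbook forms of both estimators is given below; the discount d, the assumption of an alphabet {0, ..., k-1}, the uniform allocation of freed mass to unseen symbols, and the function names are illustrative choices, not the exact formulations analyzed here.
\begin{verbatim}
from collections import Counter

def absolute_discounting(samples, support_size, d=0.5):
    # Textbook absolute discounting: subtract a fixed discount d from every
    # observed count and spread the freed probability mass uniformly over
    # the unseen symbols.  (Illustrative sketch, not the analyzed variant.)
    n = len(samples)
    counts = Counter(samples)
    unseen = support_size - len(counts)
    freed_mass = d * len(counts) / n
    return {x: (counts[x] - d) / n if x in counts
               else (freed_mass / unseen if unseen else 0.0)
            for x in range(support_size)}

def good_turing_mass(samples):
    # Good--Turing estimate of the total probability of all symbols that
    # appeared exactly r times: M_r ~ (r + 1) * Phi_{r+1} / n, where Phi_r
    # is the number of distinct symbols appearing exactly r times.
    n = len(samples)
    phi = Counter(Counter(samples).values())
    return {r: (r + 1) * phi.get(r + 1, 0) / n
            for r in [0] + sorted(phi)}
\end{verbatim}
In particular, good_turing_mass(samples)[0] recovers the familiar missing-mass estimate Phi_1 / n for symbols never observed.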
Low-Rank: We study learning low-rank conditional probability
matrices under expected KL risk. This choice accentuates smoothing,
the careful handling of low-probability elements. We define a loss
function, derive a sample-complexity bound for its global minimizer,
and show that this bound is optimal up to logarithmic terms. We
propose an iterative algorithm that extends classical non-negative
matrix factorization to naturally incorporate additive smoothing, and
we prove that it converges to a stationary point of our loss function.
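As a rough point of reference, classical NMF under the generalized KL objective uses the Lee--Seung multiplicative updates; the sketch below combines them with naive add-alpha smoothing of the reconstructed rows. It is a stand-in under our own assumptions (random initialization, post-hoc smoothing), not the proposed algorithm, which instead builds additive smoothing into the loss.
\begin{verbatim}
import numpy as np

def kl_nmf_smoothed(V, rank, iters=200, alpha=0.01, eps=1e-12, seed=0):
    # Lee--Seung multiplicative updates for NMF under the generalized KL
    # divergence, followed by naive add-alpha smoothing and row
    # renormalization.  Illustrative stand-in, not the proposed algorithm.
    m, n = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(iters):
        WH = W @ H + eps
        H *= (W.T @ (V / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((V / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    P = W @ H + alpha                        # add-alpha (additive) smoothing
    return P / P.sum(axis=1, keepdims=True)  # rows are conditional distributions
\end{verbatim}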
Power-Law and Low-Rank: We consider learning distributions in
the presence of both low-rank and power-law structure. We study
Kneser-Ney smoothing, a successful estimator for N-gram language
models, through the lens of competitive distribution estimation. We
first establish some competitive properties for the contextual
probability estimation problem. This leads to Partial Low Rank,
a powerful generalization of Kneser-Ney that we conjecture to have
even stronger competitive properties. Empirically, it significantly
improves performance on language modeling, even matching
feed-forward neural models, and gives similar gains on the task of
predicting attack types in the Global Terrorism Database.
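For reference, interpolated Kneser-Ney for bigrams, the estimator that Partial Low Rank generalizes, can be sketched as below; the discount d, the handling of unseen contexts, and the function names are standard textbook choices rather than the exact variant evaluated here.
\begin{verbatim}
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    # Interpolated Kneser-Ney for bigrams (standard textbook form):
    # discount each bigram count by d and back off the freed mass to the
    # continuation probability of the predicted word.
    bigrams = list(zip(tokens, tokens[1:]))
    big_counts = Counter(bigrams)
    ctx_counts = Counter(u for u, _ in bigrams)
    followers = defaultdict(set)   # distinct words following each context
    histories = defaultdict(set)   # distinct contexts preceding each word
    for u, w in big_counts:
        followers[u].add(w)
        histories[w].add(u)
    num_types = max(len(big_counts), 1)   # bigram types; guard empty input

    def prob(w, u):
        c_u = ctx_counts[u]
        p_cont = len(histories[w]) / num_types
        if c_u == 0:               # unseen context: fall back to continuation
            return p_cont
        lam = d * len(followers[u]) / c_u
        return max(big_counts[(u, w)] - d, 0) / c_u + lam * p_cont

    return prob
\end{verbatim}
Calling kneser_ney_bigram(corpus_tokens) returns a function prob(w, u) estimating p(w | u) from the bigram counts of the corpus.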