UC San Diego
Topic Modeling of Hierarchical Corpora
- Author(s): Kim, Do-kyum
- et al.
Modern digital libraries have grown beyond our capacity to comprehend them manually. We therefore need new tools that help us organize and browse large corpora of text without manually examining each document. To this end, machine learning researchers have developed topic models, statistical learning algorithms for automatic comprehension of large collections of text. Topic models provide both global and local views of a corpus: they discover topics that run through the corpus and describe each document as a mixture of the discovered topics. In this dissertation, I consider the topic modeling of corpora whose documents are organized in a multi-level hierarchy. My interest in this subject arose from the need to analyze two sprawling, real-world corpora from the field of computer security. The first is a collection of job postings on a crowdsourcing site, where many advertisers seek cheap human labor for different forms of Web service abuse. I view this corpus as a three-layer tree in which an interior node represents a buyer, and the children of the interior node represent the buyer's postings. The second corpus is a collection of threads from an underground Internet forum, where blackhat operatives discuss tactics for abusive forms of Internet marketing such as spamming and search engine optimization. The subforums and threads in this data set form a hierarchy five levels deep. Using these two data sets as test beds, I develop topic models that incorporate the hierarchies found in corpora. The models I consider can be viewed as special (finite-dimensional) instances of hierarchical Dirichlet processes (HDPs). For these models, I show that there exists a simple variational approximation for probabilistic inference, and I demonstrate a parallel inference algorithm that scales to corpora with deep hierarchies and large numbers of documents.
On several hierarchical corpora, I show the advantages of my topic models over other topic models that do not consider hierarchies. I also compare my variational method to existing implementations of HDPs and find that my approach is faster than Gibbs sampling and able to learn more predictive models than existing variational methods.
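To make the idea of a finite-dimensional hierarchical model concrete, the following is a minimal sketch, not the dissertation's implementation. It assumes that each node in the corpus tree draws its topic proportions from a Dirichlet distribution centered on its parent's proportions, a common way to build finite analogues of HDPs. The tree layout, the node names, the number of topics, and the single shared concentration parameter `alpha` are all hypothetical choices made for illustration.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def sample_tree_proportions(tree, parent_theta, alpha, rng, path="root", out=None):
    """Assign topic proportions to every node of a nested-dict tree.

    Assumption for this sketch: a node's proportions are drawn from
    Dirichlet(alpha * parent_theta), so children resemble their parent,
    and more closely so as alpha grows.
    """
    if out is None:
        out = {}
    # The small additive constant keeps every Gamma shape strictly positive.
    concentration = [alpha * p + 1e-6 for p in parent_theta]
    theta = sample_dirichlet(concentration, rng)
    out[path] = theta
    for name, subtree in tree.items():
        sample_tree_proportions(subtree, theta, alpha, rng, f"{path}/{name}", out)
    return out


rng = random.Random(0)
num_topics = 4  # hypothetical number of topics

# Hypothetical three-layer corpus: root -> buyers -> postings.
corpus_tree = {
    "buyer_a": {"post_1": {}, "post_2": {}},
    "buyer_b": {"post_3": {}},
}

uniform = [1.0 / num_topics] * num_topics
proportions = sample_tree_proportions(corpus_tree, uniform, alpha=10.0, rng=rng)

for path, theta in proportions.items():
    print(path, [round(p, 3) for p in theta])
```

In this sketch, postings by the same buyer tend to share similar topic proportions because they inherit from the same interior node. This is the kind of structure that a model ignoring the hierarchy cannot express.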