eScholarship
Open Access Publications from the University of California

UCLA

UCLA Electronic Theses and Dissertations

Application of Latent Dirichlet Allocation in Online Content Generation

Abstract

In this paper, I apply latent Dirichlet allocation (LDA) to cluster 100,000 health-related articles from the livestrong.com data set. I first review previous research in topic modeling and then describe how the LDA model is constructed. Instead of using simple word counts as model inputs, I apply part-of-speech (POS) tagging and a term frequency-inverse document frequency (tf-idf) transformation during data preprocessing to improve training efficiency and model interpretability. I further discuss the choice of model parameters, the evaluation of model performance, and the visualization of model outputs from a real-world point of view. Finally, I discuss two variations of conventional LDA: parallelized LDA and online LDA. In addition to the traditional perplexity measure, I discuss how to use cosine similarity and symmetric Kullback-Leibler divergence to evaluate clustering performance. I also demonstrate three examples of using LDA outputs as building blocks for more complex machine learning systems: 1) cascaded LDA for taxonomy building, 2) in-cluster similarity computation, and 3) automatic categorization.
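
The two similarity measures named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names and the three example document-topic vectors are hypothetical, and the symmetric Kullback-Leibler divergence is assumed here to be the sum of the two directed divergences (some authors use half of that sum instead).

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two document-topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence KL(p || q) in nats.

    eps is an assumed smoothing constant that avoids log(0) when a
    topic has zero probability in one of the distributions.
    """
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def symmetric_kl(p, q):
    """Symmetric KL divergence, assumed here to be KL(p||q) + KL(q||p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


# Hypothetical topic distributions over three topics for three documents.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
doc_c = [0.1, 0.1, 0.8]

print("cosine(a, b):", cosine_similarity(doc_a, doc_b))
print("cosine(a, c):", cosine_similarity(doc_a, doc_c))
print("sym. KL(a, b):", symmetric_kl(doc_a, doc_b))
print("sym. KL(a, c):", symmetric_kl(doc_a, doc_c))
```

Under both measures, documents with similar topic mixtures (doc_a and doc_b) score as closer than documents dominated by different topics (doc_a and doc_c): cosine similarity is higher and symmetric KL divergence is lower.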
