Application of Latent Dirichlet Allocation in Online Content Generation
In this paper, I apply Latent Dirichlet Allocation (LDA) to cluster 100,000 health-related articles from the livestrong.com data set. I first review previous research in topic modeling and then describe how the LDA model is constructed. Instead of using simple word counts as model inputs, I apply Part-of-Speech (POS) tagging and a Term Frequency-Inverse Document Frequency (tf-idf) transformation during data preprocessing to improve training efficiency and model interpretability. I further discuss the choice of model parameters, the evaluation of model performance, and the visualization of model outputs from a real-world point of view. Finally, I discuss two variations of conventional LDA: parallel LDA and online LDA. In addition to the traditional perplexity measure, I discuss how to use cosine similarity and symmetric Kullback-Leibler divergence to evaluate clustering performance. Three examples of using LDA outputs as building blocks for more complex machine learning systems are also demonstrated: 1) cascaded LDA for taxonomy building, 2) in-cluster similarity computation, and 3) automatic categorization.
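To make the two similarity measures named above concrete, the following is a minimal sketch, assuming each article is represented by its document-topic probability vector from a fitted LDA model. The function names, the smoothing constant `eps`, and the two example vectors are hypothetical and chosen only for illustration. The symmetric Kullback-Leibler divergence is taken here as the average of the two directed divergences, which is one common convention and may differ from the exact form used in the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)


def kl_divergence(p, q, eps=1e-12):
    """Directed KL divergence D(p || q) for discrete distributions.

    eps is an assumed smoothing constant that avoids log(0) when q
    assigns zero probability to a topic that p uses.
    """
    return sum(
        a * math.log((a + eps) / (b + eps))
        for a, b in zip(p, q)
        if a > 0
    )


def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p) (assumed convention)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Hypothetical document-topic distributions over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]

print("cosine similarity:", cosine_similarity(doc_a, doc_b))
print("symmetric KL:", symmetric_kl(doc_a, doc_b))
```

Under this sketch, two articles in the same cluster would be expected to show a cosine similarity close to 1 and a symmetric KL divergence close to 0, while articles from different clusters would show the opposite pattern.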