Open Access Publications from the University of California

Predictive and Interpretable Text Machine Learning Models with Applications in Political Science

  • Author(s): Kuang, Christine Yai
  • Advisor(s): Yu, Bin; Sekhon, Jasjeet; et al.

In this era, massive amounts of data are routinely collected and warehoused to be analyzed for scientific and industrial goals. Text data are a major constituent of these data treasure troves. However, with the steep increase in the amount and variety of accessible text data, it has become very difficult for a human to meaningfully analyze textual data without the help of automated text machine learning models. Topic models are one such method. They reduce the cost of analyzing large-scale corpora by identifying, in an unsupervised manner, the underlying thematic structure of the corpus. This thematic structure provides a coarse summary of the documents and allows researchers to quickly explore how topics connect with each other and change over time.

The success of automated topical analysis by topic models has led to another interesting area of text analysis: sentiment analysis. Sentiment analysis is the detection of opinions, feelings, and general sentiments expressed in text. It gained relevance with the rise of social media platforms, which increased the amount of sentiment-containing text data, such as Yelp reviews, Tweets, and opinion blogs. Efficient and effective sentiment analysis of such corpora can yield valuable information about political and social discourse. Hence, social scientists have become increasingly interested in identifying and measuring the relationship between topics and associated sentiments to better understand social and political cultures, attitudes, and processes.

In Part I of this thesis, we propose a statistical model of text that simultaneously detects both topic and sentiment and allows for the inclusion of document metadata. The proposed model improves upon existing topic-sentiment models in two ways: i) it assumes that topics are associated with a range of sentiments, and ii) it can use document-level covariates for improved estimation and analysis of the relationship between topics and sentiments. By applying the proposed model to two different datasets, i) a collection of political blog posts and ii) Yelp reviews, we demonstrate how detecting both topic and sentiment with the inclusion of document-level covariates can allow for more informative model summaries than current topic and topic-sentiment models provide.

Topic models are easy to use and interpret; therefore, many variants of topic models have been developed to customize them to various research applications. Evaluation of topic models is thus necessary for appropriate model selection. For this reason, in Part II of this thesis, we develop three new metrics that improve upon the existing evaluation approaches by identifying the benefits of topic-sentiment models over topic models.

Our evaluation metrics are based on three important criteria: sentiment prediction accuracy, feature stability, and computation time. Not only is it important to be able to show that one model achieves higher sentiment prediction accuracy than another, but it is also vital to ensure that the features used to generate a prediction are meaningful and stable, and that the algorithm has reasonable computational speed. We will use these three metrics to compare our proposed topic-sentiment model to topic models using a case study in which we aim to predict the partisanship and tone of political TV ads. Moreover, since these metrics are not specific to topic models, we will also provide a comparison of topic models with word2vec and Concise Comparative Summaries (CCS), which, to the best of our knowledge, has not been done before. We demonstrate that although the proposed topic-sentiment model is able to better predict sentiment than topic models, word2vec had the highest prediction accuracy, CCS identified the most stable features for prediction, and both of these models required less computation time.
