Automated Analysis of User-Generated Content on the Web
- Author(s): Rivas, Ryan
- Advisor(s): Hristidis, Vagelis
- et al.
Social media users generate large volumes of data every day. Analysis of this data is an important tool in several areas. For example, the study of users' opinions, behaviors, and topics of discussion can be of use in the field of health care. The first part of my research involves using existing tools to perform analysis of Web content. Specifically, it first compares health care provider attributes with quality measures of the insurance plans they accept. This is followed by analyses of how real estate prices and related metrics are affected by proximity to a university or hospital. Further, this research studies user behaviors and discussion topics to find differences in how various demographic groups generate content on health-related Web forums and in health-related discussions on general social media. The remainder of this dissertation shifts its focus from analysis of Web content to proposing new tools to perform similar analyses. It first proposes and evaluates natural language processing-based methods to automatically classify patient opinions in doctor reviews. This work also introduces a variant of the review classification problem where class labels can represent two opposing opinions that are not necessarily positive or negative. This is followed by an exploration of methods to effectively filter social media posts according to a user's interests. The key challenge behind this work is to determine how to use this information to maximize a trained text classification model's performance in classifying new posts. Finally, this dissertation proposes a multimodal Twitter embedding model that can leverage information from several parts of a tweet, such as text, image, and location. Such a model can have several applications for both researchers and Twitter users without the need to train a separate model for each application.