With internet use growing in most parts of the world, users generate large amounts of online text on platforms such as online social networks, product review websites, and travel blogs. The variety of content on these platforms has made them an important resource for researchers who want to gauge user activity, determine user opinions, and analyze user behavior without conducting costly and time-consuming surveys. Insight into user behavior helps us better understand users' likes and dislikes, which in turn is useful for commercial purposes such as marketing, advertising, and recommendation. Further, because online social networks have recently been instrumental in socio-political revolutions such as the Arab Spring and in awareness-generation campaigns by MoveOn.org and Avaaz.org, analysis of online data can also uncover user preferences.
The overarching goal of this Ph.D. thesis is to pose research questions, and propose solutions to them, that mostly pertain to user opinions and attributes, keeping in mind the large quantities of noise present in online textual data. The thesis illustrates that, by extracting informative textual features and using robust NLP and machine learning techniques, it is possible to extract signal efficiently from online text data and use it to better understand user behavior. The first research problem is opinion detection and sentiment analysis of users on a given topic from their self-generated tweets. The key idea is to select relevant hashtags and n-grams using an $l_1$-regularized logistic regression model for opinion detection. The second research problem is temporal opinion detection from tweets, i.e., detecting user opinions on a topic for which the conversation evolves over time. For instance, on the widely discussed topic of Obamacare (the Affordable Care Act in the U.S.), different issues became the focal points of discussion among users as the corresponding socio-political events unfolded in real time. We propose a machine-learning model that builds on seminal work from the sociological literature and rests on the premise that most opinion changes occur slowly over time. Our model successfully captures opinions over time using publicly available tweets and uncovers the key points of discussion as time progresses. In the third research problem, we use distributed representations of words in a method that determines, from user reviews, the aspects of products and services that users like and dislike. We harness the contextual similarity between words to build meta-features that capture user sentiment at a granular level. Finally, in the fourth research problem, we propose a method to detect the age of users from their publicly available tweets. Using a method based on distributed representations of words and clustering, we achieve high accuracy in age detection and simultaneously discover the topics of conversation in which users of different age groups engage.
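As a rough illustration of the feature-selection idea in the first research problem, the sketch below fits an $l_1$-regularized logistic regression by proximal gradient descent and keeps the hashtags and n-grams whose weights remain nonzero. It is not the implementation used in the thesis: the toy tweets and labels, the helper names `featurize` and `fit_l1_logreg`, and the values of `lam`, `step`, and `epochs` are all hypothetical choices made for illustration.

```python
import math


def featurize(tweet):
    """Binary features for one tweet: its tokens (hashtags included) and bigrams."""
    tokens = tweet.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return set(tokens) | set(bigrams)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Proximal operator of the l1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_l1_logreg(docs, labels, lam=0.05, step=0.5, epochs=2000):
    """Minimize mean log loss + lam * ||w||_1 by proximal gradient descent.

    The bias term is left unpenalized. All hyperparameters are illustrative.
    """
    feats = [featurize(d) for d in docs]
    weights = {f: 0.0 for fs in feats for f in fs}
    bias = 0.0
    n = len(docs)
    for _ in range(epochs):
        grad = dict.fromkeys(weights, 0.0)
        grad_bias = 0.0
        for fs, y in zip(feats, labels):
            p = sigmoid(bias + sum(weights[f] for f in fs))
            err = (p - y) / n  # gradient of the mean log loss w.r.t. the score
            grad_bias += err
            for f in fs:
                grad[f] += err
        bias -= step * grad_bias
        for f in weights:
            # Gradient step on the smooth loss, then the l1 shrinkage step.
            weights[f] = soft_threshold(weights[f] - step * grad[f], step * lam)
    return weights, bias


# Hypothetical toy data: 1 = supports the law, 0 = opposes it.
docs = [
    "love the new law #supportaca",
    "great coverage for my family #supportaca",
    "this law helps people #supportaca",
    "terrible law #repealaca",
    "premiums are too high #repealaca",
    "this law hurts people #repealaca",
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_l1_logreg(docs, labels)
selected = sorted(f for f, w in weights.items() if w != 0.0)
print("selected features:", selected)
```

The soft-thresholding step is what drives most weights to exactly zero, so the surviving features can be read directly as the selected hashtags and n-grams. On real tweets one would typically rely on an established solver and tune the regularization strength on held-out data.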