The growing popularity of social media and the large volume of user-generated data produced in the last few years have affected many aspects of individuals’ lives. The use of social media for health-related purposes, which is the focus of this thesis, has grown rapidly.
This provides researchers with a massive volume of data that, if properly mined and analyzed, can augment traditional health-related data sources such as electronic medical records. Despite advances in text analytics, analyzing this data remains challenging because of its specialized vocabulary, the challenges of data collection, and missing values.
In this thesis, we focus on two research directions: (a) analyzing the demographics of users who participate in health-related social media, along with the content they post across a wide range of sources, and highlighting specific health issues reported by users; and (b) effectively querying health-related social media or other health-related documents, a problem that generalizes to querying annotated documents.
Specifically, in our first contribution, we study the demographics of users who participate in health-related social media to identify possible links to health care disparities. Using these demographics, our second contribution analyzes the content of posts grouped by demographic segment, applying information extraction methods to extract medical concepts, identify the most distinctive terms, and measure sentiment and emotion. In our third contribution, we extend this content analysis by studying the intent of posts generated by users across different data sources. Lastly, we focus on a specific domain, electronic cigarettes, and analyze the health-related effects reported by online users.
In the second direction of this thesis, we develop a query framework that helps users efficiently explore health-related data, whether in online social media or in other medical documents, by exploiting the relationships between users in the network or between concepts inside the documents. Our solution generalizes to other domains with similar properties, such as general-purpose social networks. We refer to this problem as keyword querying on graph-annotated documents: we query documents annotated by interconnected entities, which are related to each other through association graphs. Our novel framework balances text relevance against semantic relevance captured through the graph.