Social media has changed the ways in which people access information, form their opinions, and act in real life. There is therefore an urgent need for information retrieval systems that turn large-scale unstructured user data into structured knowledge. Traditional text summarization techniques and co-occurrence-based topic models, however, cannot capture the complex social dynamics that drive individual and group behaviors. On the other hand, well-known models of narratives and legends that have proven effective in capturing story dynamics lack a scalable machine learning formulation. The main goal of this dissertation is to develop computational and statistical tools that can efficiently and accurately extract multi-scale narrative structures from large-scale social media datasets.
In particular, narratives are modeled as "Story Narratives Networks" composed of nodes that represent primary actants, which interact via a sequence of actions that define the links in the network. One contribution of this dissertation is a method for determining distinct actant groups in an unsupervised manner from contextual unstructured data. Each such group consists of actors that have the same contextual role in the narrative. To cluster actors, we construct low-dimensional sparse vector embeddings using dimensionality reduction techniques such as Non-Negative Matrix Factorization (NMF). We propose an exterior point method for solving the NMF problem, which constructs a solution from a suitably rotated optimal solution of the unconstrained matrix factorization problem. We evaluate the performance of the proposed algorithm and the embedding-based clustering scheme on two datasets: posts from a discussion forum on parenting issues and a corpus of tweets on user experience with contactless payment methods.
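As a rough illustration of embedding-based actor clustering, the sketch below factorizes a small actor-by-context count matrix with NMF and assigns each actor to its dominant embedding dimension. It uses the standard multiplicative update rules rather than the exterior point method proposed in this dissertation, and the matrix `X`, the helper names, and the argmax clustering rule are all assumptions made for the example.

```python
import random

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols_B = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_B] for row in A]

def transpose(A):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]

def nmf(X, rank, iters=500, seed=0, eps=1e-9):
    """Approximate X by W @ H with non-negative W and H.

    Uses multiplicative updates for the squared-error objective
    (a stand-in solver, not the exterior point method).
    """
    rng = random.Random(seed)
    n, m = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T X) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), H)
        H = [[H[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (X H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(X, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return W, H

def cluster_actors(W):
    """Assign each actor (row of W) to its largest embedding coordinate."""
    return [max(range(len(row)), key=row.__getitem__) for row in W]

# Hypothetical counts: rows are actors, columns are context features.
X = [
    [5, 4, 0, 0],
    [4, 5, 0, 1],
    [0, 0, 6, 5],
    [0, 1, 5, 6],
]
W, H = nmf(X, rank=2)
labels = cluster_actors(W)
print(labels)
```

In this toy matrix the first two actors share contexts, as do the last two, so the rows of `W` separate into two groups. Any other clustering rule over the rows of `W` could be substituted for the argmax.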
Finally, to understand the dynamics of how stories evolve, we study the problem of detecting changes in the temporal evolution of user activity. We formulate this problem in a transient change point detection setting and design a statistical test that detects the change from the number of user activities observed so far, with minimum expected delay subject to a controlled measure of false alarm. We evaluate the change detection method on a corpus of tweets related to Super Bowl 2015.
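To make the idea of sequential detection on activity counts concrete, the sketch below runs a classical CUSUM test on Poisson counts. This is a textbook procedure shown only for illustration and is not the transient change detection test designed in this dissertation. The pre-change and post-change rates, the threshold, and the count sequence are all hypothetical values chosen for the example.

```python
import math

def poisson_cusum(counts, rate_before, rate_after, threshold):
    """Return the first index at which the CUSUM statistic exceeds
    the threshold, or None if no alarm is raised."""
    log_ratio = math.log(rate_after / rate_before)
    stat = 0.0
    for t, k in enumerate(counts):
        # Log-likelihood ratio of Poisson(rate_after) against
        # Poisson(rate_before) for an observed count k.
        llr = k * log_ratio - (rate_after - rate_before)
        # Reset to zero whenever the evidence for a change turns negative.
        stat = max(0.0, stat + llr)
        if stat > threshold:
            return t
    return None

# Hypothetical activity counts per time bin, with a burst starting at index 6.
counts = [2, 3, 1, 2, 3, 2, 9, 11, 10, 12, 2, 3]
alarm = poisson_cusum(counts, rate_before=2.0, rate_after=10.0, threshold=5.0)
print(alarm)
```

Raising `threshold` lowers the false alarm rate at the cost of a longer detection delay, which is the trade-off the test in this dissertation is designed to control.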