In the era of real-time big data stream processing, the past decade has witnessed the emergence of a wide variety of applications that tap into live social data feeds to gain awareness of events, opinions, and sentiments in the community. Social data has the potential to bring new functionality and improvements to a wide variety of domains, from emergency response to political analysis. Modern applications therefore increasingly demand up-to-the-minute analysis on top of these dynamic data sources. Such requirements create new challenges for traditional data processing techniques. In this thesis, we respond to some of these challenges.
First, we explore the problem of online, adaptive, topic-focused tweet acquisition. Specifically, we propose a Tweet Acquisition System (TAS) that iteratively selects phrases to track according to a reinforcement learning algorithm. The selection follows an explore-exploit policy and uses Bayesian inference to approximate the effectiveness of different phrases in retrieving relevant tweets. We also develop a tweet relevance model, which checks the relevance of collected tweets to the topic of interest based on multiple criteria. The objective of TAS is to improve the recall of relevant tweets. Our experimental studies show significant improvements over the state of the art; furthermore, the performance gap widens as topics become more specific.
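As an illustration of an explore-exploit phrase selection driven by Bayesian estimates, the following minimal sketch uses Beta-Bernoulli Thompson sampling. This is a stand-in technique chosen for brevity and is not necessarily the algorithm used by TAS. The class name `PhraseBandit`, its methods, and the example phrases are all hypothetical, and the relevance counts are assumed to come from some external relevance model.

```python
import random


class PhraseBandit:
    """Illustrative Beta-Bernoulli Thompson sampling over tracking phrases.

    Assumption: each phrase has an unknown probability of retrieving a
    relevant tweet, modeled with an independent Beta posterior.
    """

    def __init__(self, phrases, seed=None):
        self.rng = random.Random(seed)
        # Uniform Beta(1, 1) prior per phrase, stored as [alpha, beta].
        self.posterior = {p: [1.0, 1.0] for p in phrases}

    def select(self, k):
        """Draw one sample per phrase and return the k highest draws.

        Uncertain phrases sometimes draw high (exploration), while
        phrases with a strong record usually draw high (exploitation).
        """
        draws = {
            p: self.rng.betavariate(a, b)
            for p, (a, b) in self.posterior.items()
        }
        return sorted(draws, key=draws.get, reverse=True)[:k]

    def update(self, phrase, relevant, irrelevant):
        """Fold the observed relevant/irrelevant tweet counts into the posterior."""
        self.posterior[phrase][0] += relevant
        self.posterior[phrase][1] += irrelevant

    def estimate(self, phrase):
        """Posterior mean of the phrase's relevance rate."""
        a, b = self.posterior[phrase]
        return a / (a + b)
```

In one iteration, a caller would pick phrases with `select(k)`, track them for a time slice, score the collected tweets with a relevance model, and report the counts back through `update`. Sampling from the posterior, rather than ranking by `estimate`, is what keeps rarely tried phrases in play.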
Second, we address the efficient processing of top-k mentioned-entity queries over a stream of tweets, which has become a key part of a broad class of real-time applications, ranging from content search to marketing. Because words are often ambiguous, entity linking is an important step towards answering such queries. Furthermore, the continuous and fast generation of tweets makes it crucial for such applications to process those queries at an equally fast pace. To address these requirements, we propose TkET (pronounced "ticket"), an analysis-aware entity linking framework for efficiently answering top-k entity queries over the Twitter stream in a sliding-window fashion. A comprehensive empirical evaluation demonstrates that the proposed solution is significantly more efficient than traditional techniques in the given problem settings.
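To make the query itself concrete, the sketch below maintains top-k entity counts over a count-based sliding window of the most recent tweets. It is a naive baseline for illustration only, not the TkET framework: it assumes every tweet has already been fully entity-linked, which is exactly the cost an analysis-aware approach would try to reduce. The class name `SlidingTopK` and the entity labels are hypothetical.

```python
from collections import Counter, deque


class SlidingTopK:
    """Naive top-k entity counts over the last `window_size` tweets.

    Assumption: each tweet arrives as a list of already-linked entity ids.
    """

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()    # entity lists of the tweets currently in the window
        self.counts = Counter()  # mention count per entity within the window

    def add(self, entities):
        """Slide the window forward by one tweet."""
        entities = list(entities)
        self.window.append(entities)
        self.counts.update(entities)
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts.subtract(expired)
            # Drop entities that no longer occur in the window.
            for entity in set(expired):
                if self.counts[entity] <= 0:
                    del self.counts[entity]

    def top_k(self, k):
        """Return the k most mentioned entities as (entity, count) pairs."""
        return self.counts.most_common(k)
```

Each arrival costs time proportional to the number of entities in the arriving and expiring tweets, while `top_k` scans the current counts on demand.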