The wide availability of tracking devices has drastically increased geolocation in social networks, resulting in new commercial applications; for example, marketers can identify current trending topics within a region of interest and focus their products accordingly. In this thesis, we first study a basic analytics query on geotagged data, which we refer to as the Top-k Frequent Spatiotemporal Terms (kFST) query. Given a spatiotemporal region, kFST finds the most frequent terms among the social posts in that region. While there has been prior work on keyword search and on group keyword search on spatial data, our problem is different in that it returns keywords and aggregated frequencies as output, instead of having the keyword(s) as input. Moreover, we differ from works addressing the streamed version of this query in that we operate on large, disk-resident data, and we provide exact answers. This thesis proposes an index structure and algorithms to answer kFST queries. Our index structure employs an R-tree augmented by top-k sorted term lists, where a key challenge is to balance the size of the index to achieve faster execution and smaller space requirements. We theoretically study and experimentally validate the ideal length of the stored term lists. A detailed experimental evaluation of the proposed methods' performance as compared to baselines using real datasets, shows its efficiency and better performance.
Next, we study the reverse problem, namely, the Reverse Spatial Top-k Keyword (RSK) query: given a query term q, an integer k, and a neighborhood size l, find the neighborhoods of that size where q is in the top-k most frequent terms among the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of cell size l (a temporal window, a spatial square, or a spatiotemporal cube) and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned to the grid. What makes the problem challenging challenging is that the complete answer should identify neighborhoods placed anywhere in the search space. Furthermore, for every neighborhood that is an answer, one can easily create another neighborhood with the same data points that are also an answer (by merely moving this neighborhood in any direction as long as it contains the same set of data points). In particular, there are infinitely many such answers. Instead, we identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an index structure (that employs a uniform grid augmented by materialized sorted term lists) and algorithms to answer the RSK query efficiently. We apply various optimizations that drastically improve query latency against the baseline. We also provide a theoretical model to choose the optimal cell size to minimize query latency. We further examine an approximate version of the RSK, which leads to faster query processing. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets shows the efficiency of our optimizations and the accuracy of the proposed theoretical model.
Finally, we examine the Reverse Spatial Top-k Snapshot Query (RSKSQ), which like the RSK query, identifies areas where a keyword is popular, however, it operates over the Twitter data stream. Given a query term q, an integer k, a neighborhood size l, a time window W, and a refresh rate r, we find the neighborhoods of size l where q is in the top-k most frequent terms among the tweets in those neighborhoods. To answer RSKSQ queries, we run the RSK query over a snapshot of a fixed-sized time window (W) of the most recent tweets, i.e., tweets posted in the last W minutes, and refresh the results every r seconds. To implement this approach we use a Ring Buffer B of a fixed size to keep track of the latest tweets. If B becomes full, we discard the oldest tweets. We use an index structure consisting of a uniform grid augmented by materialized lists of term frequencies and a filter-refinement-based RSK query processing algorithm optimized for fast updates to find the answers. We have implemented a system that provides the results of RSKSQ refreshed every r seconds using a desktop application based on ArcGIS.