With the proliferation of user-generated data, many emerging applications consume this data to serve important domains such as natural disaster management, citizen journalism, social recommendations, targeted advertising, and scientific research. Most of this data arrives as streams at tremendously high rates and accumulates into large archives of historical data. This dissertation studies how to index and query this data in different contexts in order to support low-latency queries.
First, we evaluate, at the system level, ten different indexes that support spatial-keyword queries on streaming data. These queries, namely range queries and $k$-nearest-neighbor queries, are extended to include the time dimension in addition to space and keywords so that they effectively serve streaming spatial data applications. Supporting such queries in a streaming environment is challenging because the data arrives at a very high rate and query answers are likely to change around the clock. The extensive evaluation gives system builders insight into the potential gains and losses of employing one index over the others from a system perspective.
Second, we introduce two new spatial-temporal personalized queries that tailor their results to the query issuer based on the user’s social network. In addition, we propose a scalable geo-social indexing framework that digests real-time geo-social data. The framework distinguishes highly dynamic data from relatively stable data and uses an appropriate data structure and storage tier for each. The query processor uses this framework, together with different types of pruning, to provide real-time query responses with minimal overhead on system resources.
Lastly, we study community-centric queries on user-generated data that capture community interests over time. Understanding a specific community is crucial for making decisions and writing policies. We propose a novel indexing paradigm to efficiently digest community interactions. Furthermore, we develop scalable techniques, both exact and approximate, to find the top-$k$ items with which a specific community interacted the most during a given time period. The proposed techniques prune the search space to provide low query latency.