Query Optimization for Big Spatial Databases using Theoretical Analysis and Machine Learning
Spatial data is being produced at increasing rates by sources such as mobile applications and satellites. For example, an average of 500 million tweets are sent every day by users at different spatial locations, and NASA EOSDIS adds about 6.4 TB of data to its archives every day. These data sources have pushed the research community and industry to develop new systems for big spatial data. Regardless of their architecture, one of the fundamental requirements of query optimization in these systems is to spatially partition the data efficiently across machines. Existing spatial databases rely on traditional index structures such as the R-tree, STR, Kd-tree, and Quad-tree. These approaches are not always suited to the demands of current big data applications. My dissertation proposes new partitioning techniques based on theoretical analysis. First, this work introduces a balanced spatial partitioning technique, termed R*-Grove, which produces load-balanced partitions with high spatial quality. Second, this dissertation proposes an incremental spatial partitioning framework for distributed file systems that supports high ingestion rates and efficient spatial analytical queries.
The systems proposed above are built on theoretical analysis of spatial query performance. In recent years, many works have applied machine learning techniques to classical problems in big data systems. Motivated by the success of these approaches, my dissertation also proposes machine-learning-based systems for several query optimization problems in spatial databases, including spatial partitioning, selectivity estimation, and spatial join cost estimation. The experimental results show that machine learning is a promising approach to solving query optimization problems in big spatial databases efficiently.