Query Language Extensions for Advanced Analytics on Big Data and their Efficient Implementation
Advanced analytics and other Big Data applications call for query languages that can express complex analytical logic and that are also amenable to efficient implementations providing high throughput and low latency. Existing systems such as Hadoop and Spark can handle large amounts of data via MapReduce-enabled parallelism, but they lack simple query languages that can declaratively express applications such as common graph and data mining algorithms, or the search for complex patterns in massive data sets. Fortunately, recent advances in recursive query languages and automata theory have paved the way for extending widely used declarative query languages, such as SQL, to address these problems. In this dissertation, we therefore propose two significant new extensions to the current SQL standards and demonstrate their efficient implementations. We first propose the Recursive-aggregate-SQL language, RaSQL. RaSQL queries have a declarative, formal fixpoint semantics that is guaranteed by the PreM property, and they remain amenable to efficient recursive query evaluation techniques based on the Semi-Naive optimization of the fixpoint computation. RaSQL is implemented on top of Apache Spark, achieving superior scalability and performance compared to state-of-the-art systems such as Apache Giraph, GraphX and Myria. We then propose a new Weighted Search Pattern language, WSP, which extends the SQL-TS language. WSP provides semantic rankings of the query results, and its implementation and optimization are guided by the theory of weighted automata.
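As an illustration of the evaluation style mentioned above, the following is a minimal sketch of a Semi-Naive fixpoint computation in which a min aggregate is applied inside the recursion, here for single-source shortest path lengths. It is written in plain Python rather than RaSQL or Spark, and it is not the dissertation's implementation: the function name `shortest_paths_semi_naive`, the `(src, dst, cost)` edge format, and the assumption of non-negative edge costs are all hypothetical choices made for this example.

```python
from collections import defaultdict


def shortest_paths_semi_naive(edges, source):
    """Single-source shortest path lengths via semi-naive fixpoint iteration.

    edges: iterable of (src, dst, cost) tuples; costs are assumed non-negative.
    Returns a dict mapping each reachable node to its minimal path cost.
    """
    # Index the edge relation by source node (hypothetical in-memory layout).
    out = defaultdict(list)
    for src, dst, cost in edges:
        out[src].append((dst, cost))

    best = {source: 0}   # all facts derived so far: node -> minimal cost
    delta = {source: 0}  # facts that were new or improved in the last round

    while delta:
        new_delta = {}
        # Semi-naive step: join only the delta with the edge relation,
        # not the full set of facts derived so far.
        for node, dist in delta.items():
            for dst, cost in out.get(node, ()):
                cand = dist + cost
                # min aggregate applied inside the recursion: keep a candidate
                # only if it improves on every value seen so far for dst.
                if cand < best.get(dst, float("inf")) and cand < new_delta.get(dst, float("inf")):
                    new_delta[dst] = cand
        best.update(new_delta)
        delta = new_delta  # the fixpoint is reached when nothing improves

    return best
```

For example, with the edges `("a", "b", 1)`, `("b", "c", 2)`, `("a", "c", 5)` and `("c", "a", 1)`, starting from `"a"`, the first round derives a cost of 5 for `"c"`, the second round improves it to 3 via `"b"`, and the third round derives nothing new, so the iteration stops. Each round touches only the facts that changed in the previous round, which is the saving that the Semi-Naive optimization aims for.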