Declarative Frameworks and Optimization Techniques for Developing Scalable Advanced Analytics over Databases and Data Streams
- Author(s): Das, Ariyam
- Advisor(s): Zaniolo, Carlo
- et al.
In the past, the semantic issues raised by the non-monotonic nature of aggregates often prevented their use in the recursive statements of logic programs and deductive databases. However, the recently introduced notion of Pre-Mappability (PreM) has shown that, in key applications of interest, aggregates can be used in recursion to optimize the perfect-model semantics of aggregate-stratified programs. Therefore, we can preserve the declarative formal semantics of such programs, while achieving a highly efficient operational semantics that is conducive to scalable implementations on parallel and distributed platforms.
In this work, we show that using PreM, a wide spectrum of classical algorithms, ranging from graph analytics and dynamic programming based optimization problems to data mining, machine learning and online streaming applications can be concisely expressed in declarative languages by using aggregates in recursion. We present a concise analysis of this very general property and characterize its different manifestations for different constraints and rules.
Next, we prove that PreM-optimized plans are easily parallelizable and produce the same results as the single executor programs. Thus, PreM can be trivially assimilated into the data-parallel computation plans of different distributed systems, irrespective of whether these follow bulk synchronous parallel (BSP) or asynchronous computing models. This makes possible many advanced BigData applications to be now expressed declaratively in logic-based languages, including Datalog, Prolog, and even SQL, while enabling their execution with superior performance and scalability as compared to other specialized systems. Furthermore, we show that under PreM nonlinear recursive queries can be evaluated using a hybrid stale synchronous parallel (SSP) model with relaxed synchronization on distributed environments. We present empirical evidence of its benefits. We also compare the usability, expressivity and performance of PreM-optimized queries with queries written in quasi-declarative programming methodologies inspired by procedural languages like XY-stratification to showcase the different trade-offs and ramifications associated with each.
Lastly, we present robust online optimization techniques using two popular case studies, namely online lossless frequent pattern mining and online decision tree construction, to show how compact representations and statistical approximations can deliver superior performances in real-time for several streaming data mining and machine learning applications.