Declarative Languages and Systems for Transparency, Performance and Scalability in Database Analytics
- Author(s): Li, Youfu
- Advisor(s): Zaniolo, Carlo
- et al.
Demand for powerful, high-performance analytics on Big Data is ever growing. Developing tools and methodologies for advanced database analytics, such as data mining applications, has long been an active area of research that poses elusive challenges to both academia and industry, on topics that include: 1) the design of expressive high-level languages with declarative semantics for data analytics, 2) optimization and parallelization for efficient and scalable execution, and 3) transparency of the analytics dataflow for error tracking and debugging. This thesis proposes methods and tools for developing powerful data analytics systems based on declarative languages, dataflow inspection and query optimization. By leveraging and integrating these tools we obtain i) a scalable data analytics framework for knowledge discovery through concise, declarative queries, ii) a unified solution that enables analytics dataflow inspection and further supports provenance and debugging for data analytics applications, and iii) an integrated runtime query optimizer that generates optimal execution plans for data analytics queries and achieves superior performance in application areas that had posed major challenges for traditional database technology.
In particular, our KDDLog system enables users to build or customize knowledge discovery models in a concise and expressive language, via recursive queries with aggregates and our newly proposed chain aggregates. We further provide specialized compilation techniques for semi-naive fix-point computation in the presence of aggregates, optimizations for complex recursive queries on distributed data platforms, KDDLib for building knowledge discovery tasks, and advanced interfaces that help users port new knowledge discovery models. Following KDDLog, we present SEIZE, a unified framework that enables dataflow inspection, that is, wiretapping the data-path of data analytics applications with listening logic. We generalize the lessons learned by providing a set of primitives that define dataflow inspection, orchestration options for different inspection granularities, and operator decomposition and dataflow punctuation strategies for dataflow intervention. Finally, we propose RIOS, a runtime integrated query optimizer for data analytics that lazily binds to execution plans at runtime, after collecting the statistics needed to make better decisions. A specific focus in our design is to obtain accurate estimates of predicate (including UDF) selectivities for determining an optimal join order and physical join implementation, without incurring significant runtime overheads.
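To make the idea of semi-naive fix-point computation with aggregates more concrete, the following is a minimal Python sketch of a generic textbook formulation. It is not KDDLog syntax and not the compilation technique developed in the thesis. It evaluates a recursive shortest-distance query with a `min` aggregate, and in each round it joins only the facts that improved in the previous round (the "delta") instead of re-deriving everything. The function name `shortest_paths`, the edge-tuple layout, and the rule shown in the docstring are assumptions made for illustration, and the sketch assumes non-negative edge weights.

```python
from collections import defaultdict


def shortest_paths(edges, source):
    """Semi-naive evaluation of an illustrative recursive rule with a min aggregate:

        dist(source, 0).
        dist(Y, min<D + W>) <- dist(X, D), edge(X, Y, W).

    `edges` is an iterable of (x, y, w) tuples with non-negative weights w.
    """
    # Index the edge relation by its source node so the join is a lookup.
    out = defaultdict(list)
    for x, y, w in edges:
        out[x].append((y, w))

    best = {source: 0}   # current aggregate (minimum distance) per node
    delta = {source: 0}  # facts whose aggregate improved in the last round

    while delta:
        new_delta = {}
        # Join only the delta with the edge relation, not all of `best`.
        for x, d in delta.items():
            for y, w in out[x]:
                cand = d + w
                improves_best = cand < best.get(y, float("inf"))
                improves_round = cand < new_delta.get(y, float("inf"))
                if improves_best and improves_round:
                    new_delta[y] = cand
        # Fold the improvements into the aggregate and iterate on them only.
        best.update(new_delta)
        delta = new_delta
    return best
```

For example, calling `shortest_paths([("a", "b", 4), ("a", "c", 1), ("c", "b", 2), ("b", "d", 5)], "a")` iterates until no distance improves and returns the minimum distance from `"a"` to every reachable node. Keeping only improved facts in the delta is what lets the aggregate be applied inside the recursion rather than after computing all paths.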