The growing importance of data science applications has motivated considerable research interest in powerful languages and scalable systems that support advanced analytics on massive data sets. Languages such as R and Scala are used to develop advanced analytical applications that are not supported by SQL, the query language that has been used for decades to search databases and analyze their data. An interesting research question arises in this scenario: is it possible to design an efficient query language that simplifies the writing of advanced analytical applications and provides a unified environment for their development and deployment on multiple platforms, including massively parallel ones? In this thesis, we answer this question affirmatively by presenting extensions of the logic-based query language Datalog, together with their implementation techniques, that enable (i) scalable support for graph analytics and knowledge discovery applications, and (ii) portability between multicore machines and clusters.
A first set of extensions discussed in this thesis is based on monotonic aggregates and led to the implementation of our Deductive Application Language (DeAL) system, which (i) achieves superior performance on graph analytics applications compared with other Datalog systems on multicore machines, and (ii) outperforms other distributed Datalog systems, as well as both GraphX and native Apache Spark. We then tackle the difficult problem of supporting knowledge discovery applications by introducing non-monotonic extensions that support generic user-defined aggregates, for which we provide a formal logic-based semantics. The resulting Knowledge Discovery in Datalog (KDDlog) language can efficiently express both descriptive analytics, such as rollups and data cubes, and predictive analytics, such as association rule mining, classification, regression analysis, and cluster analysis.