In the last few years, the field of data science has grown rapidly as businesses have adopted statistical and machine learning techniques to support their decision making and applications. Scaling analysis to large volumes of data, possibly including the application of custom machine learning models, requires distributed frameworks, which can pose serious technical challenges for data analysts and reduce their productivity.
To efficiently support the full Big Data analysis lifecycle without requiring extensive distributed systems knowledge, we extend a tool familiar to data scientists, the Pandas DataFrame, to operate on managed data at scale. We introduce AFrame, a new scalable analysis package that integrates a Pandas-like user experience with data management systems. AFrame gives analysts a familiar working environment while scaling out the evaluation of analytical operations across a large data cluster, enabling analysis of large-scale managed datasets.
This dissertation has four parts. The first is the construction of a new framework, AFrame, which we have implemented on top of Apache AsterixDB by transparently converting dataframe operations into SQL++ queries. The second is making AFrame more flexible to deploy by retargeting its incremental query formation to other query-based database systems with composable query languages. The third is the creation of a benchmark to evaluate the framework's performance. The fourth and final part is a case study analysis that demonstrates the framework's feasibility and efficacy.
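To make the idea of incremental query formation concrete, the following is a minimal sketch rather than AFrame's actual implementation or API. It assumes a toy class, `LazyFrame`, whose hypothetical methods (`from_dataset`, `filter`, `select`, `head`) each wrap the query built so far as a subquery, so that dataframe-style calls accumulate into a single SQL++-like string. The dataset name `Tweets` and its fields are likewise made up for illustration, and nothing is sent to a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LazyFrame:
    """Toy dataframe that only records a query string; nothing is executed.

    Hypothetical illustration only: this is not AFrame's API.
    """
    query: str

    @classmethod
    def from_dataset(cls, name: str) -> "LazyFrame":
        # Start from a query that returns every record of the dataset.
        return cls(f"SELECT VALUE t FROM {name} t")

    def filter(self, condition: str) -> "LazyFrame":
        # Wrap the query built so far as a subquery and add a predicate.
        return LazyFrame(
            f"SELECT VALUE t FROM ({self.query}) t WHERE {condition}"
        )

    def select(self, *columns: str) -> "LazyFrame":
        # Project the requested fields out of the query built so far.
        cols = ", ".join(f"t.{c}" for c in columns)
        return LazyFrame(f"SELECT {cols} FROM ({self.query}) t")

    def head(self, n: int = 5) -> str:
        # An action: a real system would send this query to the database.
        # Here we simply return the final query text.
        return f"{self.query} LIMIT {n};"


tweets = LazyFrame.from_dataset("Tweets")
english = tweets.filter('t.lang = "en"')
result_query = english.select("id", "text").head(3)
print(result_query)
```

Each call returns a new object holding a larger query, so intermediate frames such as `english` stay reusable. Because the composition relies only on nesting subqueries, the same pattern could in principle be retargeted to another composable query language by swapping the string templates.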