eScholarship
Open Access Publications from the University of California

Optimizing Interactive Development of Data-Intensive Applications.

  • Author(s): Interlandi, Matteo; Tetali, Sai Deep; Gulzar, Muhammad Ali; Noor, Joseph; Condie, Tyson; Kim, Miryung; Millstein, Todd
  • Editor(s): Aguilera, Marcos K; Cooper, Brian; Diao, Yanlei; et al.
Abstract

Modern Data-Intensive Scalable Computing (DISC) systems are designed to process data through batch jobs that execute programs (e.g., queries) compiled from a high-level language. These programs are often developed interactively by posing ad-hoc queries over the base data until a desired result is generated. We observe that there can be significant overlap in the structure of the queries used to derive the final program. Yet each successive execution of a slightly modified query is performed anew, which can significantly lengthen the development cycle. Vega is an Apache Spark framework that we have implemented for optimizing a series of similar Spark programs, likely originating from a development or exploratory data analysis session. Spark developers (e.g., data scientists) can leverage Vega to significantly reduce the time it takes to re-execute a modified Spark program, reducing the overall time to market for their Big Data applications.
