Programming Bulk-Incremental Dataflows
Skip to main content
eScholarship
Open Access Publications from the University of California

Programming Bulk-Incremental Dataflows

Abstract

Government, medical, financial, and web-based services increasingly depend on the ability to rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows (e.g., weekly web crawls, nightly telescope dumps, or hourly surveillance videos). Because of the data volumes involved, applications must avoid reprocessing old data when new data arrives and instead process incrementally. Unlike in stream-based systems, incoming data does not have to be processed immediately, permitting work to be amortized via bulk processing. Such bulk-incremental processing represents an emerging class of applications whose needs are not fully met by current systems. This paper presents a generalized architecture for bulk-incremental processing systems (BIPS), simplifying the creation of such programs. In contrast with incremental view maintenance in data warehousing, BIPS provides flexible low-level primitives for managing incremental data and processing, upon which both relational and non-relational operations can be implemented. The paper describes the BIPS programming model along with several example applications and examines some key implementation choices. These choices are shown to play a major role in overall system performance, via experiments on a large testbed cluster.

Pre-2018 CSE ID: CS2009-0944

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View