Search

Scholarly Works (1 results)

Article
Peer Reviewed

Programming Bulk-Incremental Dataflows

Technical Reports (2009)

Government, medical, financial, and web-based services increasingly depend on the ability to rapidly sift through huge, evolving data sets. These data-intensive applications perform complex multi-step computations over successive generations of data inflows (e.g., weekly web crawls, nightly telescope dumps, or hourly surveillance videos). Because of the data volumes involved, applications must avoid reprocessing old data when new data arrives and instead process incrementally. Unlike in stream-based systems, incoming data does not have to be processed immediately, permitting work to be amortized via bulk processing. Such bulk-incremental processing represents an emerging class of applications whose needs are not fully met by current systems. This paper presents a generalized architecture for bulk-incremental processing systems (BIPS), simplifying the creation of such programs. In contrast with incremental view maintenance in data warehousing, BIPS provides flexible low-level primitives for managing incremental data and processing, upon which both relational and non-relational operations can be implemented. The paper describes the BIPS programming model along with several example applications and examines some key implementation choices. These choices are shown to play a major role in overall system performance, via experiments on a large testbed cluster.

Pre-2018 CSE ID: CS2009-0951

Cover page: Programming Bulk-Incremental Dataflows