Government, medical, financial, and web-based services increasingly
depend on the ability to rapidly sift through huge, evolving data sets. These
data-intensive applications perform complex multi-step computations over
successive generations of data inflows (e.g., weekly web crawls, nightly
telescope dumps, or hourly surveillance videos). Because of the data volumes
involved, applications must avoid reprocessing old data when new data arrives
and instead process incrementally. Unlike in stream-based systems, incoming
data does not have to be processed immediately, permitting work to be amortized
via bulk processing. Such bulk-incremental processing represents an emerging
class of applications whose needs are not fully met by current systems. This
paper presents a generalized architecture for bulk-incremental processing
systems (BIPS) that simplifies the creation of such applications. In contrast with
incremental view maintenance in data warehousing, BIPS provides flexible
low-level primitives for managing incremental data and processing, upon which
both relational and non-relational operations can be implemented. The paper
describes the BIPS programming model along with several example applications
and examines several key implementation choices. Experiments on a large testbed
cluster show that these choices play a major role in overall system
performance.
Pre-2018 CSE ID: CS2009-0951