Scalable Lineage Capture for Debugging DISC Analytics
eScholarship
Open Access Publications from the University of California

Abstract

A fundamental challenge for big-data analytics is how to efficiently tune and debug multi-step dataflows. This paper presents Newt, a scalable architecture for capturing and using record-level data lineage to discover and resolve errors in analytics. Newt's flexible instrumentation API allows system developers to collect this fine-grained lineage from a range of data-intensive scalable computing (DISC) architectures, actively recording the flow of data through multi-step, user-defined transformations. Newt pairs this API with a scale-out, fault-tolerant lineage store and query engine. We find that, while active collection can be expensive, real-world analytics often incur modest runtime overheads (<36%), and that active collection enables novel lineage-based debugging techniques. For instance, Newt can efficiently recreate errors (crashes or bad outputs) or remove input data from the dataflow to enable data cleaning strategies. Additionally, Newt's active lineage collection allows retrospective analyses of a dataflow's behavior, such as identifying anomalous stages. As case studies, we instrument two DISC systems, Hadoop and Hyracks, with fewer than 105 lines of additional code for each. Finally, we use Newt to systematically clean input data to a Hadoop-based de novo genome assembler, significantly improving the quality of the output assembly.
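
To make the idea of record-level lineage concrete, the following is a minimal sketch of how input-to-output associations might be recorded and then traced backward (to find the inputs behind a bad output) or forward (to find the outputs affected by a bad input). It is not Newt's API: the class `LineageStore`, its methods, and the record identifiers are all hypothetical, and a simple in-memory dictionary stands in for the scale-out, fault-tolerant store the abstract describes.

```python
from collections import defaultdict


class LineageStore:
    """Toy in-memory store of record-level lineage (hypothetical, not Newt's API)."""

    def __init__(self):
        # output record id -> ids of the records it was derived from
        self._parents = defaultdict(set)
        # input record id -> ids of the records derived from it
        self._children = defaultdict(set)

    def add_association(self, input_id, output_id):
        """Record that one transformation step derived output_id from input_id."""
        self._parents[output_id].add(input_id)
        self._children[input_id].add(output_id)

    @staticmethod
    def _trace(start, edges):
        """Return every record reachable from start by following edges."""
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def backward_trace(self, output_id):
        """Original inputs (records with no recorded parents) behind an output."""
        reached = self._trace(output_id, self._parents)
        return {r for r in reached if r not in self._parents}

    def forward_trace(self, input_id):
        """Final outputs (records with no recorded children) derived from an input."""
        reached = self._trace(input_id, self._children)
        return {r for r in reached if r not in self._children}


# A two-step dataflow: a map step (in* -> m*) followed by a reduce step (m* -> out*).
store = LineageStore()
for src, dst in [("in1", "m1"), ("in2", "m2"), ("in3", "m3"),
                 ("m1", "out1"), ("m2", "out1"), ("m3", "out2")]:
    store.add_association(src, dst)

print(sorted(store.backward_trace("out1")))  # ['in1', 'in2']
print(sorted(store.forward_trace("in3")))    # ['out2']
```

In this sketch, replaying only the records returned by `backward_trace` is one way an error could be recreated on a small input, and dropping a suspect input identified by `forward_trace` corresponds to the data cleaning use mentioned above.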

Pre-2018 CSE ID: CS2012-0990
