UC San Diego
Newt : an architecture for lineage -based replay and debugging in DISC systems
- Author(s): De, Soumyarupa
- et al.
Data-intensive scalable computing (DISC) systems facilitate large-scale analytics to mine "big data" for useful information. However, understanding and debugging these systems and analytics is a fundamental challenge to their continued use. This thesis presents Newt, a scalable architecture for capturing fine-grain lineage from DISC systems and using this information to analyze and debug analytics. Newt provides a unique instrumentation API, which actively extracts fine-grain lineage across complex, non-relational analytics. Newt combines this API with a scalable architecture for storing lineage to accommodate the high throughputs of DISC systems. This architecture enables efficient dataflow tracing queries across thousands of operators found in modern data analytics. Newt extends tracing with replay, enabling users to perform step-wise debugging or regenerate lost outputs at a fraction of the cost to execute the entire analytics. Newt further facilitates replay for re-executing analytics without bad inputs to produce error-free outputs. Finally, Newt also enables retrospective lineage analysis, which we use to identify errors in the dataflow using outlier detection techniques. We illustrate the flexibility of Newt's capture API by instrumenting two DISC systems: Apache Hadoop and Hyracks. This API incurs 10-51% time overhead and 30-120% space overhead on workloads consisting of relational and non-relational operators, including a Hadoop-based de novo genomic assembler. Newt can also accurately replay selected outputs, which can reduce the time to recreate errors during debugging. We show that it incurs 0.3% of the original runtime when replaying individual outputs in a WordCount workload. Finally, this work shows the effectiveness of Newt's debugging methodology by pinpointing faulty operators in a dataflow