UC San Diego
Continuous MapReduce: an architecture for large-scale in-situ data processing
- Author(s): Trezzo, Christopher J.
- et al.
This thesis addresses a fundamental data management challenge faced by cloud service providers: analysis of semi-structured log data generated by large-scale compute infrastructure. This analysis is a crucial aspect of a cloud provider's business, creating competitive advantage by mining user behavioral patterns and ensuring efficient use of resources. However, the amount of data produced in this environment is rapidly growing. The current approach brings data to a central location before analysis, incurring a significant cost and delaying results. As scale increases, the time and cost of data migration alone will render this approach infeasible. We present Continuous MapReduce (CMR), an architecture for large-scale in-situ data processing. CMR is designed to be scalable, responsive, and available while processing logs across thousands of data center servers. CMR extends the MapReduce programming model to allow continuous queries over these data streams, building on concepts found in distributed stream processing. The salient architectural features include an in-situ approach, incremental processing with sliding windows, and a relaxed consistency model. We have built a prototype CMR framework using Mortar, a distributed stream processor, and evaluated it against current batch processing systems. Our results indicate that this approach can improve result latency by 30% for batch and continuous queries. In addition, CMR's consistency model enables it to return results quickly in the face of failure while still maintaining high result fidelity. These results indicate that CMR is a valuable tool for addressing the scalability issues of next-generation data processing environments.
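As a rough illustration of what "incremental processing with sliding windows" over a MapReduce-style query can look like, the following minimal Python sketch counts log records per key over a window made of fixed-size panes. It is not the thesis's Mortar-based prototype: the class `SlidingWindowCounter`, the function `map_log_line`, and the assumed log format (`<timestamp> <status> <path>`) are all hypothetical names and choices introduced here for illustration only. The idea it shows is that each pane of log lines is mapped and reduced once, and a window result is produced by merging the stored per-pane partial results rather than reprocessing raw data.

```python
from collections import Counter, deque
from typing import Callable, Deque, Dict, Iterable, Iterator, Tuple

KeyValue = Tuple[str, int]


def map_log_line(line: str) -> Iterator[KeyValue]:
    """Hypothetical map function: emit (status, 1) for a log line.

    Assumes lines look like '<timestamp> <status> <path>'; malformed
    lines are skipped.
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


class SlidingWindowCounter:
    """Illustrative sliding-window sum built from per-pane partial results."""

    def __init__(self, map_fn: Callable[[str], Iterable[KeyValue]],
                 panes_per_window: int) -> None:
        self.map_fn = map_fn
        # deque(maxlen=...) evicts the oldest pane when a new one arrives,
        # which is what slides the window forward.
        self.panes: Deque[Counter] = deque(maxlen=panes_per_window)

    def add_pane(self, lines: Iterable[str]) -> Dict[str, int]:
        """Map and reduce one pane of raw lines, then return the window result."""
        partial: Counter = Counter()
        for line in lines:
            for key, value in self.map_fn(line):
                partial[key] += value
        self.panes.append(partial)
        return self.window_result()

    def window_result(self) -> Dict[str, int]:
        """Merge the stored partial results; raw lines are never re-read."""
        total: Counter = Counter()
        for partial in self.panes:
            total.update(partial)  # Counter.update adds counts per key
        return dict(total)
```

With a window of two panes, for example, `SlidingWindowCounter(map_log_line, 2)` would report counts over the two most recent calls to `add_pane`, dropping the oldest pane's contribution each time a new pane arrives. A real deployment in the spirit of the abstract would run such per-pane reduction on the servers where the logs are produced and would need to handle failures and late data, which this sketch does not attempt.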