Extending Mapreduce for Scientific Data
In the last decade, our ability to store data has grown at a greater rate than our ability to process that data via existing methods. This has given rise to an entire industry focused on developing new approaches to large-scale data-intensive computing that is commonly referred to as "big data".
The scientific community has seen a similar growth in the sizes of the datasets that they must analyze in order to advance their respective fields. Unfortunately, the unique characteristics of scientific data and computations are sufficiently distinct from those that garner the focus of the larger "big data" community that the direct application of existing systems is not feasible. The absence of a suitable platform for scalable data-intensive scientific computing has already become a hindrance to some scientific communities and the negative impact on research will only increase as datasets continue to grow.
In response to this clear and present need, we present our research into designing and building a distributed computing platform appropriate for researchers in a variety of fields that are continually evolving their approaches to processing scientific data. Our work extends Hadoop, an existing open source system based on the research described in the seminal 2004 MapReduce paper, in order to benefit from the existing state of the art. We first identify the barriers to executing common scientific computations over data residing in domain-specific file formats. Once those issues are addressed, we then consider how the unique properties of scientific data and computations can be leveraged to improve the system. The resulting alterations are then evaluated, both in terms of their impact to the MapReduce design as well as their impact on the performance characteristics of the system. The culmination of our work is a viable platform for scientific computing that has been utilized and validated by external researchers.