UC San Diego
Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form
- Author(s): Nguyen Thanh, Nhat Tan
- et al.
Communication remains a significant barrier to scalability on distributed-memory systems. At present, the trend in architectural system design, which focuses on enhancing node performance, exacerbates the communication problem, since the relative cost of communication grows as the computation rate increases. This problem will be more pronounced at the exascale, where computational rates will be orders of magnitude higher than those of current technology. Communication overlap is an effective way to hide communication costs by masking them behind computation. However, existing overlapping techniques not only require significant programming effort but also complicate the original program. This dissertation presents a source-to-source translation framework that can realize communication overlap in applications written in MPI, a standard library for distributed-memory programming, without the need to intrusively modify the source code. We explore a strategy based on re-interpreting MPI, which executes the application under a data-driven model that can hide communication overheads automatically. We reformulate MPI source into a task dependency graph representation, in which vertices represent tasks containing computation code and edges represent data dependencies among tasks. The task dependency graph maintains a partial ordering over the execution of tasks, enabling the program to execute in a data-driven fashion under the guidance of an external runtime system. To automate the code translation process, we develop Bamboo, a source-to-source translator. Bamboo supports a rich set of MPI routines, including point-to-point, collective, and communicator splitting operations. We show that, for a variety of applications, Bamboo is able to hide communication overheads on a wide range of platforms, including traditional clusters of multicore processors as well as platforms based on accelerators (NVIDIA GPUs) and coprocessors (Intel MIC).
Specifically, we translate applications taken from three different application motifs: dense linear algebra, structured grids, and unstructured grids. In all cases, Bamboo significantly reduces communication delays while requiring only modest amounts of programmer annotation. The performance of applications translated with Bamboo meets or exceeds that of labor-intensive hand coding. The translator is more than a means of hiding communication costs automatically; it also serves as an example of the utility of semantic-level optimization against a well-known library.
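
The data-driven execution model described in the abstract can be illustrated with a small sketch. The Python code below is not Bamboo's translator or runtime; it is a minimal model, under assumed names, of a task dependency graph in which a task runs as soon as all of its inputs are available. The function `run_data_driven`, the task names (`interior`, `left_edge`, `right_edge`, `reduce`), and the message names (`msg_left`, `msg_right`) are all hypothetical.

```python
from collections import deque

def run_data_driven(deps, arrivals):
    """Run tasks in the order their inputs become available.

    deps     maps each (hypothetical) task name to the set of inputs it
             waits on: external message names or names of other tasks.
    arrivals lists external messages in the order they arrive.
    Returns the order in which tasks executed.
    """
    pending = {task: set(inputs) for task, inputs in deps.items()}
    events = deque(arrivals)
    executed = []

    # Tasks with no inputs are runnable immediately, so computation that
    # needs no remote data starts without waiting for any message.
    for task in [t for t, inputs in pending.items() if not inputs]:
        del pending[task]
        executed.append(task)
        events.append(task)

    while events:
        event = events.popleft()
        ready = []
        for task, inputs in pending.items():
            inputs.discard(event)
            if not inputs:
                ready.append(task)
        for task in ready:
            del pending[task]
            executed.append(task)
            events.append(task)  # a finished task may enable other tasks
    return executed

# A stencil-like example: the interior needs no remote data, each edge
# needs one incoming message, and a final step needs all three results.
deps = {
    "interior": set(),
    "left_edge": {"msg_left"},
    "right_edge": {"msg_right"},
    "reduce": {"interior", "left_edge", "right_edge"},
}
order = run_data_driven(deps, ["msg_right", "msg_left"])
print(order)
```

In this sketch the edge tasks run in whatever order their messages arrive, while the graph still guarantees that `reduce` runs last. That is the partial ordering the abstract refers to: only true data dependencies constrain the schedule, which leaves the runtime free to do independent computation while messages are in flight.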