In the exascale computing era, applications are executed at larger scales than ever before, which places greater scalability demands on communication library design. The Message Passing Interface (MPI) is widely adopted by parallel applications for interprocess communication, and communication performance can significantly impact overall application performance, especially at large scale.
Many aspects of MPI communication must be explored to achieve the maximal message rate and network throughput. Communication load balance is essential for high-performance applications: unbalanced communication can cause severe performance degradation, even in computation-balanced Bulk Synchronous Parallel (BSP) applications. The MPI communication imbalance issue, however, has not been investigated as thoroughly as computation load balance. Because communication is not fully controlled by application developers, designing communication-balanced applications is challenging due to the diverse communication implementations in the underlying runtime system.
In addition, MPI provides nonblocking point-to-point and one-sided communication models, where asynchronous progress is required to guarantee the completion of MPI communication and to achieve better communication and computation overlap. Traditional mechanisms either spawn an additional background thread on each MPI process or launch a fixed number of helper processes on each node. For complex multiphase applications, however, these mechanisms may suffer severe performance degradation due to dynamically changing communication characteristics.
On the other hand, as the number of CPU cores and nodes used by applications greatly increases, even small-message MPI collectives can incur huge communication overhead at large scale if they are not carefully designed. Existing MPI collective algorithms have been hierarchically designed to saturate inter-node network bandwidth for maximal communication performance. Meanwhile, advanced shared memory techniques such as XPMEM, KNEM, and CMA have been adopted to accelerate intra-node MPI collective communication. Unfortunately, these studies mainly focus on large-message collective optimization, leaving small- and medium-message MPI collectives suboptimal. Moreover, they cannot achieve optimal performance due to the limitations of the underlying shared memory techniques.
To solve these issues, we first present CAB-MPI, an MPI implementation that can identify idle processes inside MPI and use these idle resources to dynamically balance the communication workload on a node. We design throughput-optimized strategies to ensure efficient stealing of data movement tasks. Through a set of microbenchmarks and proxy applications on Intel Xeon and Xeon Phi platforms, the experimental results demonstrate the benefits of CAB-MPI for several internal operations in MPI, including intra-node data transfer, pack/unpack for noncontiguous communication, and computation in one-sided accumulates. Then, we propose a novel Dynamic Asynchronous Progress Stealing model (Daps)
to completely address the asynchronous progress problem; Daps is implemented inside the MPI runtime and dynamically leverages idle MPI processes to steal communication progress tasks from busy computing processes on the same node. We compare Daps with state-of-the-art asynchronous progress approaches using both microbenchmarks and HPC proxy applications, and the results show that Daps outperforms the baselines and incurs less idleness during asynchronous communication. Finally, to further
improve MPI collective performance, we propose the Process-in-Process based Multiobject Interprocess MPI Collective (PiP-MColl) design to maximize small- and medium-message MPI collective performance at large scale. Unlike previous studies, PiP-MColl is designed with efficient multiple-sender and multiple-receiver collective algorithms and adopts the Process-in-Process shared memory technique to avoid unnecessary system call and page fault overhead, thereby achieving the best intra- and inter-node message rate and throughput. We apply PiP-MColl to three widely used MPI collectives: MPI_Scatter, MPI_Allgather, and MPI_Allreduce. Our microbenchmark and real-world HPC application results show that PiP-MColl significantly improves collective performance at large scale compared with the baseline PiP-MPICH and other widely used MPI libraries such as Open MPI, MVAPICH2, and Intel MPI.