MPI is the de facto standard for message-passing parallel programming. However, bug detection support for MPI applications is lacking. This thesis addresses the challenges of detecting bugs in MPI applications. Specifically, it tackles two kinds of bugs: (1) general software bugs (e.g., segmentation faults, assertion violations, and infinite loops) that lead to abnormal termination or program hangs at small scale, i.e., when a program is executed with only a few processes and a small problem size; and (2) scaling problems that manifest only at large scale, i.e., when a program is executed with a large number of processes or a large problem size.
To aid in the detection of general bugs, we developed COMPI, an automated bug detection tool. COMPI tackles two major challenges. First, it provides an automated testing framework for MPI programs: it performs concolic execution on a single process and records branch coverage across all processes. Second, COMPI effectively controls the cost of testing, since too high a cost may prevent its adoption or even make testing infeasible.
Furthermore, we enhanced the usability of COMPI by addressing two issues: input values generated by COMPI do not deliver cost-effective testing, and COMPI does not support floating-point arithmetic, so much code cannot be explored. We address the first issue by proposing a novel input tuning technique that requires no user intervention. We address the second by enabling the handling of floating-point data types and operations, and we demonstrate that the efficiency of constraint solving can be improved by using reals instead of floating-point values.
To tackle scaling problems, we provide a test suite and design an avoidance framework for scaling problems associated with the use of MPI collectives. To improve users' productivity, we establish the necessity of user-side testing and provide a protection layer that avoids scaling problems non-intrusively, i.e., without requiring any changes to the MPI library or user programs. This provides an immediate remedy when an official fix is not readily available.
Finally, we built ParaStack, a hang detection tool that saves computing resources in the presence of program hangs at large scale. ParaStack is an extremely lightweight tool that detects hangs promptly and with high accuracy, scales with negligible overhead, and does not require the user to select a timeout value. For a detected hang, it tells users whether the hang results from an error in the computation phase or in the communication phase. For a hang induced by a computation error, it pinpoints the faulty process by excluding the hundreds or thousands of other processes.