
UC Riverside Electronic Theses and Dissertations

Checksum-Based Fault Tolerance for LU Factorization

Abstract

As high-performance computing systems grow, so does their probability of failure: the more processors an application uses, the more likely it is that a failure will occur before the computation completes. For long-running applications on large numbers of processors, fault tolerance is therefore essential to avoid losing all completed work after a failure. Errors fall into two main classes: loss of data and corruption of data. A fail-stop failure, in which a process is lost along with its data, can be handled for any application by checkpointing. Although checkpointing has long been a useful way to tolerate failures, it often introduces considerable overhead, especially when an application modifies a large amount of memory between checkpoints and the number of processors is large. An application-specific approach to fail-stop failures can therefore provide fault tolerance with much lower overhead.

Soft errors are errors in calculations that cannot be detected by any other means: they typically corrupt data so that the program delivers a wrong result at the end, while leaving no other sign that an error occurred. It has been proved that if LU factorization is used to factor a checksum matrix, the checksum relationship is preserved at the end of the computation. For some LU factorization algorithms, the checksum relationship of the input checksum matrix is not maintained in the middle of the computation; in the right-looking LU factorization algorithm, however, it is maintained at every step. Based on this checksum relationship, errors in the LU factorization can be detected and corrected before they propagate. The frequency of checking can be adjusted to the error rate, yielding a flexible method of fault tolerance. Experimental results on the Jaguar supercomputer demonstrate that the overhead introduced by the proposed recovery scheme is minimal.

These techniques are guaranteed to handle at least one error; in certain cases multiple simultaneous errors can be handled, but this is not guaranteed. If multiple errors are likely, multiple checksums, computed with different coefficients over the same set of elements, can be used to handle more than one simultaneous error. Because recovery is carried out in real arithmetic, it is extremely important that the coefficients be chosen to minimize the numerical error a recovery introduces; this is the main challenge in extending the scheme to recover from multiple errors.
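The preserved checksum relationship can be verified directly in a few lines. The following is a minimal sketch, not the dissertation's implementation: it appends a column of row sums to a matrix, runs right-looking LU factorization in place, and checks at every elimination step that the checksum column of the trailing Schur complement still equals the row sums of its data columns. The matrix size, random seed, use of NumPy, and the diagonally dominant test matrix (so pivoting can be omitted) are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Diagonally dominant test matrix so elimination is safe without pivoting.
A = rng.standard_normal((n, n)) + n * np.eye(n)
e = np.ones(n)

# Append a checksum column: each entry is the sum of that row of A.
Ac = np.hstack([A, (A @ e)[:, None]])

# Right-looking LU factorization performed in place on Ac.
for k in range(n):
    Ac[k+1:, k] /= Ac[k, k]                                # multipliers (column of L)
    Ac[k+1:, k+1:] -= np.outer(Ac[k+1:, k], Ac[k, k+1:])   # rank-1 trailing update

    # The trailing submatrix is a Schur complement, and its checksum
    # column still equals the row sums of its data columns.
    trailing = Ac[k+1:, k+1:n]
    assert np.allclose(Ac[k+1:, n], trailing.sum(axis=1))

# At completion the factored matrix is [U | U @ e]: the checksum column
# equals the row sums of U, so the input checksum relation is preserved.
U = np.triu(Ac[:, :n])
assert np.allclose(Ac[:, n], U @ e)
print("checksum relation maintained at every step")
```

Because the trailing submatrix at each step is itself a checksum matrix, a corrupted entry produces a nonzero checksum residual as soon as it is checked, allowing detection and correction before the next update propagates the error.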

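To illustrate how checksums computed with different coefficients over the same elements can locate as well as detect an error, the sketch below uses the simple coefficient vectors (1, ..., 1) and (1, 2, ..., n). These coefficients are illustrative assumptions only; choosing coefficients that keep the recovery numerically accurate is precisely the challenge the abstract describes.

```python
import numpy as np

n = 8
w1 = np.ones(n)                 # plain checksum coefficients (illustrative)
w2 = np.arange(1, n + 1.0)      # distinct weighted coefficients (illustrative)

rng = np.random.default_rng(1)
row = rng.standard_normal(n)

# Encode: two checksums over the same elements with different coefficients.
c1, c2 = w1 @ row, w2 @ row

# Inject a single soft error at an unknown position.
corrupted = row.copy()
corrupted[5] += 3.7

# Detect: a nonzero residual in the plain checksum flags an error.
r1 = w1 @ corrupted - c1        # equals the injected error magnitude
r2 = w2 @ corrupted - c2        # equals (position weight) * magnitude

# Locate and correct: the residual ratio recovers the coefficient,
# and hence the index, of the corrupted element.
j = int(round(r2 / r1)) - 1
corrupted[j] -= r1

assert j == 5 and np.allclose(corrupted, row)
print("error located at index", j, "and corrected")
```

Because the ratio r2 / r1 and the correction are computed in floating-point arithmetic, poorly scaled coefficients would amplify rounding error in the recovered value, which is why the coefficient choice is the central difficulty in extending the scheme to multiple simultaneous errors.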