Fault Tolerance for Iterative Methods in High-Performance Computing
Iterative methods are commonly used approaches to solve large, sparse linear systems, which are fundamental operations for many modern scientific simulations. When the large-scale iterative methods are running with a large number of ranks in parallel, they are anticipated to be more susceptible to soft errors in both logic circuits and memory subsystems and fail-stop errors in the entire system, considering large component counts and lower power margins of emerging high-performance computing (HPC) platforms.
To protect iterative methods from soft errors, we propose an online algorithm-based fault tolerance (ABFT) approach to efficiently detect and recover soft errors for iterative methods. We design a novel checksum-based encoding scheme for matrix-vector multiplication that is resilient to both arithmetic and memory errors. Our design decouples the checksum updating process from the actual computation and allows adaptive checksum overhead control. Building on this new encoding mechanism, we propose two online ABFT designs that can effectively recover from errors when combined with a checkpoint/rollback scheme. These designs are capable of addressing scenarios under different error rates. Our ABFT approaches apply to a wide range of iterative solvers that primarily rely on matrix-vector multiplication and vector linear operations. We evaluate our designs through comprehensive analytical and empirical analysis. Experimental evaluation on the Stampede supercomputer demonstrates the low-performance overheads incurred by our two ABFT schemes for preconditioned CG (0.4% and 2.2%) and preconditioned BiCGSTAB (1.0% and 4.0%) for the largest SPD matrix from UFL Sparse Matrix Collection. The evaluation also demonstrates the flexibility and effectiveness of our proposed designs for detecting and recovering various types of soft errors in iterative methods.
Iterative methods have to checkpoint the dynamic variables periodically in case of unavoidable fail-stop errors, requiring fast I/O systems and large storage space. Thus, significantly reducing the data to be checkpointed is critical to improving the overall performance of iterative methods. Lossy compression allowing user-controlled data loss can significantly reduce the I/O burden. To this end, we design a new error-controlled lossy compression algorithm for large-scale scientific data. We significantly improve the prediction accuracy for each data point based on its nearby data values along multiple dimensions. We derive a series of multilayer prediction formulas and their unified formula in the context of data compression. One serious challenge is that the data prediction has to be performed based on the preceding decompressed values during the compression in order to guarantee the error bounds, which may degrade the prediction accuracy in turn. We explore the best layer for the prediction by considering the impact of compression errors on the prediction accuracy and propose an adaptive error-controlled quantization encoder, which can further improve the prediction accuracy considerably. The data size can be reduced significantly after performing the variable-length encoding because of the uneven distribution produced by our quantization encoder. We evaluate the new compressor on production scientific data sets and compare it with many other state-of-the-art compressors. Experiments show that our compressor is the best in class, especially with regard to compression ratios and compression errors. Our solution is better than the second-best solution by more than a 2x increase in the compression ratio and 3.8x reduction in the normalized root mean squared error on average, with reasonable error bounds and user-desired bit-rates.
Finally, we design a novel lossy checkpointing scheme that can significantly improve the checkpointing performance of iterative methods by leveraging our proposed lossy compression to tolerate failure-stop errors. We formulate a lossy checkpointing performance model and derive theoretically an upper bound for the extra number of iterations caused by the distortion of data in lossy checkpoints, in order to guarantee the performance improvement under the lossy checkpointing scheme. We analyze the impact of lossy checkpointing (i.e., the extra number of iterations caused by lossy checkpointing files) for multiple types of iterative methods. We evaluate the lossy checkpointing scheme with optimal checkpointing intervals on a high-performance computing environment with 2,048 cores, using a well-known scientific computation package PETSc and a state-of-the-art checkpoint/restart toolkit. Experiments show that our optimized lossy checkpointing scheme can significantly reduce the fault tolerance overhead for iterative methods by 23%∼70% compared with traditional checkpointing and 20%∼58% compared with lossless-compressed checkpointing, in the presence of system failures.