eScholarship
Open Access Publications from the University of California

UC Merced

UC Merced Electronic Theses and Dissertations

Characterization and Modeling of Error Resilience in HPC Applications

Creative Commons 'BY' version 4.0 license
Abstract

HPC systems are widely used in industrial, economic, and scientific applications, many of which are safety- and time-critical. We must ensure that application execution is reliable and that the outcome of scientific simulation is trustworthy. As HPC systems continue to grow in computational power and size, next-generation systems are expected to incur a higher failure rate than contemporary ones. Ensuring the integrity of scientific computing in the presence of an increasing number of system faults is one of the grand challenges (also known as the resilience challenge) for large-scale HPC systems.

This dissertation focuses on characterizing and modeling error resilience, and on developing and advancing resilience strategies and tools for HPC systems, so that scientific applications can better survive system failures. In particular, we systematically characterize HPC applications to find the reasons behind their natural error resilience, both by tracking error propagation and by using machine learning to capture application properties according to their significance to application error resilience. We further model application error resilience at different granularities, including individual data objects, small computation kernels, and the whole application. We also develop an error resilience benchmark suite to comprehensively evaluate and compare different error resilience designs in the presence of MPI process or node failures. With the knowledge gained from characterizing and modeling application error resilience, we propose a collection of new methodologies and tools that can guide HPC practitioners toward the most effective and efficient error resilience designs, help improve the effectiveness and efficiency of existing error resilience designs, and lay a foundation for future error resilience designs that aim at higher effectiveness and efficiency in HPC systems.
