POURGHASSEMI, BEHNAM

cudaCR: An In-kernel Application-level Checkpoint/Restart Scheme for CUDA Applications

2017

Advisor(s): Chandramowlishwaran, Aparna

Creative Commons 'BY' version 4.0 license

Abstract

Fault tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores shortens the mean time between failures and consequently raises the probability of errors. Among software fault-tolerance techniques, checkpoint/restart is the most commonly used in supercomputers and is the de facto standard for large-scale systems. Although several checkpoint/restart implementations exist for CPUs, only a handful have been proposed for GPUs, even though more than 60 supercomputers in the TOP500 list are heterogeneous CPU-GPU systems.

In this work, we propose a scalable application-level checkpoint/restart scheme, called cudaCR, for long-running kernels on NVIDIA GPUs. Unlike state-of-the-art approaches, our scheme captures GPU state inside the kernel and rolls back to the previous state within the same kernel. This thesis presents the cudaCR implementation in detail and evaluates its first version on a Tesla K40 GPU, using application benchmarks with different characteristics such as dense matrix multiply, stencil computation, and k-means clustering. We observe that cudaCR can fully restore state with low overhead in both time (less than 10% in the best case) and memory requirements after applying a number of optimizations (storage gain: 54% for dense matrix multiply, 31% for k-means, and 4% for stencil computation). Looking forward, we identify new optimizations that further reduce the overhead and make cudaCR highly scalable.

UC Irvine
