Faced with exponential growth in computing requirements, programmable hardware accelerators such as GPUs and FPGAs are becoming increasingly popular in high-performance computing systems. Given the energy efficiency and scalability challenges in these systems, it is crucial to use hardware resources efficiently while still meeting reliability requirements. Traditional methods meet system reliability requirements by adding redundancy in hardware or software. However, these redundancy-based error mitigation techniques use hardware resources inefficiently. The goal of this dissertation is to devise low-overhead approaches that mitigate the fault susceptibility of hardware accelerators and use their available resources efficiently.
To mitigate fault susceptibility in GPU accelerators, this dissertation proposes a software-based approach that isolates faulty components through task migration. Because GPUs lack a configurable scheduler, the proposed solution uses introspective kernels to make this task migration effective. The technique has very low performance and energy overhead, extends accelerator lifetime, and lowers overall system cost. For FPGA accelerators, faulty component isolation is handled with a directive-based method through the synthesis tool.
This dissertation also presents practical optimization methods for using the available resources on programmable hardware accelerators efficiently. These optimizations are performed at different levels of abstraction that are useful for GPUs and FPGAs, and the trade-offs among them are elaborated. For GPUs, optimization opportunities are explored at the hardware level and the source level. For FPGAs, optimizations are studied at the compiler, source, and algorithm levels. These optimization methods seek to remove unnecessary redundancies from the program or the hardware. Overall, this dissertation demonstrates practical and efficient approaches for utilizing fault-susceptible programmable hardware accelerators and improving their efficiency in terms of both cost per performance and energy.