UC San Diego
Approximate Computing for GPGPU Acceleration
- Author(s): Peroni, Daniel Nikolai
- Advisor(s): Rosing, Tajana S
- et al.
Faster and more efficient hardware is needed to handle the rapid growth of Big Data processing. Applications such as multimedia, medical analysis, computer vision, and machine learning can be parallelized and accelerated using General-Purpose Computing on Graphics Processing Units (GPGPUs). However, GPGPUs are power intensive, and novel approaches are needed to improve their efficiency. Many of these applications also tolerate noise within their computation. Approximate computing is a design strategy in which energy savings and speedup are achieved at the expense of accuracy. If the error is carefully controlled, many applications can accept small amounts of it and still produce acceptable results. This thesis proposes a number of methods to enable approximate computing for GPUs.
We first examine a number of approaches for approximating operations at the core level. Floating point arithmetic, specifically multiplication, makes up the majority of instructions computed on GPUs. In this dissertation we propose a configurable floating point unit (CFPU), which eliminates the costly mantissa multiply by copying one of the input mantissas directly to the output. For applications with a higher degree of temporal similarity, we propose adaptive lookup (ALook), which uses small dynamic lookup tables to store recently computed operations. This low-power lookup table returns nearest-distance matches as results rather than computing them on the exact hardware.
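The mantissa-copy idea can be illustrated in software. The following is a minimal sketch, not the CFPU design itself: it assumes IEEE 754 single-precision inputs that are finite and normal, and the function names (`approx_mul`, `f32_bits`, `bits_f32`) are hypothetical. The sign and exponent are computed as in an exact multiply, while the output mantissa is copied from the first operand. Dropping the second operand's mantissa, which lies in [1, 2), means the result magnitude is between half of and equal to the exact product.

```python
import struct


def f32_bits(x):
    """Return the 32-bit IEEE 754 single-precision encoding of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Decode a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def approx_mul(a, b):
    """Approximate a * b by copying a's mantissa to the output.

    Illustrative assumption: inputs are finite, normal float32 values.
    Zeros and subnormals are flushed to a signed zero.
    """
    ba, bb = f32_bits(a), f32_bits(b)
    sign = (ba ^ bb) & 0x80000000           # sign of the product
    ea = (ba >> 23) & 0xFF                  # biased exponent of a
    eb = (bb >> 23) & 0xFF                  # biased exponent of b
    if ea == 0 or eb == 0:
        return bits_f32(sign)               # zero or subnormal input
    e = ea + eb - 127                       # add exponents, remove one bias
    if e <= 0:
        return bits_f32(sign)               # underflow: flush to zero
    if e >= 255:
        return bits_f32(sign | 0x7F800000)  # overflow: signed infinity
    mantissa = ba & 0x7FFFFF                # copy a's mantissa, drop b's
    return bits_f32(sign | (e << 23) | mantissa)


if __name__ == "__main__":
    for a, b in [(3.0, 4.0), (3.0, 6.0), (-2.5, 8.0)]:
        print(a, b, "exact:", a * b, "approx:", approx_mul(a, b))
```

When the second operand is a power of two the result is exact; the error grows as its mantissa approaches 2, which is why some mechanism for sending the most erroneous cases to exact hardware is needed.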
GPUs issue threads in groups, commonly 32, called warps. Cores in a warp run the same instructions in lock-step. Every instruction within a warp must be accelerated to provide performance improvements. To control accuracy, we run the most erroneous approximate results on the exact hardware. Bottlenecks can arise as some threads spend time computing exact results while others use approximate solutions. We propose two methods to handle this problem. First, we use warp pass through to target warps in which a very small fraction of threads must be computed exactly. To handle warps with a larger percentage of exact computations, we utilize warp value trading (WVT). Under WVT, operations are traded between warps running on the same multiprocessor to create uniform groups of either exact or approximate operations.
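The grouping goal behind WVT can be sketched in a few lines. This is a software illustration only, under stated assumptions: each operation is modeled as a hypothetical `(op_id, needs_exact)` pair, all warps are assumed to reside on the same multiprocessor, and `WARP_SIZE` is set to 4 for readability rather than the common 32. It does not model the hardware trading mechanism, only the outcome of regrouping operations into uniform warps.

```python
WARP_SIZE = 4  # small value for illustration; GPUs commonly use 32


def trade_operations(warps):
    """Regroup operations so each warp is all-approximate or all-exact
    where possible. Each operation is an (op_id, needs_exact) pair."""
    ops = [op for warp in warps for op in warp]
    approx = [op for op in ops if not op[1]]
    exact = [op for op in ops if op[1]]
    ordered = approx + exact
    # Cut the ordered list back into warps; at most one warp
    # (the one straddling the approx/exact boundary) stays mixed.
    return [ordered[i:i + WARP_SIZE] for i in range(0, len(ordered), WARP_SIZE)]


def count_mixed(warps):
    """Count warps containing both exact and approximate operations."""
    return sum(1 for warp in warps if len({flag for _, flag in warp}) > 1)


if __name__ == "__main__":
    before = [
        [(0, False), (1, True), (2, False), (3, True)],
        [(4, True), (5, False), (6, True), (7, False)],
    ]
    after = trade_operations(before)
    print("mixed warps before:", count_mixed(before))
    print("mixed warps after: ", count_mixed(after))
    print(after)
```

In this toy input both warps start out half exact and half approximate, so neither can finish early; after regrouping, one warp can run entirely on the approximate path.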
Finally, we focus on application-specific approximation. We show that approximation can be used to accelerate neural networks during both training and inference. Early stages of training tolerate more error than later ones, so we adjust the level of approximation over time. To accelerate inference, we vary the degree of approximation with the size of the operation to increase hit rate. For training, we show that gradually reducing the maximum allowed error per operation yields a 7.13x energy-delay product (EDP) improvement and a 4.64x speedup when training four different neural network applications, with less than 2% quality loss. For inference, we automatically select parameters based on user prediction requirements, achieving a 2.9x speedup and a 6.2x EDP improvement across six neural networks.
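The idea of tightening the error bound as training progresses can be sketched as a simple schedule. The linear shape, the function name `max_error_schedule`, and the start and end values (10% and 1%) are assumptions chosen for illustration; the abstract does not specify the schedule used.

```python
def max_error_schedule(epoch, total_epochs, start=0.10, end=0.01):
    """Maximum allowed relative error per operation at a given epoch.

    Hypothetical linear decay from `start` (early, error-tolerant epochs)
    to `end` (late epochs that need more precise arithmetic).
    """
    if total_epochs <= 1:
        return end
    frac = epoch / (total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return start + (end - start) * frac


if __name__ == "__main__":
    total = 5
    for epoch in range(total):
        bound = max_error_schedule(epoch, total)
        print(f"epoch {epoch}: allow up to {bound:.4f} relative error per operation")
```

An operation whose approximate result would exceed the current bound would be sent to the exact hardware, so the fraction of approximated operations shrinks as the bound tightens.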