Today's heterogeneous architectures bring together multiple general-purpose CPUs, domain-specific GPUs, and FPGAs to deliver dramatic speedups for many applications. The challenge, however, lies in utilizing these heterogeneous processors so that overall application performance is optimized and workload completion time is minimized. Operating system and application development for these systems is still in its infancy.
In this dissertation, we propose techniques to improve overall system throughput on heterogeneous systems. We develop run-time and compile-time mechanisms that efficiently distribute the workload among the available processors and accelerators and transfer the data each unit needs for execution. We explore data partitioning, synchronization, and scheduling schemes to improve load balance, maximize resource utilization, and minimize execution time. First, we propose a dynamic scheduling mechanism that incorporates all available processing units in the execution of a given parallel loop. Our scheme automatically detects the computation speed of each CPU and accelerator at run-time and distributes the workload accordingly. We then focus on improving data transfers over the PCIe bus to further increase system throughput when multiple applications share a single GPU. We present a framework that exploits automatic transfer/execution overlapping without requiring any modifications to the source code.
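The speed-aware distribution idea above can be sketched as follows. This is a minimal, hypothetical illustration (not the dissertation's actual implementation): each device runs a small probe chunk, its throughput is measured, and the remaining loop iterations are split in proportion to the measured speeds. The cost-model lambdas standing in for real CPU/GPU kernels are assumptions for the sake of a runnable example.

```python
def partition_by_speed(total_iters, time_chunk, probe_iters=1000):
    """time_chunk maps a device name to a function that runs `n`
    iterations on that device and returns the elapsed seconds.
    Returns how many of the remaining iterations each device gets."""
    # Profile each device on a small probe chunk.
    throughputs = {name: probe_iters / run(probe_iters)
                   for name, run in time_chunk.items()}
    remaining = total_iters - probe_iters * len(time_chunk)
    total_tp = sum(throughputs.values())
    # Split the rest proportionally to measured throughput.
    shares = {name: int(remaining * tp / total_tp)
              for name, tp in throughputs.items()}
    # Give any rounding leftover to the fastest device.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += remaining - sum(shares.values())
    return shares

# Hypothetical analytic cost models: CPU ~1M iters/s, GPU ~4M iters/s.
devices = {"cpu": lambda n: n / 1e6, "gpu": lambda n: n / 4e6}
shares = partition_by_speed(1_000_000, devices)
print(shares)  # → {'cpu': 199600, 'gpu': 798400}
```

In a real system the `time_chunk` callbacks would launch actual CPU threads or GPU kernels and time them, and the partitioning could be re-run periodically to adapt to load changes.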
We further improve heterogeneous system efficiency by optimizing in-GPU execution. We find that global barrier synchronizations are a performance bottleneck. We therefore develop a task-based execution scheme that uses distributed queues to exploit better cache locality and inter-SM load balance. Finally, we introduce a tile-based wavefront execution technique that removes global barriers and employs a novel peer-SM synchronization mechanism. Extensive experiments show that our schemes significantly outperform existing state-of-the-art approaches.
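The barrier-free wavefront idea can be illustrated with a small sequential sketch, under the assumption (not stated in the abstract) that tile (i, j) depends on tiles (i-1, j) and (i, j-1), as in typical wavefront computations. Instead of a global barrier after each diagonal, each tile carries a dependency counter and is enqueued as soon as its own predecessors signal completion, mimicking peer-to-peer (peer-SM-style) synchronization; the grid size and `compute_tile` callback are placeholders.

```python
from collections import deque

def wavefront(n_tiles, compute_tile):
    """Run an n_tiles x n_tiles tile grid where (i, j) depends on
    (i-1, j) and (i, j-1), with no global barrier between diagonals."""
    # Per-tile dependency counters: interior tiles wait on two peers.
    deps = [[(i > 0) + (j > 0) for j in range(n_tiles)]
            for i in range(n_tiles)]
    ready = deque([(0, 0)])          # only the corner tile starts ready
    order = []
    while ready:
        i, j = ready.popleft()
        compute_tile(i, j)
        order.append((i, j))
        # Signal the two successor tiles; enqueue any that become ready.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < n_tiles and nj < n_tiles:
                deps[ni][nj] -= 1
                if deps[ni][nj] == 0:
                    ready.append((ni, nj))
    return order

order = wavefront(3, lambda i, j: None)
print(order)
```

On a GPU the counters would live in global memory and be decremented with atomic operations by the SM that finishes a tile, so no kernel relaunch or grid-wide barrier is needed.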