eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations

Efficient Execution of Scientific Applications on Heterogeneous Architectures

Abstract

Today's heterogeneous architectures bring together multiple general-purpose CPUs, domain-specific GPUs, and FPGAs to provide dramatic speedups for many applications. The challenge, however, lies in using these heterogeneous processors together so that workload completion time is minimized. Operating system and application development for these systems is still in its infancy.

In this dissertation, we propose various techniques to improve overall system throughput on heterogeneous systems. We develop run-time and compile-time mechanisms to distribute the workload efficiently among the available processors and accelerators and to transfer the data they need for execution. We explore various data partitioning, synchronization, and scheduling schemes to improve load balance, maximize resource utilization, and minimize execution time. First, we propose a dynamic scheduling mechanism that incorporates all available processing units in the execution of a given parallel loop. Our scheme automatically detects the computation speed of each CPU and accelerator and distributes the workload accordingly at run time. We then focus on improving data transfers over the PCI-e bus to further increase system throughput in the presence of multiple applications sharing a single GPU. We present a framework that automatically overlaps data transfers with execution without requiring any modifications to source code.
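
As a rough illustration of the speed-proportional distribution described above, the following Python sketch splits the iterations of a parallel loop among devices according to their measured throughput. It is not the dissertation's implementation: the function names `measure_rates` and `split_by_speed`, and the idea of re-measuring after each round of chunks, are assumptions made only for this example.

```python
def measure_rates(chunk_sizes, elapsed_seconds):
    """Estimate each device's speed as iterations completed per second.

    Hypothetical helper: assumes every device has already run a trial chunk.
    """
    return [n / t for n, t in zip(chunk_sizes, elapsed_seconds)]


def split_by_speed(n_iters, rates):
    """Split range(n_iters) into contiguous [start, end) chunks.

    Each device receives a share of the iterations proportional to its rate.
    """
    total = sum(rates)
    bounds = [0]
    acc = 0.0
    for r in rates[:-1]:
        acc += r
        bounds.append(round(n_iters * acc / total))
    bounds.append(n_iters)  # the last device takes whatever remains
    return [(bounds[i], bounds[i + 1]) for i in range(len(rates))]


# Example: a CPU and an accelerator each ran a trial chunk in one second.
rates = measure_rates([10, 30], [1.0, 1.0])
chunks = split_by_speed(100, rates)
```

In a dynamic scheme of this kind, the rates would presumably be refreshed after each round so that later chunks follow any change in device speed.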

We further improve heterogeneous system efficiency by optimizing in-GPU execution. We find that barrier synchronization is a performance bottleneck. We develop a task-based execution scheme that uses distributed queues to achieve better cache locality and inter-SM load balance. Finally, we introduce a tile-based wavefront execution technique that removes global barriers and employs a novel peer-SM synchronization mechanism. Through extensive experiments, we observe that our schemes significantly outperform existing state-of-the-art approaches.
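
To make the idea of replacing global barriers with peer-level synchronization concrete, here is a minimal sequential Python sketch. It assumes a common wavefront pattern in which each tile depends only on the tiles directly above it and to its left; the function name `wavefront_order` and the per-tile dependency counters are illustrative assumptions, not the peer-SM mechanism developed in the dissertation.

```python
from collections import deque


def wavefront_order(rows, cols):
    """Return an execution order for a rows x cols grid of tiles.

    Assumed dependency pattern: tile (i, j) needs (i-1, j) and (i, j-1).
    A tile becomes ready as soon as its own predecessors finish, so no
    step waits for an entire anti-diagonal to complete.
    """
    # remaining[i][j] counts unfinished predecessor tiles (0, 1, or 2).
    remaining = [[(i > 0) + (j > 0) for j in range(cols)] for i in range(rows)]
    ready = deque([(0, 0)])
    order = []
    while ready:
        i, j = ready.popleft()
        order.append((i, j))  # stand-in for executing the tile
        # Signal only the two dependent neighbours (below and right).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < rows and nj < cols:
                remaining[ni][nj] -= 1
                if remaining[ni][nj] == 0:
                    ready.append((ni, nj))
    return order


order = wavefront_order(3, 4)
```

On a GPU, the counters would be updated with atomic operations by the tiles' producers, and the ready queue would be shared or distributed among SMs; this sketch only shows the dependency bookkeeping.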
