Runtime resource management for heterogeneous computing systems is becoming increasingly complex. Workloads on these platforms are growing more diverse, and heterogeneous architectural components increasingly conflict over their resource demands. The goal of runtime resource management mechanisms is to meet overall system goals under dynamic workloads while coordinating system resources in a robust and adaptive manner.
To address this complexity, state-of-the-art techniques based on heuristics or machine learning have been proposed. Conventional control theory, by contrast, can provide formal guarantees. However, modeling the system dynamics of heterogeneous computing platforms can make it unmanageably complex.
In this thesis, we first analyze a variety of runtime resource management methods and introduce a classification of these methods based on the resources they control and the metrics they target. We cover heuristic, machine learning, and control-theoretic methods used to manage system metrics such as performance, power, energy, temperature, Quality-of-Service (QoS), and reliability.
In addition, we explore a variety of dynamic resource management frameworks that provide significant gains in self-optimization and self-adaptivity. These include simulation infrastructures, hardware platforms enhanced with multi-layer management mechanisms, and the corresponding software frameworks that enable effective and adaptive management policies for these systems.
Finally, we address the problem of optimizing energy efficiency, power consumption, performance, and QoS in heterogeneous systems by proposing adaptive runtime policies. The proposed methods take into account user-defined constraints and requirements, dynamic workloads, and the coordination of conflicting objectives.
The projects presented in this dissertation respond effectively to abrupt changes in heterogeneous computing systems by dynamically adapting to changing application and system behavior at runtime. As a result, they provide significant improvements over commonly used static resource management methods.