High-performance parallel storage systems, such as those used in high-performance computing and data centers, can suffer performance degradation when a large number of clients contend for limited resources such as bandwidth. This kind of contention is common in any storage system that serves a large number of users or applications; it lowers system performance and causes unpredictable variations in speed. In large storage systems, the resulting degradation can waste significant resources.
This thesis describes the Automatic Storage Contention Alleviation and Reduction system (ASCAR), a storage traffic management system for improving bandwidth utilization and the fairness of resource allocation. ASCAR is a fully autonomous software system. It requires no change to the hardware or system design, and integrates well with existing systems. At a high level, ASCAR measures the running state of the system and its workloads and tunes one or more parameters in order to raise a user-designated performance metric. The metric can be any measurable property of the system or the workloads, such as I/O throughput, latency, or application runtime.
ASCAR includes two sets of algorithms for different tuning requirements. The first method is rule-based. Each client's control agent regulates its traffic independently according to a preloaded rule set. Rule-based client controllers respond quickly to burst I/O because they need no runtime coordination with other clients or with a central coordinator; they are also autonomous, so the system has no scale-out bottleneck. Finding optimal rules can be a challenging task that requires expertise and numerous experiments. ASCAR therefore includes the SHAred-nothing Rule Producer (SHARP), which produces and refines control rules iteratively without the need for human supervision. SHARP systematically explores the solution space of possible rule designs and evaluates the target workload under the candidate rule sets.
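The iterative produce-and-refine loop described above can be illustrated with a minimal sketch. It is not SHARP's actual algorithm: the rule format (a congestion signal mapped to a window adjustment), the toy contention model in `simulate_throughput`, and the simple hill-climbing search in `refine_rules` are all assumptions chosen only to show the shape of the idea, namely that candidate rule sets are evaluated against a workload and kept only when they score better.

```python
import random

# Hypothetical rule format: each coarse congestion signal maps to a change
# in a client's I/O window size. None of these names come from ASCAR.
SIGNALS = ("low", "medium", "high")

def simulate_throughput(rules, steps=200, capacity=64, clients=8):
    """Toy stand-in for running the target workload under a rule set."""
    windows = [8] * clients
    total = 0.0
    for _ in range(steps):
        demand = sum(windows)
        load = demand / capacity
        signal = "low" if load < 0.8 else "medium" if load <= 1.0 else "high"
        # Assumed contention penalty: oversubscription wastes capacity.
        served = demand if demand <= capacity else capacity * capacity / demand
        total += served
        # Every client applies the same rule independently, with no
        # coordination between clients.
        windows = [max(1, w + rules[signal]) for w in windows]
    return total / steps

def refine_rules(initial, rounds=50, seed=0):
    """Hill-climb over rule sets, keeping a candidate only if it scores higher."""
    rng = random.Random(seed)
    best, best_score = dict(initial), simulate_throughput(initial)
    for _ in range(rounds):
        candidate = dict(best)
        candidate[rng.choice(SIGNALS)] += rng.choice((-1, 1))
        score = simulate_throughput(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

if __name__ == "__main__":
    start = {"low": 2, "medium": 1, "high": 0}
    print("baseline:", simulate_throughput(start))
    print("refined: ", refine_rules(start))
```

In this sketch the expensive step is the workload evaluation, which is why the search runs offline and the resulting rule set is simply preloaded onto each client.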
The second method uses a neural network-based reinforcement learning method called deep Q-learning (DQL) to continually analyze the states of the system and workload and to tune the values of the traffic control parameters. This method is named CAPES, the Computer Automated Performance Enhancement System. DQL is an unsupervised machine learning method that requires no prior knowledge of the system or workloads, does not need an existing dataset for training, and performs well on diverse input data featuring long delays between action and reward. Most complex storage systems show such a property: there is usually a long delay between setting a traffic control parameter and the resulting change in traffic metrics. A multilayered deep neural network is chosen as DQL's value function, and experience replay is used to mitigate overfitting.
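The Q-learning update and experience replay mentioned above can be sketched in a few lines. This is an illustrative toy, not CAPES itself: a lookup table stands in for the deep neural network value function so the example needs no external libraries, the single tunable parameter with `N_LEVELS` settings and the `throughput` function are hypothetical, and the long action-to-reward delay of a real storage system is not modeled.

```python
import random

ACTIONS = (-1, 0, 1)   # decrease, keep, or increase the tuned parameter
N_LEVELS = 10          # hypothetical number of discrete parameter settings
BEST_LEVEL = 6         # toy optimum, unknown to the agent

def throughput(level):
    """Toy reward: highest (zero) at the best setting, lower elsewhere."""
    return -abs(level - BEST_LEVEL)

def step(level, action_index):
    """Apply an action and clamp the parameter to its valid range."""
    return min(N_LEVELS - 1, max(0, level + ACTIONS[action_index]))

def train(episodes=300, steps=20, alpha=0.2, gamma=0.9, epsilon=0.2,
          batch=32, seed=0):
    rng = random.Random(seed)
    # Tabular value function (assumption: replaces the deep network).
    q = [[0.0] * len(ACTIONS) for _ in range(N_LEVELS)]
    replay = []  # experience replay buffer, unbounded here for brevity
    for _ in range(episodes):
        level = rng.randrange(N_LEVELS)
        for _ in range(steps):
            # Epsilon-greedy action selection.
            if rng.random() < epsilon:
                a = rng.randrange(len(ACTIONS))
            else:
                a = max(range(len(ACTIONS)), key=lambda i: q[level][i])
            nxt = step(level, a)
            replay.append((level, a, throughput(nxt), nxt))
            level = nxt
            # Learn from a random minibatch of past experience rather than
            # only from the most recent transition.
            for s, act, r, s2 in rng.sample(replay, min(batch, len(replay))):
                target = r + gamma * max(q[s2])
                q[s][act] += alpha * (target - q[s][act])
    return q

def greedy_level(q, start, steps=20):
    """Follow the learned policy from a starting parameter setting."""
    level = start
    for _ in range(steps):
        level = step(level, max(range(len(ACTIONS)), key=lambda i: q[level][i]))
    return level

if __name__ == "__main__":
    q = train()
    print([greedy_level(q, s) for s in range(N_LEVELS)])
```

Sampling minibatches from the buffer decorrelates consecutive observations, which is the role experience replay plays when the value function is a neural network.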
SHARP and CAPES are complementary and cover different tuning requirements. SHARP is best for relatively stable workloads; it requires no runtime communication between agents and can therefore easily scale to support very large storage systems. CAPES is best for tuning unpredictable workloads and requires communication between monitoring and control agents.
SHARP and CAPES are evaluated on the Lustre parallel file system. Lustre distributes I/O requests to many servers in parallel in order to reach high performance, and can multiply the number of application I/O requests, causing contention throughout the system. Both SHARP and CAPES are effective at improving the throughput of the test workloads in the evaluation, some by as much as 45%.