TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis
At present, I/O performance analysis requires different tools to characterize individual components of the I/O subsystem, and institutional I/O expertise is relied upon to translate these disparate data into an integrated view of application performance. This process is labor-intensive and not sustainable as the storage hierarchy deepens and system complexity increases. To address this growing disparity, we have developed the Total Knowledge of I/O (TOKIO) framework to combine the insights from existing component-level monitoring tools and provide a holistic view of performance across the entire I/O stack. A reference implementation of TOKIO, pytokio, is presented here. Using monitoring tools included with Cray XC and ClusterStor systems alongside commonly deployed community-supported tools, we demonstrate how pytokio provides a lightweight foundation for holistic I/O performance analyses on two Cray XC systems deployed at different HPC centers. We present results from integrated analyses that allow users to quantify the degree of I/O contention that affected their jobs and probabilistically identify unhealthy storage devices that impacted their performance.We also apply pytokio to inspect the utilization of NERSC’s DataWarp burst buffer and demonstrate how pytokio can be used to identify users and applications who may stand to benefit most from migrating their workloads from Lustre to the burst buffer.