Spatio-temporal analysis of HPC I/O and connection data
Published Web Location: https://sdm.lbl.gov/oapapers/snta18-kim.pdf
An HPC system consists of multiple layers of software and hardware for I/O and networking. System logs are helpful resources for understanding what is happening in the system. A challenge is that analyzing the logs maintained at the various levels of the stack is non-trivial, and analyzing each log independently can lead to incomplete conclusions because each log covers only part of the system. This work takes a comprehensive approach that incorporates logs from multiple layers and components in order to facilitate the detection of anomalous activities. The research aims to identify and predict potential performance bottlenecks in the HPC system by capturing temporal variation patterns in heterogeneous, high-dimensional, and non-linear log data. In this paper, we share our preliminary efforts toward spatio-temporal analysis of HPC I/O and connection data, along with initial observations from analyzing one week of HPC log data collected from one of the NERSC systems.
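To make the cross-layer idea concrete, here is a minimal sketch. It is not the authors' method. It assumes that each log source (here, two hypothetical streams named `io` and `conn`) has been reduced to event timestamps in seconds. It bins every source into the same fixed time windows so that the layers are aligned in time, then flags windows in which a layer's event count deviates from that layer's mean by more than a z-score threshold. The window width, the threshold, the sample data, and all function names are illustrative assumptions. A simple z-score stands in here for the richer models that high-dimensional, non-linear log data would likely require.

```python
from statistics import mean, pstdev


def counts_per_window(timestamps, window, n_windows):
    """Count events in each fixed-width time window [k*window, (k+1)*window)."""
    counts = [0] * n_windows
    for ts in timestamps:
        idx = int(ts // window)
        if 0 <= idx < n_windows:  # drop events outside the analysis horizon
            counts[idx] += 1
    return counts


def joint_features(layers, window, horizon):
    """Align every log layer on the same time windows: {layer: [count per window]}."""
    n_windows = int(horizon // window)
    return {name: counts_per_window(ts, window, n_windows) for name, ts in layers.items()}


def flag_anomalies(features, threshold=3.0):
    """Map window index -> names of layers whose count z-score exceeds threshold."""
    flagged = {}
    for name, series in features.items():
        mu, sigma = mean(series), pstdev(series)
        if sigma == 0:  # a constant series carries no anomaly signal
            continue
        for i, count in enumerate(series):
            if abs(count - mu) / sigma > threshold:
                flagged.setdefault(i, []).append(name)
    return dict(sorted(flagged.items()))


# Hypothetical data: 60 s windows over a 20-minute horizon.
io_ts = [i * 60 + 1 for i in range(20)] + [7 * 60 + 5] * 19  # I/O burst in window 7
conn_ts = [i * 60 + t for i in range(20) for t in (10, 20)] + [12 * 60 + 30] * 20  # connection spike in window 12
features = joint_features({"io": io_ts, "conn": conn_ts}, window=60.0, horizon=1200.0)
print(flag_anomalies(features))  # {7: ['io'], 12: ['conn']}
```

Because all layers share one time axis, a window flagged by several layers at once indicates a cross-layer event. Analyzing each log in isolation would not reveal that coincidence.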