Kang, Qiao; Agrawal, Ankit; Choudhary, Alok; Sim, Alex; Wu, Kesheng; Kettimuthu, Rajkumar; Beckman, Peter H; Liu, Zhengchun; Liao, Wei-keng

doi:10.1109/bigdata47090.2019.9006046

Spatiotemporal Real-Time Anomaly Detection for Supercomputing Systems

2019

Published Web Location

https://doi.org/10.1109/bigdata47090.2019.9006046

Abstract

The demands of increasingly large scientific application workflows drive the need for more powerful supercomputers. As the scale of supercomputing systems has grown, failure prediction for fault tolerance has become an increasingly critical area of study, since predicting system failures can improve performance by allowing checkpoints to be saved in advance. We propose a real-time failure detection algorithm that adopts an event-based prediction model. The prediction model is a convolutional neural network that uses both traditional event attributes and additional spatio-temporal features. We present a case study using our proposed method with six years of reliability, availability, and serviceability (RAS) event logs recorded by Mira, a Blue Gene/Q supercomputer at Argonne National Laboratory. In the case study, we show that our failure prediction model is not limited to predicting the occurrence of failures in general; it is also capable of accurately detecting specific types of critical failures, such as coolant and power problems, within reasonable lead time ranges. Our case study shows that the proposed method can achieve an F1 score of 0.56 for general failures, 0.97 for coolant failures, and 0.86 for power failures.
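For readers unfamiliar with the metric reported in the abstract, the F1 score is the harmonic mean of precision and recall. The sketch below shows the standard definition only; the function name `f1_score` and the counts passed to it are hypothetical illustrations and are not taken from the paper or its evaluation code.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall) from raw counts."""
    predicted = true_positives + false_positives   # events flagged as failures
    actual = true_positives + false_negatives      # events that really were failures
    if predicted == 0 or actual == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = f1_score(true_positives=80, false_positives=10, false_negatives=30)
print(f"F1 = {example:.2f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a high F1 score requires both few false alarms and few missed failures.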

Computing Sciences
