- Main
A Novel Data Structure and Anomaly Detection for Time Series Data
- Tafazoli, Sadaf
- Advisor(s): Keogh, Eamonn E
Abstract
Time series data is one of the most extensively examined forms of data. According to a recent KDnuggets poll, 48% of analysts have engaged in time series data analysis within the past few years. This places time series data analysis second, surpassed only by relational (table) data analysis and preceding the analysis of text, images, spatial, and social network data [49]. The Matrix Profile is a data structure that enhances time series analysis by recording the Euclidean distance between each subsequence and its nearest neighbor. Leveraging the Matrix Profile, analysts can uncover numerous valuable insights about time series data, such as anomaly detection, chain discovery, motif discovery, and more. However, the Matrix Profile is limited to representing the relationship between the subsequence’s shapes. It is known that, for some domains, useful information is conserved not in the subsequence’s shapes, but in the subsequence’s features. In recent years a new set of features for time series called catch22 has revolutionized feature-based mining of time series. Combining these two ideas seems to offer many possibilities for novel data mining applications, however, there are two difficulties in attempting this. A direct application of the Matrix Profile with the catch22 features would be prohibitively slow. Less obviously, as we will demonstrate, in almost all domains, using all twenty-two of the catch22 features produces poor results, and we must somehow select the subset appropriate for the domain. In this work we introduce novel data structure to solve both problems and demonstrate that, for most domains, the proposed C22MP is a state-of-the-art anomaly detector.Additionally, in this work, we illustrate how we can detect anomalies in multidimensional time series by framing the problem as K of N anomaly detection. The primary challenge in multidimensional time series anomaly detection appears to be that, in any N-dimensional time series, anomalies typically manifest themselves only in K of the time series, where K < N. This leads to a chicken-and-egg problem. If we knew which K time series exhibited the anomaly, it would be easy to discover its location. However, we do not know this in advance, and the search space is of size 2N and not obviously amiable to greedy search. In this work we show a novel, simple algorithm that allows us to quickly find the best K of N anomaly subset for any value of K. Moreover, we show a simple metric that can rank the top anomaly subsets for all values of K from 1 to N. While our methods are mostly agnostic to the anomaly scoring model, for concreteness we use the Matrix Profile, and show that we can discover multi-dimensional anomalies that would escape detection by all current rival methods.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-