Wu, Renjie

Problems with Problems in Data Mining

2024

Advisor(s): Keogh, Eamonn

Abstract

Although the term "data mining" did not appear until the 1990s, the process of digging through data to discover correlations, patterns, and knowledge has a long history. With the rapid growth of data in size and complexity, data mining has augmented manual data processing with automated data analysis, assisted by intertwined scientific fields such as statistics, database systems, and machine learning. Unfortunately, we discovered that several highly cited papers in the field of data mining have surprising problems with their proposed algorithms, datasets, or definitions:

1) Not so fast algorithm. For over two decades, Dynamic Time Warping (DTW) has been known as the best measure to use for most tasks, in most domains. Because classic DTW takes quadratic time, FastDTW purportedly offers a way to quickly approximate it. The FastDTW algorithm has well over two thousand citations and has been explicitly used in several hundred research efforts. However, we show that in any realistic setting, the approximate FastDTW is much slower than the exact DTW.

2) Not so good datasets. In recent years, there has been an explosion of interest in time series anomaly detection (TSAD), driven by the success of deep learning in other domains. Most of these papers test on one or more popular benchmark datasets from Yahoo, Numenta, NASA, etc. However, we show that the majority of the individual exemplars in these datasets suffer from one or more of four flaws. Because of these four flaws, much of the apparent progress in recent years may be illusory.

3) Not so clear definition. Early classification of time series (ETSC) generalizes classic time series classification to ask whether we can classify a time series subsequence with sufficient accuracy and confidence after seeing only some prefix of a target pattern. The idea is that earlier classification would allow us to take immediate action while some practical intervention is still possible. However, we show that under reasonable assumptions, no current ETSC algorithm is likely to work in a real-world setting.

In addition to demonstrating our findings, we either provide potential solutions to address these problems, e.g., the UCR Time Series Anomaly Archive, or offer recommendations to the community, e.g., specifications for the definition of ETSC.

UC Riverside
