- Main
Novel Multiresolution Pattern Detection Techniques in Large Scale Time Series Data
- Shahcheraghi, Seyedehmaryam
- Advisor(s): Keogh, Eamonn
Abstract
Time series data are one the most important data type in data science field. Analyzing the local patterns in time series data can be leveraged in diverse domains such as astronomy, biology, entomology, economics, etc. Time series analytical reports can help the experts in these field to not only explain the behavior of their system, but to make predictions and even get an insight into what has been unknown for them. For instance, the repetition of a behavior in a different scale or speed might have been unseen from a sight of the experts in a large scale time series record. Anomalies can be found in time series data fairly easily. Approximately conserved patterns in time series, motifs, are another term of our interest.Time series similarity matrices (informally, recurrence plots or dot-plots), are useful tools for time series data mining. They can be used to guide data exploration, and various useful features can be derived from them and then fed into downstream analytics. However, time series similarity matrices suffer from very poor scalability, taxing both time and memory requirements. In this dissertation, we introduce novel ideas that allow us to scale the largest time series similarity matrices that can be examined by several orders of magnitude. The first idea is a novel algorithm to compute the matrices in a way that removes dependency on the subsequence length. This algorithm is so fast that it allows us to now address datasets where the memory limitations begin to dominate. Our second novel contribution is a multiscale algorithm that computes an approximation of the matrix appropriate for the limitations of the user’s memory/screen-resolution, then performs a local, just-in-time recomputation of any region that the user wishes to zoom-in on. Given that this largely removes time and space barriers, human visual attention then becomes the bottleneck. We further introduce algorithms that search massive matrices with quadrillions of cells and then prioritize regions for later examination by either humans or algorithms. We demonstrate the utility of our ideas for data exploration, segmentation, and classification in domains as diverse as astronomy, bioinformatics, entomology, and wildlife monitoring. Moreover, as noted approximately repeated subsequences in a longer time series, i.e. time series motifs, are important primitive in time series data mining. Motifs are used in dozens of downstream tasks, including classification, clustering, summarization, rule discovery, segmentation etc. Time series motif discovery is notoriously computationally expensive task. Some motif discovery algorithms are fast in the best case, but in other datasets, even if both the data and motif lengths are held the same, both their time and space complexity can explode. The Matrix Profile has the nice property that its time and space complexity are independent of the data. Moreover, the Matrix Profile is fast enough for datasets in the with say one million datapoints, which covers a large fraction of user cases. However, there are situations where we may wish to consider datasets which are much larger. We introduce the first lower bound for the Matrix Profile and an algorithm that exploits that lower bound to allow orders of magnitude speeds on most real-world datasets. We demonstrate the utility of our ideas with the largest and most ambitious motif discovery experiments ever attempted.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-