Exploiting Time Series Primitives to Solve Realistic Data Mining Problems
- Author(s): Begum, Nurjahan
- Advisor(s): Keogh, Eamonn
- et al.
Given the ubiquity of time series data in scientific, medical and financial domains, data miners have made substantial efforts to design efficient algorithms for classification, clustering, rule discovery and anomaly detection for this data type. In particular, all these well-known problems are important as exploratory techniques, and as sub-modules for answering higher level data mining questions. In essence, this means designing simple but efficient algorithms for these primitive modules is a critical part to solve data mining problems with higher complexity. In real-life scenarios, where the time series data is noisy and high-scale, it is important that the algorithms are invariant to irrelevant data and are amiable to optimizations. In addition, the algorithms should be parameter-lite and should involve least possible human intervention. Developing efficient algorithms to solve realistic data mining problems with these design criteria has been a challenging task to researchers over the last few decades. In this dissertation, we propose techniques to design such robust algorithms in classification, clustering and motif discovery context. In particular, the core contribution of this dissertation is as follows:
First, we propose techniques for the previously unsolved problem of finding the stopping criterion of semi-supervised time series classification. The core assumption of most of the time series classification algorithms per se is the availability of copious amount of labeled instances. However, this assumption is often unrealistic, and requires careful attention of data experts, which is a critical bottleneck in data analysis. Semi-supervised learning is an obvious way to mitigate the need for human labor, however, most such algorithms are designed for intrinsically discrete objects, and do not work well in time series domain, which requires the ability to deal with real-valued objects arriving in streaming fashion. In this work, we demonstrate that in many cases, just a handful of human annotated examples are sufficient to perform accurate classification. Also, we devise a novel parameter-free stopping criterion for semi-supervised time series classification.
Second, we propose an algorithm for robustly clustering large time series datasets with invariance to irrelevant data. In this work, we show that for similarity search, even though there is the general superiority of Dynamic Time Warping (DTW) distance measure over Euclidean distance (ED), the same is not true for clustering. Thus, clustering time series under DTW remains a computationally challenging task. We propose a novel pruning strategy that exploits both upper and lower bounds to prune off a large fraction of the expensive distance calculations. For datasets where even this level of speedup is inadequate, we show that we can use a simple heuristic to order the unavoidable calculations in a most-useful-first ordering, thus casting the clustering as an anytime algorithm.
Finally in the last part of the dissertation, we investigate algorithms to find time series motifs which have been shown to have great utility as a subroutine in many higher-level data mining algorithms. As we show, this detection becomes much harder in cases where the motifs of interest are vanishingly rare or when faced with a never-ending stream of data. We demonstrate that under reasonable assumptions, we must abandon any hope of an exact solution to the motif problem as it is normally defined; however we introduce algorithms that allow us to solve the underlying problem with high probability.