Mining Time Series Data: Moving from Toy Problems to Realistic Deployments
- Author(s): Hu, Bing
- Advisor(s): Eamonn, Keogh
- et al.
Data mining and knowledge discovery has attracted a lot of research interest in the last decade. Although there is extensive research in this area, we argue that most of the work is not as useful, since the datasets that they are dealing with and the methods that they proposed to solve the problems are more like `toy examples' compared to the much more complicate real-world scenario. We have observed the following two problems that widely exist in most of data mining research. First, parameters will hurt the potential of spreading the ideas in the research community. In a lot of works, there are usually several parameters to tune in the proposed method. We claim that the parameter turning can kill the usefulness of an algorithm and reduce the number of citations. Second, the prevalently existed assumptions about the data further limit their application to solve the real-world problem. We strive to mitigate the above two problems. The contribution of this dissertation is as follows:
First, we demonstrate a parameter free framework using MDL to discover the intrinsic features of the data. With the intrinsic cardinality and dimensionality of the time series, we can further understand the underlying meaning of the data, before consulting the domain experts. In addition, the intrinsic features can be used as dimensionality reduction and have huge applications in the various lower bounding techniques. Second, we show a time series classification framework that has none of the prevalent assumptions. We propose to use the data editing technique to automatically build a data dictionary. In addition, our classification framework has the capability to say `I do not know' at a certain point when classifying the incoming queries that does not belong to any concept in the training data. Our results show that a small fraction of all the data can achieve even better classification results than using all the data. In the last, we propose a dynamically weighted multi-dimensional classification framework, which can smartly choose the weight of each data dimension. The results over extensive datasets from various domains show that our framework is more accurate and robust to the occluded data.