Developing Efficient Algorithms for Data Mining Large Scale High Dimensional Data
- Author(s): Zakaria, Jesin
- Advisor(s): Keogh, Eamonn
- et al.
Data mining and knowledge discovery have attracted a great deal of attention in information technology in recent years. The rapid progress of computer hardware technology over the past three decades has greatly strengthened the database and information industry, and the size and complexity of real-world data are increasing dramatically as hardware improves. Although new efficient algorithms for such data are constantly being proposed, mining large-scale, high-dimensional data still presents many challenges. In this dissertation, several novel algorithms are proposed to address these challenges. The algorithms are applied to domains as diverse as electrocardiography (ECG), stock market data, geospatial data, power supply data, audio data, and image data. This dissertation contributes to the data mining community in the following three ways:
First, we propose a novel algorithm for clustering time series data efficiently in the presence of noise or extraneous data. Most existing methods for time series clustering rely on distances calculated from the entire raw data. As a consequence, most work on time series clustering considers only the clustering of individual time series "behaviors" (e.g., individual heartbeats) and contrives the time series in some way to make them all equal in length. However, for many real-world problems, formatting the data in such a way is often a harder task than the clustering itself. To remove these unrealistic assumptions, we have developed a new primitive called the unsupervised shapelet, or u-shapelet, and shown its utility for clustering time series.
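To illustrate why a short subsequence can be compared against time series of unequal length, the following is a minimal sketch of a subsequence distance: the smallest z-normalized Euclidean distance between a candidate pattern and any equal-length window of a series. This is a common formulation for shapelet-style primitives and is an assumption here, not necessarily the exact measure used in the dissertation; the names `znorm` and `subsequence_distance` are hypothetical.

```python
import math


def znorm(values):
    """Rescale values to zero mean and unit variance (all zeros if constant)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std < 1e-12:
        return [0.0] * n
    return [(v - mean) / std for v in values]


def subsequence_distance(candidate, series):
    """Smallest length-normalized Euclidean distance between the z-normalized
    candidate and every z-normalized window of the same length in series.

    Assumption: illustrative formulation only, not the dissertation's code.
    """
    m = len(candidate)
    if m == 0 or len(series) < m:
        raise ValueError("series must be at least as long as the candidate")
    c = znorm(candidate)
    best = math.inf
    for start in range(len(series) - m + 1):
        w = znorm(series[start:start + m])
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(c, w)) / m)
        best = min(best, d)
    return best
```

Because only the best-matching window contributes, the rest of the series (noise, extraneous data, or extra length) does not affect the distance, so series need not be trimmed to equal length before comparison.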
Second, to make u-shapelet discovery scalable, we propose two optimization techniques, each of which speeds up discovery independently of the other. Moreover, combining the two techniques yields a superlinear speedup. In addition, our u-shapelet discovery algorithm can be cast as an anytime algorithm.
Finally, we have developed a novel and robust algorithm for mining mouse vocalizations using a symbolized representation. Our algorithm processes large-scale, high-dimensional, noisy mouse vocalization data through dimensionality reduction and cardinality reduction, making the data suitable for knowledge discovery tasks such as classification, clustering, similarity search, motif discovery, and contrast set mining.
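The general idea of symbolizing a signal through dimensionality reduction followed by cardinality reduction can be sketched as follows. The abstract does not specify the actual procedure, so everything here is an assumption chosen for illustration: segment averaging for dimensionality reduction, equal-width binning for cardinality reduction, and the hypothetical function names `reduce_dimensionality`, `reduce_cardinality`, and `symbolize`.

```python
def reduce_dimensionality(values, n_segments):
    """Replace the sequence by the means of n_segments near-equal chunks."""
    n = len(values)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(values)")
    means = []
    for i in range(n_segments):
        lo = i * n // n_segments
        hi = (i + 1) * n // n_segments
        chunk = values[lo:hi]
        means.append(sum(chunk) / len(chunk))
    return means


def reduce_cardinality(values, alphabet="abcd"):
    """Map each value to a symbol using equal-width bins over the value range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return alphabet[0] * len(values)
    width = (hi - lo) / len(alphabet)
    symbols = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), len(alphabet) - 1)
        symbols.append(alphabet[idx])
    return "".join(symbols)


def symbolize(values, n_segments, alphabet="abcd"):
    """Assumed two-step pipeline: fewer points first, then fewer distinct values."""
    return reduce_cardinality(reduce_dimensionality(values, n_segments), alphabet)
```

For example, `symbolize([0, 0, 1, 1, 4, 4, 8, 8], 4)` averages the input into `[0.0, 1.0, 4.0, 8.0]` and returns the string `"aacd"`. A short string over a small alphabet is far cheaper to index, compare, and search than the raw signal, which is what makes downstream tasks such as similarity search and motif discovery tractable.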