Skip to main content
eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations bannerUC Riverside

Efficient Algorithms for High Dimensional Data Mining

Abstract

Data mining and knowledge discovery has attracted research interest in the last decade. The size and complexity of real world data is dramatically increasing and although new efficient algorithms to deal with such data are constantly being proposed, the mining of high dimensional data still presents a challenge. In this dissertation, several novel algorithms are proposed to handle such datasets. These algorithms are applied to domains as diverse as electrocardiography (ECG), electroencephalography (EEG), human DNA sequencing, protein sequencing, stock market data, gesture recognition data, motion capture data, accelerometer data, audio data, image data, handwritten manuscripts, etc. This dissertation contributes to the data mining community in three ways:

Firstly, we propose a novel algorithm for searching for the nearest neighbor in time series data by using multi-level lower bounding techniques and other speed-up techniques. The proposed algorithm, called UCRSuite, is faster than the previous state-of-the-art by several orders of magnitude. Because search algorithms are primitive and a bottleneck in complex data mining algorithms, this contribution is likely to make a significant impact. Secondly, we propose two approximation algorithms to handle the high dimensional data. A fast shapelet discovery algorithm, called FastShapelet, has been proposed to discover approximate shapelets, which are as accurate as those found by an exact search. In addition, we show an unsupervised algorithm, called DocMotif, which can discover similar figures from given manuscripts. The proposed algorithms are faster than the best known algorithms by two or three orders of magnitude and the discovered results are not measurably different from the exact algorithm. Moreover, in the second work, a detailed mathematical analysis for bounding an error is provided.

In my final contribution, we show that in order to create a useful clustering of a single time series, an algorithm must have the freedom to ignore some data. We propose a Minimum Description Length based time series clustering algorithm that has this ability. My results demonstrate that not only is the proposed algorithm parameter-free, but it is also efficient and effective for time series clustering.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View