eScholarship
Open Access Publications from the University of California

Time Series Retrieval: Indexing and Mining Large Datasets

  • Author(s): Shieh, Jin-Wien
  • Advisor(s): Keogh, Eamonn
Abstract

As advances in science and technology have steadily increased both the volume of available data and users' ability to monitor, record, and examine it, data mining has become a common and necessary toolset for gaining insight into this influx. In this dissertation, we study methods for overcoming the characteristic challenges of scale when performing similarity search on large time series datasets.

We introduce a novel multi-resolution symbolic representation for time series called indexable Symbolic Aggregate approXimation (iSAX). The iSAX representation allows time series to be indexed in order to facilitate similarity search. We demonstrate its utility through experimental evaluation on a wide range of datasets and show how exact and approximate search can be used in conjunction to expedite higher-level data mining operators that solve real-world problems. The datasets we consider are larger than any other in the current literature and, notably, our results confirm the notion that even simple measures perform exceedingly well when the training set becomes very large.
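
The multi-resolution idea can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`znorm`, `paa`, `breakpoints`, `sax_word`, `demote`) are hypothetical, the series length is assumed to be divisible by the number of segments, and cardinalities are assumed to be powers of two so that a coarser symbol can be read off as the leading bits of a finer one.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znorm(series):
    """Rescale a series to zero mean and unit variance."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each equal-length segment.

    Assumption: len(series) is divisible by `segments`.
    """
    size = len(series) // segments
    return [mean(series[i * size:(i + 1) * size]) for i in range(segments)]


def breakpoints(cardinality):
    """Cut points that split a standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / cardinality) for k in range(1, cardinality)]


def sax_word(series, segments, bits):
    """Symbolic word with 2**bits symbols per segment (0 = lowest region)."""
    cuts = breakpoints(2 ** bits)
    return [bisect_right(cuts, v) for v in paa(znorm(series), segments)]


def demote(symbol, from_bits, to_bits):
    """Reduce a symbol to a coarser cardinality by dropping low-order bits."""
    return symbol >> (from_bits - to_bits)


series = [0, 1, 2, 3, 4, 5, 6, 7]
fine = sax_word(series, segments=4, bits=3)    # cardinality 8
coarse = [demote(s, 3, 2) for s in fine]       # cardinality 4, no recomputation
```

Because the equiprobable breakpoints at cardinality 4 are a subset of those at cardinality 8, the demoted word agrees with the word computed directly at the coarser resolution. This nesting is what lets symbols of different cardinalities be compared within a single index.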

Another aspect of our research considers using similarity search to perform classification under limited computation time and variable response rates. In such contexts, anytime algorithms, which accommodate variable response times by trading quality of response against computation time, have been found to be especially useful. We present a generalized framework that uses a scoring function to estimate the intermediate result quality of an object being classified. Our contribution extends existing anytime algorithms to concurrent queries by dynamically scheduling computational resources for each object. We show that without such inter-object consideration, computation time is poorly allocated and performance is reduced.
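
As a rough illustration of scheduling computation across concurrent anytime queries, the sketch below gives each unit of a shared budget to the query whose intermediate result currently looks least reliable. Everything in it is an assumption made for illustration: the `AnytimeQuery` class, the one-dimensional nearest-neighbour classifier, the margin-based score, and the lowest-score-first policy are hypothetical stand-ins, not the scoring function or scheduler from the dissertation.

```python
import heapq
from math import inf


class AnytimeQuery:
    """A nearest-neighbour classification that can be interrupted at any step."""

    def __init__(self, name, value, training):
        self.name = name
        self.value = value
        self.training = training  # list of (feature, label) pairs
        self.pos = 0              # next training instance to examine
        self.best = {}            # label -> smallest distance seen so far

    def step(self):
        """Examine one more training instance."""
        x, label = self.training[self.pos]
        self.pos += 1
        d = abs(x - self.value)
        if d < self.best.get(label, inf):
            self.best[label] = d

    def done(self):
        return self.pos >= len(self.training)

    def label(self):
        """Current best guess, or None if nothing has been examined yet."""
        return min(self.best, key=self.best.get) if self.best else None

    def score(self):
        """Assumed quality estimate: margin between the two closest classes."""
        ds = sorted(self.best.values())
        return ds[1] - ds[0] if len(ds) >= 2 else 0.0


def schedule(queries, budget):
    """Spend `budget` steps, always refining the lowest-scoring unfinished query."""
    heap = [(q.score(), i) for i, q in enumerate(queries)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        queries[i].step()
        budget -= 1
        if not queries[i].done():
            heapq.heappush(heap, (queries[i].score(), i))
    return {q.name: q.label() for q in queries}


training = [(0.0, "a"), (10.0, "b"), (1.0, "a"), (9.0, "b")]
queries = [AnytimeQuery("q1", 0.5, training), AnytimeQuery("q2", 5.2, training)]
result = schedule(queries, budget=8)
```

In this toy run, `q1` quickly builds a wide margin and is set aside, so the ambiguous `q2` receives the next steps. A scheduler that ignored the other queries would instead spend time on `q1` even though its answer is unlikely to change.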
