Skip to main content
eScholarship
Open Access Publications from the University of California

UC Riverside

UC Riverside Electronic Theses and Dissertations bannerUC Riverside

The Matrix Profile: Scalable Algorithms and New Primitives for Time Series Data Mining

Abstract

Primitives such as motifs, discords, shapelets, etc., are widely used in time series data mining. A versatile approach to find these primitives is through computing similarity joins for time series subsequences. The last decade has seen a significant amount of research effort on similarity joins in domains such as text and DNA, but not much progress has been made on similarity joins for time series subsequences. The lack of progress is probably due to the daunting nature of the problem: for even modest sized datasets the brute-force algorithm can take months to complete. Typical speed-up techniques such as indexing and lower-bounding at best produce only one or two orders of magnitude speedup, and their performance can degrade significantly in the face of an unfavorable dataset.

In this dissertation we introduce a suite of algorithms that significantly expand the limit of scalability for time series subsequences similarity joins. These algorithms not only provide the fastest exact solution to the discovery of time series motifs, one of the most extensively-studied time series data mining primitives in the last decade, but also allow the invention of new primitives that the state-of-the-art could not support.

Specifically, we present a novel batch algorithm which, when combined with the GPU framework, can find the full set of exact motifs on a dataset two orders of magnitude larger than the literature limit in feasible time. A novel fast-converging anytime algorithm further expands this scalability, allowing motif discovery for million-scale datasets to be performed in interactive sessions with an off-the-shelf desktop. We also show how these techniques can be combined with a novel lower bound to allow fast motif discovery in the presence of missing data. Furthermore, we introduce time series chains, a new time series data mining primitive that can capture the evolution of systems and help predict the future.

We demonstrate the utility of our ideas in domains as diverse as seismology, entomology, human activity monitoring, electrical power-demand monitoring and medicine.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View