This dissertation is motivated by the goal of enabling large-scale time series data mining tasks to produce results that are more accurate, more reproducible, and, where desired, tailored to the user’s specific needs. To that end, we contribute to the literature in three parts. Each part addresses an active area of research, and all three are unified by a common theme: reducing errors in time series data mining by learning constraints on a model’s flexibility.
The first body of work concerns Dynamic Time Warping (DTW), a highly competitive distance measure for most time series data mining problems. Obtaining the best performance from DTW requires setting its only parameter, the maximum amount of warping allowed (the warping window width). This parameter gives DTW the flexibility to handle data that are locally out of phase; however, the DTW algorithm sometimes exploits this flexibility to produce pathological and unwanted results. We demonstrate the importance of setting DTW’s warping window width correctly in order to constrain this flexibility, and we propose novel methods to learn this parameter in both supervised and unsupervised settings.
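To make the role of the warping window concrete, the following is a minimal sketch of DTW constrained by a band of fixed width. It is an illustration under stated assumptions, not the implementation used in this dissertation: the function name `dtw_distance`, the parameter name `window`, the squared-difference local cost, and the final square root are all choices made for this example.

```python
import math

def dtw_distance(a, b, window):
    """DTW distance between numeric sequences a and b.

    `window` (a hypothetical name for this sketch) is the maximum allowed
    offset |i - j| between aligned indices, i.e. the warping window width.
    """
    n, m = len(a), len(b)
    # The band must be at least |n - m| wide, or no path can reach the end.
    w = max(window, abs(n - m))
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the band are filled; the rest stay infinite.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])

a = [0, 1, 0, 0]
b = [0, 0, 1, 0]
no_warping = dtw_distance(a, b, window=0)   # reduces to Euclidean distance
one_step = dtw_distance(a, b, window=1)     # peak may shift by one position
```

With `window=0` the two sequences are compared point by point and the shifted peak is penalized (distance √2); with `window=1` the peaks are aligned and the distance drops to 0. A wider window gives the alignment more freedom in exactly this way, which is the flexibility that can help or, when set too large, be exploited to produce unwanted alignments.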
The second body of work concerns time series motif discovery, perhaps the most widely used primitive in time series data mining. We point out that current definitions of motif discovery are limited and can create a mismatch between the user’s intent or expectations and the outcome of the motif search. We explain the reasons behind these issues and introduce a novel and general framework to address them.
The last body of work concerns making more time series data sets and baseline results publicly available, both for gauging progress and for comparing rival approaches, in the spirit of reproducible research. We expand the UCR Time Series Archive, an important resource in the time series data mining community, from 85 data sets in the previous (Fall 2015) release to 128 data sets in the Fall 2018 release. Creating benchmark results for this archive required 61,041,100,000,000 DTW comparisons, far more than the number of DTW comparisons that have appeared in all research papers combined. Beyond expanding this valuable resource, we offer pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive.