Supervised and Unsupervised Discovery of Structures in Large Data Archives
- Author(s): Hao, Yuan
- Advisor(s): Keogh, Eamonn
- et al.
Most domains of human interest now generate enormous volumes of diverse data (text, time series, images, audio, video, etc.) every day. Extracting useful knowledge from such data efficiently is an essential task for the data mining community. A general framework that discovers useful structures without domain-dependent tuning can also reduce the costly manual effort required of domain experts such as biologists, neurologists, and cardiologists.
This dissertation first discusses definitions and representations for finding useful structures (for example, audio fingerprints and audio motifs) in audio archives, and then introduces scalable algorithms that allow application to diverse, massive data archives. Audio fingerprints are "prototypical" subsequences that represent a class, differentiate it from other classes, and can be used to identify future unknown instances. We propose a supervised approach that classifies animal sounds in the visual space by treating the texture of their spectrograms as an acoustic fingerprint, with a recently introduced parameter-free texture measure serving as the distance measure. Our bioacoustic framework for audio fingerprint discovery assists biologists in automatically classifying different species of insects and, in follow-up work by an independent research group, in detecting the presence of elephants in noisy environments.
Motif discovery, in contrast, is an unsupervised process that finds occurrences of repeated patterns without any prior knowledge of the patterns, not even their length. Audio motifs (near-duplicate pairs) are the most similar segments among all the subsequences in an audio stream; however, they must be carefully defined to prevent pathological solutions. We propose a novel probabilistic early-abandoning approach that casts the search for audio motifs into an anytime framework. We demonstrate that our algorithm applies to diverse domains (mouse vocalizations, wild animal sounds, music, and human speech) without requiring any domain-specific tuning.
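To make the idea of a motif pair concrete, the following is a minimal sketch and not the dissertation's algorithm. It uses a plain brute-force search over a one-dimensional numeric series with deterministic early abandoning, whereas the approach described above is probabilistic and anytime. The function names, the use of raw Euclidean distance, and the fixed motif length `m` are all assumptions made for illustration. Excluding overlapping subsequences is one common way to rule out a degenerate solution in which a segment matches a slightly shifted copy of itself.

```python
import math

def early_abandon_sq_dist(a, b, best_so_far):
    """Squared Euclidean distance, abandoned once it reaches best_so_far."""
    total = 0.0
    for x, y in zip(a, b):
        total += (x - y) ** 2
        if total >= best_so_far:
            return math.inf  # this pair cannot beat the current best pair
    return total

def find_motif_pair(series, m):
    """Return (i, j, distance) for the closest pair of non-overlapping
    length-m subsequences (hypothetical helper, assumed fixed length m)."""
    n = len(series) - m + 1
    best_i, best_j, best_sq = None, None, math.inf
    for i in range(n):
        # Starting j at i + m skips overlapping (trivial) matches.
        for j in range(i + m, n):
            d = early_abandon_sq_dist(series[i:i + m], series[j:j + m], best_sq)
            if d < best_sq:
                best_i, best_j, best_sq = i, j, d
    return best_i, best_j, math.sqrt(best_sq)

# Usage: the pattern [0, 1, 2] occurs at offsets 0 and 6.
series = [0, 1, 2, 9, 9, 9, 0, 1, 2, 5]
print(find_motif_pair(series, 3))
```

The brute-force search is quadratic in the series length, so it only illustrates the definition; early abandoning reduces the cost of each comparison but not the number of comparisons.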
Lastly, we propose a never-ending learning framework for time series in which an agent examines an unbounded stream of data and occasionally asks a teacher (which may be a human or an algorithm) for a label. We demonstrate the utility of our ideas with experiments that consider real-world problems in domains as diverse as medicine, entomology, wildlife monitoring, and human behavior analysis.
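The agent-and-teacher loop can be sketched as follows. This is an illustrative toy and not the framework's actual design: the class name `NeverEndingLearner`, the distance threshold, and the rule "query the teacher once an unlabeled pattern has been seen `min_repeats` times" are assumptions chosen only to show how an agent might ask for labels occasionally instead of on every observation.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class NeverEndingLearner:
    """Toy agent (hypothetical): labels known patterns itself and queries
    the teacher only when an unknown pattern recurs."""

    def __init__(self, teacher, threshold, min_repeats=2):
        self.teacher = teacher          # callable: prototype -> label
        self.threshold = threshold      # assumed match radius
        self.min_repeats = min_repeats  # sightings required before a query
        self.concepts = []              # learned (prototype, label) pairs
        self.pending = []               # unlabeled [prototype, count] entries
        self.queries = 0

    def observe(self, window):
        # 1. Try to label the window with an already learned concept.
        for proto, label in self.concepts:
            if euclidean(window, proto) <= self.threshold:
                return label
        # 2. Otherwise, check whether it repeats a pending unknown pattern.
        for idx, entry in enumerate(self.pending):
            if euclidean(window, entry[0]) <= self.threshold:
                entry[1] += 1
                if entry[1] >= self.min_repeats:
                    label = self.teacher(entry[0])  # occasional query
                    self.queries += 1
                    self.concepts.append((entry[0], label))
                    del self.pending[idx]
                    return label
                return None
        # 3. A new pattern: remember it and wait to see whether it recurs.
        self.pending.append([list(window), 1])
        return None

# Usage: the teacher here is an algorithm, but it could wrap a human prompt.
teacher = lambda p: "up" if p[-1] > p[0] else "down"
agent = NeverEndingLearner(teacher, threshold=0.5)
stream = [[0, 1, 2], [5, 4, 3], [0, 1, 2.1], [0, 1, 2], [5, 4, 3.1]]
labels = [agent.observe(w) for w in stream]
print(labels, agent.queries)
```

In this sketch the stream is a finite list for demonstration; with an unbounded stream, `observe` would simply be called on each new window as it arrives.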