eScholarship
Open Access Publications from the University of California

Exploiting Intrinsic Clustering Structure in Discrete-Valued Data Sets for Efficient Knowledge Discovery in the Presence of Missing Data

  • Author(s): Strnadova-Neeley, Veronika
  • Advisor(s): Gilbert, John R
  • et al.
Abstract

Scalable algorithm design has become central in the era of large-scale data analysis. The vast amounts of data pouring in from diverse application domains, such as bioinformatics, recommender systems, sensor systems, and social networks, cannot be analyzed efficiently with the many data mining and statistical tools that were designed for small-scale settings. Designing new methods for efficient data analysis is an ongoing challenge for the data mining, machine learning, and statistics communities. Compounding this challenge is the noisy and incomplete nature of real-world data sets. Research scientists and industry practitioners alike need to find meaningful patterns in data whose missing value rates are often as high as 99% and that also contain errors that can obstruct accurate analysis.

My contribution to this line of research is the design of new algorithms for scalable clustering, data reduction, and similarity evaluation that exploit the inherent clustering structure of the input data to overcome the challenges posed by large numbers of missing entries. I demonstrate that, by focusing on the underlying clustering properties of the data, we can improve the efficiency of several data analysis methods on sparse, discrete-valued data sets. I highlight new methods that I have developed with my collaborators for three diverse knowledge discovery tasks: (1) clustering genetic markers into linkage groups, (2) reducing large-scale genetic data to a much smaller, more accurate representative data set, and (3) computing similarity between users in recommender systems. In each case, I point out how the underlying clustering structure can be used to design more efficient algorithms, even when missing value rates are high.
