Advances in technology have made it far easier to acquire data from diverse sources, such as sensors and imaging techniques, at high throughput. The collected data tend to be extremely large, and processing such volumes is computationally intensive. Clustering techniques simplify the data by partitioning them into meaningful groups, making it possible to analyze a large volume of data in a relatively short time and with high accuracy.
This dissertation introduces several novel approaches that improve the performance of semi-supervised and unsupervised clustering by utilizing the concept of locality. It makes two specific contributions:
1. Magnetically Affected Paths (MAP): A novel approach that applies user-defined constraints through local manipulations in semi-supervised clustering. MAP refines the clustering results by increasing the weights of the edges connecting objects in the neighborhood of a cannot-link constraint and decreasing the weights of the edges connecting objects in the neighborhood of a must-link constraint. The MAPClus framework introduced in this dissertation integrates the MAP concept into clustering algorithms through a three-step algorithm. The efficacy of the algorithm is demonstrated through extensive experimental evaluations on several synthetic and real datasets.
2. Wavelet-Based Similarity Measures: A family of similarity measures that exploit the ability of the wavelet transform to analyze the spectral components of physicochemical properties, providing a more sensitive way to measure the similarity of biological molecules. We demonstrate the validity of our wavelet-based similarity measures by employing them in two different protein clustering applications. In the first, we use the measures to identify relationships between mutant proteins obtained by alanine scanning. In the second, we assess how accurately our methods recognize the connection between charge density and electrostatic potential in homology models.
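
The edge reweighting described in the first contribution can be illustrated with a minimal sketch. It is not the MAPClus implementation: the function name `adjust_edge_weights`, the scaling parameter `factor`, and the definition of a constraint's neighborhood (every edge incident to either constrained object) are assumptions made only for illustration. Edge weights are treated as distances, so a larger weight pushes objects apart.

```python
def adjust_edge_weights(weights, must_link, cannot_link, factor=2.0):
    """Return a copy of `weights` with edges near constraints rescaled.

    weights: dict mapping frozenset({u, v}) -> positive, distance-like weight.
    must_link / cannot_link: lists of (a, b) object pairs.
    Assumption (hypothetical): the neighborhood of a constraint (a, b) is
    every edge incident to a or b; `factor` is an illustrative parameter.
    """
    adjusted = dict(weights)
    for edge in weights:
        for a, b in must_link:
            if a in edge or b in edge:
                adjusted[edge] /= factor  # shrink distances near a must-link
        for a, b in cannot_link:
            if a in edge or b in edge:
                adjusted[edge] *= factor  # stretch distances near a cannot-link
    return adjusted


# A path graph a - b - c - d - e with unit edge weights.
weights = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"b", "c"}): 1.0,
    frozenset({"c", "d"}): 1.0,
    frozenset({"d", "e"}): 1.0,
}
adjusted = adjust_edge_weights(
    weights, must_link=[("a", "b")], cannot_link=[("d", "e")]
)
```

In this toy graph the edges touching `a` or `b` are halved and the edges touching `d` or `e` are doubled, so any clustering algorithm run on the adjusted weights is nudged toward honoring both constraints locally.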