Automatic Co-Clustering for Social Network and Medical Data
- Author(s): Casse, Juan Ignacio
- Advisor(s): Shelton, Christian
- et al.
Clustering is fundamental to many important human endeavors. In machine learning parlance, it is an unsupervised learning tool for discovering patterns in data. Specifically, its goal is to find groups of objects in the data that are similar in some sense. Fields where clustering is used include medical diagnostics, bioinformatics, social network analysis and market analysis. Clustering is also used "behind the scenes" as a preprocessing step for other tasks, such as Web search and recommender systems.
Co-clustering can be viewed as a generalization of clustering to a wider range of data. While clustering methods work on affinity data (data describing similarity between objects), co-clustering methods can also work on relational data (data describing relationships between objects). An example of affinity data is customers in market analysis, where each customer is described by a set of features (attributes), such as age, gender and income. A measure of similarity or dissimilarity between pairs of customers, for example Euclidean distance, can be computed from their features. An example of relational data is persons in a social network, where a link between two persons indicates that they are friends. Here persons are compared on their connections to other persons and not on their intrinsic features.
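The distinction between affinity data and relational data can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names, the feature values, the friendship links, and in particular the neighbor-overlap score, which is a simple stand-in for comparing persons by their connections and is not the criterion function developed in the dissertation.

```python
import math

# Affinity data: each customer is a feature vector (age, income in thousands).
# All values are invented for illustration.
customers = {
    "ana": (30.0, 50.0),
    "ben": (33.0, 54.0),
    "cho": (60.0, 20.0),
}


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Relational data: an undirected friendship graph stored as adjacency sets.
# There are no per-person features here, only links.
friends = {
    "ana": {"cho", "dan"},
    "ben": {"cho", "dan"},
    "cho": {"ana", "ben"},
    "dan": {"ana", "ben"},
}


def neighbor_overlap(graph, p, q):
    """Jaccard overlap of the neighbor sets of p and q (illustrative stand-in)."""
    union = graph[p] | graph[q]
    return len(graph[p] & graph[q]) / len(union) if union else 0.0


# Affinity view: ana and ben have close feature vectors, cho is far away.
d_ana_ben = euclidean(customers["ana"], customers["ben"])
d_ana_cho = euclidean(customers["ana"], customers["cho"])

# Relational view: ana and ben are not friends with each other, yet they
# have the same friends, so they look alike by their connections.
s_ana_ben = neighbor_overlap(friends, "ana", "ben")
s_ana_cho = neighbor_overlap(friends, "ana", "cho")

print(d_ana_ben, d_ana_cho, s_ana_ben, s_ana_cho)
```

In the relational view, two persons can be grouped together without sharing any intrinsic attribute and without being linked to each other, which is why feature-based distances do not apply directly to this kind of data.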
In this dissertation we study the application of co-clustering to social network data and to medical data. In particular, we present a general formulation of co-clustering that fits most methods in the literature and provide solutions to three main problems: (1) clustering relational data under regular equivalence in social network analysis, (2) finding a symmetric clustering of asymmetric data and (3) clustering patients based on high-dimensional, time-varying, sparse physiologic data.
We define implicit similarity measures, by way of criterion functions for co-clustering, that solve the problems we target. We demonstrate and compare our co-clustering methods on real-world data sets.