Clustering: Algorithm, Optimization and Inference
- Author(s): Zhang, Zhanpan
- Advisor(s): Cui, Xinping; Jeske, Daniel R.; et al.
Clustering is rapidly becoming a powerful data mining technique and has been broadly applied in many domains. Data are usually arranged in a matrix of rows and columns in which each cell holds a single real number. This dissertation develops clustering algorithms that incorporate statistical inference for the following two scenarios.
First, when each cell of the data matrix contains a scatter plot rather than a single numerical value, existing clustering methods are no longer applicable. In this dissertation, we develop both hierarchical clustering and co-clustering procedures to handle a data matrix of scatter plots. We introduce a dissimilarity statistic based on "data depth" to measure the discrepancy between two bivariate distributions without oversimplifying the underlying pattern. We also propose novel painting metrics and construct heat maps to allow visualization of the clusters. We demonstrate the utility and power of the proposed clustering methods through simulation studies and an application to a microbe-host interaction study.
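To make the idea of a depth-based dissimilarity concrete, the following is a minimal sketch and not the statistic developed in the dissertation, which the abstract does not specify. It assumes Mahalanobis depth as the depth function and a depth-based quality index of the kind introduced by Liu and Singh, Q(F, G) = P(D(X; F) <= D(Y; F)), which is close to 0.5 when two samples come from the same distribution. The function names and the symmetrized form |Q(A, B) - 0.5| + |Q(B, A) - 0.5| are illustrative choices.

```python
import random


def fit_moments(sample):
    """Return the mean and covariance entries of a bivariate sample of (x, y) pairs."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    sxx = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    return mx, my, sxx, syy, sxy


def mahalanobis_depth(point, moments):
    """Mahalanobis depth 1 / (1 + d^2); an assumed depth, chosen for simplicity."""
    mx, my, sxx, syy, sxy = moments
    det = sxx * syy - sxy ** 2  # zero for a collinear sample (not handled here)
    dx, dy = point[0] - mx, point[1] - my
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return 1.0 / (1.0 + d2)


def quality_index(reference, other):
    """Estimate P(D(X; ref) <= D(Y; ref)), counting ties as one half."""
    moments = fit_moments(reference)
    ref_depths = [mahalanobis_depth(p, moments) for p in reference]
    other_depths = [mahalanobis_depth(p, moments) for p in other]
    score = 0.0
    for dr in ref_depths:
        for do in other_depths:
            if dr < do:
                score += 1.0
            elif dr == do:
                score += 0.5
    return score / (len(ref_depths) * len(other_depths))


def depth_dissimilarity(a, b):
    """Illustrative symmetric dissimilarity in [0, 1] between two scatter plots."""
    return abs(quality_index(a, b) - 0.5) + abs(quality_index(b, a) - 0.5)


def gaussian_cloud(n, cx, cy, rng):
    """Hypothetical test data: n points scattered around the center (cx, cy)."""
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)) for _ in range(n)]


rng = random.Random(0)
cell_a = gaussian_cloud(100, 0.0, 0.0, rng)
cell_b = gaussian_cloud(100, 0.0, 0.0, rng)  # same distribution as cell_a
cell_c = gaussian_cloud(100, 6.0, 6.0, rng)  # shifted distribution
print(depth_dissimilarity(cell_a, cell_b))
print(depth_dissimilarity(cell_a, cell_c))
```

Computing this quantity for every pair of cells would give a dissimilarity matrix that a standard hierarchical clustering routine could take as input. Because the comparison uses depth ranks rather than a single summary such as the mean or a correlation coefficient, each scatter plot is not reduced to one number before the cells are compared.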
Second, when spatial information is embedded in the data matrix, the order of rows and columns cannot be changed. Model-based spatial co-clustering has not been well studied. In this dissertation, we develop a co-clustering method for spatial data using a Generalized Linear Mixed Model (GLMM). To avoid the high computational cost of global optimization, we propose a heuristic optimization algorithm that searches for a near-optimal co-clustering. A sampling strategy is introduced to capture as much of the spatial information available in the sparse data as possible. For an application in Integrated Pest Management (IPM), we combine the spatial co-clustering technique with a statistical inference method to assess pest density more accurately. We demonstrate the utility and power of the proposed pest assessment procedure through simulation studies and apply it to a study of the persea mite (Oligonychus perseae), a pest of avocado trees.