Rocke, David; Woodruff, David

Computational Connections Between Robust Multivariate Analysis and Clustering

2002

Abstract

In this paper we examine some of the relationships between two important optimization problems that arise in statistics: robust estimation of multivariate location and shape parameters and maximum likelihood assignment of multivariate data to clusters. We offer a synthesis and generalization of computational methods reported in the literature. These connections are important because they can be exploited to support effective robust analysis of large data sets. Recognition of the connections between estimators for clusters and outliers immediately yields one important result that is demonstrated by Rocke and Woodruff (2002); namely, the ability to detect outliers can be improved a great deal using a combined perspective from outlier detection and cluster identification. One can achieve practical breakdown values that approach the theoretical limits by using algorithms for both problems. It turns out that many configurations of outliers that are hard to detect using robust estimators are easily detected using clustering algorithms. Conversely, many configurations of small clusters that could be considered outliers are easily distinguished from the main population using robust estimators even though clustering algorithms fail. There are assumed to be $n$ data points in $\mathbb{R}^p$ and we may refer to them sometimes as a set of column vectors, $\{X_i\} = \{X_i \mid i = 1, 2, \ldots, n\}$. We are concerned here primarily with combinatorial estimators and restrict ourselves to those that are affine equivariant.
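
As an illustration of the affine equivariance requirement mentioned above, the following minimal sketch checks numerically that a location estimate t and a shape estimate C transform as t -> A t + b and C -> A C A^T under the map x -> A x + b, which leaves Mahalanobis distances unchanged. It uses the ordinary sample mean and covariance, which are affine equivariant but not robust; they stand in here only for illustration and are not the combinatorial estimators studied in the paper. The function names, the matrix A, and the vector b are hypothetical, and data points are stored as rows rather than column vectors.

```python
import numpy as np

def location_and_shape(X):
    """Sample mean and covariance of the rows of X (n points in R^p).

    Assumption: these non-robust estimators are used only because they
    are a simple, well-known affine equivariant pair.
    """
    return X.mean(axis=0), np.cov(X, rowvar=False)

def mahalanobis_sq(X, t, C):
    """Squared Mahalanobis distance of each row of X from t under shape C."""
    diff = X - t
    return np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))

# Hypothetical affine map x -> A x + b; A is nonsingular (det = 5).
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, -1.0],
              [1.0, 0.0, 3.0]])
b = np.array([1.0, -2.0, 0.5])
Y = X @ A.T + b  # apply the map to every row

t_x, C_x = location_and_shape(X)
t_y, C_y = location_and_shape(Y)

# Equivariance of location and shape, and invariance of the distances.
location_ok = np.allclose(t_y, A @ t_x + b)
shape_ok = np.allclose(C_y, A @ C_x @ A.T)
distances_ok = np.allclose(mahalanobis_sq(X, t_x, C_x),
                           mahalanobis_sq(Y, t_y, C_y))
print(location_ok, shape_ok, distances_ok)
```

Because the distances are unchanged by the transformation, any outlier rule based on them flags the same points before and after the data are rescaled, rotated, or shifted, which is the practical reason for restricting attention to affine equivariant estimators.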

Institute for Data Analysis and Visualization
