Hamerly, Greg; Elkan, Charles

Learning the k in k-means

2002

Abstract

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present a new algorithm for choosing k that is based on a new statistical test for the hypothesis that a subset of data follows a Gaussian distribution. The algorithm runs k-means with increasing k until the test fails to reject the hypothesis that the data assigned to each k-means center are Gaussian. We present results from experiments on synthetic and real-world data showing that the algorithm works well and outperforms a recent method based on the BIC penalty for model complexity.
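The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it handles one-dimensional data only, it uses the Jarque-Bera statistic as a stand-in for the paper's new Gaussianity test, and it initializes k-means from evenly spaced quantiles so the run is deterministic. The names `looks_gaussian`, `kmeans_1d`, and `learn_k` are hypothetical.

```python
from statistics import NormalDist

# 5% critical value of the chi-square distribution with 2 degrees of freedom.
CHI2_2DF_95 = 5.991


def looks_gaussian(xs, min_size=8):
    """Jarque-Bera check (a stand-in test, not the paper's).

    Returns True when normality is NOT rejected at the 5% level.
    """
    n = len(xs)
    if n < min_size:
        return True  # assumption: too few points to test, so accept
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    if m2 == 0.0:
        return True  # degenerate cluster, nothing left to split
    skew = sum((x - mu) ** 3 for x in xs) / n / m2 ** 1.5
    kurt = sum((x - mu) ** 4 for x in xs) / n / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return jb < CHI2_2DF_95


def kmeans_1d(xs, k, iters=100):
    """Lloyd's algorithm in one dimension with quantile initialization."""
    s = sorted(xs)
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        # An empty cluster keeps its previous center.
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters


def learn_k(xs, max_k=10):
    """Increase k until every cluster passes the Gaussianity check."""
    centers = []
    for k in range(1, max_k + 1):
        centers, clusters = kmeans_1d(xs, k)
        if all(looks_gaussian(c) for c in clusters):
            return k, centers
    return max_k, centers  # assumption: give up at max_k


# Deterministic demo data: Gaussian quantiles instead of random draws.
nd = NormalDist()
one_blob = [nd.inv_cdf((i + 0.5) / 200) for i in range(200)]
two_blobs = [x - 10 for x in one_blob] + [x + 10 for x in one_blob]

k_one, centers_one = learn_k(one_blob)
k_two, centers_two = learn_k(two_blobs)
print(k_one, k_two)
```

The stopping rule is the part taken from the abstract: k grows only while some cluster is rejected as non-Gaussian. The choice of test, the initialization, and the `max_k` cap are illustrative and could be swapped without changing that structure.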

Pre-2018 CSE ID: CS2002-0716

Department of Computer Science & Engineering
