Learning the k in k-means
Abstract
When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present a new algorithm for choosing k that is based on a new statistical test for the hypothesis that a subset of data follows a Gaussian distribution. The algorithm runs k-means with increasing k until the test fails to reject the hypothesis that the data assigned to each k-means center are Gaussian. We present results from experiments on synthetic and real-world data showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity.
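The loop described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify the statistical test, so a Jarque-Bera normality test is swapped in as a stand-in, the data are restricted to one dimension, and k is simply incremented by one each round. The names `kmeans_1d`, `jarque_bera_pvalue`, `choose_k`, and `gaussian_shaped`, and the defaults `alpha=0.01` and `k_max=10`, are all hypothetical choices made for this illustration.

```python
import math
from statistics import NormalDist


def kmeans_1d(xs, k, iters=100):
    """Lloyd's algorithm on 1-D data, with deterministic quantile initialization."""
    s = sorted(xs)
    n = len(s)
    centers = [s[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters


def jarque_bera_pvalue(xs):
    """Stand-in normality test (assumption: NOT the test proposed in the paper)."""
    n = len(xs)
    if n < 8:
        return 1.0  # too few points to reject anything
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0.0:
        return 1.0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    # JB is asymptotically chi-square with 2 degrees of freedom,
    # whose survival function is exp(-x / 2).
    return math.exp(-jb / 2.0)


def choose_k(xs, alpha=0.01, k_max=10):
    """Increase k until no cluster's Gaussian hypothesis is rejected at level alpha."""
    for k in range(1, k_max + 1):
        centers, clusters = kmeans_1d(xs, k)
        if all(jarque_bera_pvalue(c) >= alpha for c in clusters):
            return k, centers
    return k_max, centers


def gaussian_shaped(mu, sigma, n):
    """Deterministic Gaussian-shaped sample built from evenly spaced normal quantiles."""
    nd = NormalDist(mu, sigma)
    return [nd.inv_cdf((i + 0.5) / n) for i in range(n)]


# Two well-separated Gaussian-shaped groups on the real line.
data = gaussian_shaped(0.0, 1.0, 200) + gaussian_shaped(20.0, 1.0, 200)
k, centers = choose_k(data)
print(k, centers)
```

On this toy input the pooled data are strongly bimodal, so the Gaussian hypothesis is rejected at k = 1, and each of the two groups passes the test at k = 2, where the loop stops. A faithful implementation would use the paper's own test and handle multivariate data.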
Pre-2018 CSE ID: CS2002-0716