Institute for Data Analysis and Visualization
Sampling and Subsampling for Cluster Analysis in Data Mining, with Applications to Sky Survey Data
- Author(s): Rocke, David
- Dai, Jian
- et al.
This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a wide variety of data mining methods, allowing them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable and produces classifications comparable to previous work on these data using supervised clustering.
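The general idea of pairing a mixture likelihood fit with sampling can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not detail. It assumes a Gaussian mixture with diagonal covariances, a single uniform random sample, a fixed number of EM iterations, and a simple quantile-based initialization; the function names (`log_prob`, `fit_mixture`, `cluster_by_sampling`) are hypothetical. The mixture is fitted on the sample only, and every object in the full data set is then assigned to its most probable component.

```python
import numpy as np


def log_prob(x, weights, means, variances):
    """Log of (weight * diagonal Gaussian density) for each point and component.

    x: (n, d); means, variances: (k, d); weights: (k,). Returns an (n, k) array.
    """
    diff = x[:, None, :] - means[None, :, :]
    ll = -0.5 * np.sum(diff**2 / variances + np.log(2 * np.pi * variances), axis=2)
    return ll + np.log(weights)


def fit_mixture(x, k, n_iter=100):
    """Fit a k-component diagonal Gaussian mixture to x by plain EM (illustrative)."""
    n, _ = x.shape
    # Assumed initialization: evenly spaced quantiles along the highest-variance axis.
    order = np.argsort(x[:, np.argmax(x.var(axis=0))])
    picks = order[((np.arange(k) + 0.5) * n / k).astype(int)]
    means = x[picks].copy()
    variances = np.tile(x.var(axis=0) + 1e-6, (k, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities, computed stably in log space.
        lp = log_prob(x, weights, means, variances)
        lp -= lp.max(axis=1, keepdims=True)
        resp = np.exp(lp)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and per-dimension variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        second_moment = (resp.T @ (x**2)) / nk[:, None]
        variances = np.maximum(second_moment - means**2, 1e-6)
    return weights, means, variances


def cluster_by_sampling(x, k, sample_size, seed=0):
    """Fit the mixture on a random sample, then label the full data set."""
    rng = np.random.default_rng(seed)
    m = min(sample_size, len(x))
    idx = rng.choice(len(x), size=m, replace=False)
    params = fit_mixture(x[idx], k)
    labels = np.argmax(log_prob(x, *params), axis=1)
    return labels, params
```

In this sketch the cost of the iterative fit depends on the sample size rather than on the size of the full data set, which needs only one labeling pass. Calling `cluster_by_sampling(x, k=2, sample_size=500)` on an `(n, d)` array returns one label per row together with the fitted parameters.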