Open Access Publications from the University of California

Sampling and Subsampling for Cluster Analysis in Data Mining, with Applications to Sky Survey Data


This paper describes a clustering method for unsupervised classification of objects in large data sets. The new methodology combines the mixture likelihood approach with a sampling and subsampling strategy in order to cluster large data sets efficiently. This sampling strategy can be applied to a wide variety of data mining methods, allowing them to be used on very large data sets. The method is applied to the problem of automated star/galaxy classification for digital sky data and is tested using a sample from the Digitized Palomar Sky Survey (DPOSS) data. The method is quick and reliable, and it produces classifications comparable to previous work on these data using supervised clustering.
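The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of the general idea: fit a mixture model by maximum likelihood (EM) on a small random sample, then use the fitted model to classify every object in the full data set. Everything here is an assumption made for illustration, not the authors' algorithm: the function names (`fit_two_component_mixture`, `classify`, `demo`), the univariate two-component Gaussian model, the sample size of 500, and the synthetic data standing in for real survey measurements.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(sample, n_iter=100):
    """Fit a two-component 1-D Gaussian mixture to `sample` by EM (illustrative)."""
    ordered = sorted(sample)
    half = len(ordered) // 2
    # Crude initialisation: means of the lower and upper halves of the sample.
    means = [sum(ordered[:half]) / half,
             sum(ordered[half:]) / (len(ordered) - half)]
    overall_mean = sum(sample) / len(sample)
    overall_var = sum((x - overall_mean) ** 2 for x in sample) / len(sample)
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 0.
        resp0 = []
        for x in sample:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            resp0.append(p0 / (p0 + p1))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            r = resp0 if k == 0 else [1.0 - v for v in resp0]
            total = sum(r)
            weights[k] = total / len(sample)
            means[k] = sum(ri * x for ri, x in zip(r, sample)) / total
            var_k = sum(ri * (x - means[k]) ** 2 for ri, x in zip(r, sample)) / total
            variances[k] = max(var_k, 1e-6)  # guard against a collapsing component
    return weights, means, variances


def classify(data, weights, means, variances):
    """Assign each point to the component with the larger weighted density."""
    labels = []
    for x in data:
        scores = [w * normal_pdf(x, m, v)
                  for w, m, v in zip(weights, means, variances)]
        labels.append(0 if scores[0] >= scores[1] else 1)
    return labels


def demo(seed=0):
    """Fit on a 500-point sample, then classify all 10,000 synthetic points."""
    rng = random.Random(seed)
    # Synthetic stand-in for two object classes (not real survey data).
    truth = [0] * 6000 + [1] * 4000
    data = [rng.gauss(0.0, 1.0) if t == 0 else rng.gauss(6.0, 1.0) for t in truth]
    sample = rng.sample(data, 500)  # the expensive fit sees only the sample
    weights, means, variances = fit_two_component_mixture(sample)
    labels = classify(data, weights, means, variances)
    accuracy = sum(l == t for l, t in zip(labels, truth)) / len(data)
    return means, accuracy
```

Calling `demo()` returns the fitted component means and the fraction of synthetic points assigned to their generating component. The point of the design is that the iterative fitting cost depends on the sample size, while the full data set is touched only once, in the single classification pass.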
