Scholarly Works (4 results)

Article
Peer Reviewed

Alternatives to the k-means algorithm that find better clusterings

Technical Reports (2002)

We investigate here the behavior of the standard k-means clustering algorithm and several alternatives to it: the k-harmonic means algorithm due to Zhang and colleagues, fuzzy k-means, Gaussian expectation-maximization, and two new variants of k-harmonic means. Our aim is to find which aspects of these algorithms contribute to finding good clusterings, as opposed to converging to a low-quality local optimum. We describe each algorithm in a unified framework that introduces separate cluster membership and data weight functions. We then show that the algorithms do behave very differently from each other on simple low-dimensional synthetic datasets, and that the k-harmonic means method is superior. Having a soft membership function is essential for finding high-quality clusterings, and having a non-constant data weight function is also useful.

Pre-2018 CSE ID: CS2002-0702
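
As a rough illustration of the unified framework mentioned in the abstract above, the sketch below keeps the cluster membership function and the data weight function separate and plugs both into a single center update. The function names, the weighted-mean form of the update, the inverse-squared-distance soft membership, and the toy data are all assumptions made for illustration. They are not taken from the report, and the soft variant shown is not the k-harmonic means formula.

```python
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def hard_membership(x: Point, centers: List[Point]) -> List[float]:
    """k-means style membership: everything goes to the nearest center."""
    dists = [sq_dist(x, c) for c in centers]
    nearest = dists.index(min(dists))
    return [1.0 if j == nearest else 0.0 for j in range(len(centers))]


def soft_membership(x: Point, centers: List[Point], eps: float = 1e-12) -> List[float]:
    """Illustrative soft membership (an assumption, not the report's formula):
    proportional to inverse squared distance, normalized to sum to 1."""
    inv = [1.0 / (sq_dist(x, c) + eps) for c in centers]
    total = sum(inv)
    return [v / total for v in inv]


def constant_weight(x: Point, centers: List[Point]) -> float:
    """Every data point counts equally, as in standard k-means."""
    return 1.0


def update_centers(
    data: List[Point],
    centers: List[Point],
    membership: Callable[[Point, List[Point]], List[float]],
    weight: Callable[[Point, List[Point]], float],
) -> List[Tuple[float, ...]]:
    """Move each center to the mean of the data, weighted by membership * weight."""
    k, dim = len(centers), len(centers[0])
    num = [[0.0] * dim for _ in range(k)]
    den = [0.0] * k
    for x in data:
        m = membership(x, centers)
        w = weight(x, centers)
        for j in range(k):
            coeff = m[j] * w
            den[j] += coeff
            for d in range(dim):
                num[j][d] += coeff * x[d]
    # A center that attracts no mass stays where it is.
    return [
        tuple(n / den[j] for n in num[j]) if den[j] > 0 else tuple(centers[j])
        for j in range(k)
    ]


def cluster(data, centers, membership=hard_membership, weight=constant_weight, iters=20):
    """Run a fixed number of center updates and return the final centers."""
    for _ in range(iters):
        centers = update_centers(data, centers, membership, weight)
    return centers


toy = [(0.0,), (1.0,), (10.0,), (11.0,)]
start = [(0.0,), (10.0,)]
kmeans_centers = cluster(toy, start)                              # hard membership
soft_centers = cluster(toy, start, membership=soft_membership)    # soft membership
```

Swapping in a different `membership` or `weight` callable changes the algorithm without touching the update loop, which is the design point the abstract attributes to the framework.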

Article
Peer Reviewed

Learning the k in k-means

Technical Reports (2002)

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present a new algorithm for choosing k that is based on a new statistical test for the hypothesis that a subset of data follows a Gaussian distribution. The algorithm runs k-means with increasing k until the test fails to reject the hypothesis that the data assigned to each k-means center are Gaussian. We present results from experiments on synthetic and real-world data showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity.

Pre-2018 CSE ID: CS2002-0716
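
The outer loop described in the abstract above (run k-means with increasing k until no cluster fails a Gaussianity test) can be sketched roughly as follows. This is a minimal one-dimensional sketch under stated assumptions: the Jarque-Bera statistic is used as a stand-in normality test and is not the new test the report proposes, and the helper names, the quantile-based initialization, and the deterministic test data are all hypothetical.

```python
from statistics import NormalDist

CHI2_2DOF_5PCT = 5.991  # 5% critical value of the chi-square distribution with 2 d.o.f.


def kmeans_1d(data, k, iters=50):
    """Plain 1-D k-means with a deterministic, quantile-based initialization."""
    s = sorted(data)
    n = len(s)
    centers = [s[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in s:
            nearest = min(range(k), key=lambda i: (x - centers[i]) ** 2)
            clusters[nearest].append(x)
        new = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters


def jarque_bera(xs):
    """Jarque-Bera statistic: large values indicate non-Gaussian skew or kurtosis."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    if m2 == 0:
        return 0.0
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)


def looks_gaussian(xs, min_points=8):
    """Stand-in test: do not reject Gaussianity below the 5% critical value."""
    if len(xs) < min_points:  # too few points to test; assumed to pass
        return True
    return jarque_bera(xs) < CHI2_2DOF_5PCT


def choose_k(data, max_k=10):
    """Increase k until every cluster passes the stand-in Gaussianity test."""
    for k in range(1, max_k + 1):
        centers, clusters = kmeans_1d(data, k)
        if all(looks_gaussian(c) for c in clusters):
            return k, centers
    return max_k, centers


def gaussian_quantiles(mu, sigma, n):
    """Deterministic Gaussian-shaped sample built from evenly spaced quantiles."""
    nd = NormalDist()
    return [mu + sigma * nd.inv_cdf((i + 0.5) / n) for i in range(n)]


data = gaussian_quantiles(0.0, 1.0, 200) + gaussian_quantiles(10.0, 1.0, 200)
k, centers = choose_k(data)
```

On this toy input, a single cluster is strongly bimodal and is rejected, while two clusters each pass, so the loop stops at two. A real implementation would need a multivariate test and a strategy for where to place new centers, neither of which this sketch attempts.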

Article
Peer Reviewed

Comparing Multinomial and K-Means Clustering for SimPoint

Technical Reports (2005)

SimPoint is a technique for picking which parts of a program's execution to simulate in order to obtain a complete picture of that execution. SimPoint uses data clustering algorithms from machine learning to automatically find repetitive (similar) patterns in a program's execution, and it chooses one sample to represent each unique repetitive behavior. Taken together, these samples give an accurate picture of the program's complete execution. SimPoint is based on the k-means clustering algorithm. Recent work has proposed a different clustering method based on multinomial models, but provided only a preliminary comparison and analysis. In this work we provide a detailed comparison of k-means and multinomial clustering for SimPoint. We show that k-means performs better than the recently proposed multinomial clustering approach. We then propose two improvements to the prior multinomial clustering approach, in feature reduction and in the picking of simulation points, which allow multinomial clustering to perform as well as k-means. We conclude by examining how multinomial clustering might be combined with k-means.

Pre-2018 CSE ID: CS2005-0841
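
The idea of choosing one sample to represent each repetitive behavior, as described in the abstract above, might look like the sketch below once the execution intervals have been clustered. The selection rule used here (the interval closest to its cluster centroid, weighted by the cluster's share of all intervals) is an assumption for illustration, and the function name, feature vectors, and labels are hypothetical rather than taken from the report.

```python
from typing import Dict, List, Sequence, Tuple


def pick_simulation_points(
    vectors: List[Sequence[float]], labels: List[int]
) -> Dict[int, Tuple[int, float]]:
    """For each cluster label, return (representative interval index, weight).

    The representative is the member closest to the cluster centroid; the
    weight is the fraction of all intervals that fall in that cluster.
    """
    groups: Dict[int, List[int]] = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)

    result: Dict[int, Tuple[int, float]] = {}
    for lab, members in groups.items():
        dim = len(vectors[members[0]])
        centroid = [
            sum(vectors[i][d] for i in members) / len(members) for d in range(dim)
        ]
        rep = min(
            members,
            key=lambda i: sum((a - b) ** 2 for a, b in zip(vectors[i], centroid)),
        )
        result[lab] = (rep, len(members) / len(labels))
    return result


# Hypothetical per-interval feature vectors and cluster labels.
intervals = [(0.0, 0.0), (0.2, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1), (5.0, 5.6)]
labels = [0, 0, 0, 1, 1, 1]
points = pick_simulation_points(intervals, labels)
```

Simulating only the chosen intervals and combining their results by weight is what would let a few samples stand in for the whole run under this reading.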

Article
Peer Reviewed

Building a Hierarchy of Variable Length Intervals to Capture Hierarchical Phase Behavior

Technical Reports (2004)

Most programs are repetitive: similar behavior can be seen at different execution times. Proposed algorithms automatically group these similar intervals of execution into phases, where the intervals in a phase have homogeneous behavior and similar resource requirements. These prior techniques have focused on using fixed-length intervals for finding phase behavior. Using fixed-length intervals can make finding the true periodic repeating phase behavior difficult, since the fixed-length intervals can be out of sync with the size of the ideal phase behavior. In addition, focusing on a single fixed interval size limits the observable phase behavior to what is seen at that interval size, when in reality there is a whole hierarchy of phase behavior at many different interval sizes. In this paper, we present an automated approach for breaking the program's execution up into variable-length intervals that match the phase behavior of the program. We then provide an algorithm for creating a hierarchy of variable-length intervals and use this hierarchy to expose a program's phase behavior from small to large time scales.

Pre-2018 CSE ID: CS2004-0781
