eScholarship
Open Access Publications from the University of California

Probabilistic clustering using hierarchical models

Abstract

This paper addresses the problem of clustering data when the available measurements are not multivariate vectors of fixed dimensionality. For example, one might have data from a set of medical patients, where each patient's record contains time series, images, text, and multivariate data. We propose a general probabilistic framework for clustering heterogeneous data of this form. We focus on two-level probabilistic hierarchical models, consisting of a high-level mixture model on parameters and a low-level model for observations. This framework permits probabilistic clustering of "objects" (sequences, histograms, images, etc.) using an extension of the expectation-maximization (EM) algorithm, which we derive. We further show that earlier, more intuitive clustering algorithms can be viewed as special cases (approximations) of the proposed framework. The paper includes several illustrations of the method, including an application to clustering two-dimensional histograms of red blood cell data in a medical diagnosis context.
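As a rough illustration of the two-level idea, the sketch below clusters variable-length one-dimensional sequences with EM. It is not the paper's algorithm: the Gaussian forms, the fixed variances `sigma` and `tau`, the function name `cluster_objects`, and the synthetic data are all assumptions chosen so that the object-level parameter can be integrated out in closed form.

```python
import numpy as np

def cluster_objects(objects, n_clusters=2, sigma=1.0, tau=0.5, n_iter=50):
    """EM for a toy two-level model (every modelling choice is an assumption).

    High level: object i in cluster k has its own mean theta_i ~ N(mu_k, tau^2).
    Low level:  each observation in object i is y ~ N(theta_i, sigma^2).
    """
    xbar = np.array([np.mean(o) for o in objects])  # per-object sample mean
    n = np.array([len(o) for o in objects])         # per-object length
    # Integrating out theta_i: xbar_i | cluster k ~ N(mu_k, tau^2 + sigma^2 / n_i)
    var = tau**2 + sigma**2 / n

    # Deterministic initialisation from quantiles of the object means.
    mu = np.quantile(xbar, (np.arange(n_clusters) + 0.5) / n_clusters)
    pi = np.full(n_clusters, 1.0 / n_clusters)

    for _ in range(n_iter):
        # E-step: posterior cluster membership of each object.
        log_p = (np.log(pi + 1e-12)[None, :]
                 - 0.5 * np.log(2 * np.pi * var)[:, None]
                 - 0.5 * (xbar[:, None] - mu[None, :]) ** 2 / var[:, None])
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: precision-weighted cluster means and mixing weights.
        w = resp / var[:, None]
        mu = (w * xbar[:, None]).sum(axis=0) / w.sum(axis=0)
        pi = resp.mean(axis=0)
    return resp, mu, pi

# Synthetic "objects": 20 sequences of different lengths from two clusters.
rng = np.random.default_rng(0)
true_mu = np.array([-3.0, 3.0])
z = np.repeat([0, 1], 10)                            # true cluster labels
theta = true_mu[z] + 0.5 * rng.standard_normal(20)   # object-level means
lengths = rng.integers(5, 30, size=20)
objects = [theta[i] + rng.standard_normal(lengths[i]) for i in range(20)]

resp, mu, pi = cluster_objects(objects)
labels = resp.argmax(axis=1)
```

Because each object is summarised by its sample mean and length, sequences of different lengths can be clustered together; longer sequences receive more weight in the cluster-mean update through the smaller variance `tau**2 + sigma**2 / n`.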
