Skip to main content
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations bannerUC San Diego

Clustering and Mixture Modeling: Some Methodology and Theory

Abstract

In this dissertation, we propose several methodology in clustering and mixture modeling when the user has some prior knowledge regarding one or more of the clusters or mixture components. We also provide theory in the consistency of K-medoids, and prove its consistency under two versions of ordinal information.

We begin by considering the problem of clustering and mixture modeling in situations where the user has relevant prior knowledge on the center separation between clusters or mixture components. We propose a dynamic programming approach to solving the K-means problem exactly with a separation constraint on the centers in the context of real-valued data, and propose an EM algorithm that incorporates such a constraint in the context of fitting a Gaussian mixture model.

We then move onto two-component mixture models, where one component plays the role of the background component. We consider situations where the user has prior information on this background component’s distribution: symmetric, monotonic, or log-concave. In each setting, we derive estimators for the background component and provide relevant theory. While estimating the log-concave background distribution, we also propose a method to estimate the largest concave minorant.

In addition, we consider the situation where the user has information on the number of modes of a mixture density. We provide a dynamic programming approach to fitting a density function by maximum likelihood when it is constrained to have a given number of modal intervals. When this value is not available, we provide several data-driven ways for selecting it.

For the last project, we establish the consistency of K-medoids in the context of metric spaces. Our general approach applies also to non-metric settings where only an ordering of the dissimilarities is available. We consider two types of ordinal information: one where all quadruple comparisons are available; and one where only triple comparisons are available.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View