Contributions to Directional Statistics Based Clustering Methods
- Author(s): Wainwright, Brian
- Advisor(s): Jammalamadaka, Sreenivasa R
- et al.
Statistical tools such as finite mixture models and model-based clustering have been used extensively in many fields, including natural language processing and genomic research, to investigate everything from copyright infringement to the mysteries of the evolutionary process. In model-based clustering, the samples are assumed to be realizations of a mixture distribution consisting of one or more mixture components, and the goal is to recover this underlying model from the observed data. In our investigation we explore directional distributions on the circle, the sphere, and the hypersphere, where the component distributions are, respectively, the von Mises distributions in two dimensions, the von Mises-Fisher distributions in three dimensions, and $p$-dimensional von Mises-Fisher distributions for large $p$. In each case, the observations lie on the circle, the unit sphere, or the hypersphere $S^{p-1}$ embedded in $\mathbb{R}^p$, either because of the inherent structure of the data or as a result of normalizing the curves. We look specifically at clustering curves around the unit circle $S^1$, treating them first as mixture distributions and, in an alternate approach, as functional data that can be explored via their Fourier coefficients. We also investigate clustering high-dimensional, extremely sparse textual data by treating Twitter data from the day of the 2016 United States presidential election as document vectors on the unit hypersphere. Finally, we introduce and discuss a broad family of spherical distributions that we call the “Generalized Fisher-Bingham family,” and present details of a software package that we developed to simulate and visualize members of this family.
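As a rough illustration of the ideas summarized above, the following Python sketch evaluates a von Mises mixture density on the circle, computes the posterior component memberships that model-based clustering relies on, and projects a vector onto the unit hypersphere. It is not the software package described in the abstract. All function names (`bessel_i0`, `von_mises_pdf`, `mixture_pdf`, `responsibilities`, `normalize`) and the parameter values are hypothetical, chosen only for illustration. The Bessel function is approximated by a truncated power series, which is assumed adequate only for moderate concentration values.

```python
import math

def bessel_i0(x, terms=50):
    """Modified Bessel function I_0(x), approximated by its truncated power series."""
    total = 0.0
    term = 1.0  # k = 0 term of sum_k (x^2/4)^k / (k!)^2
    for k in range(terms):
        total += term
        term *= (x / 2.0) ** 2 / ((k + 1) ** 2)
    return total

def von_mises_pdf(theta, mu, kappa):
    """Von Mises density on the circle with mean direction mu and concentration kappa."""
    return math.exp(kappa * math.cos(theta - mu)) / (2.0 * math.pi * bessel_i0(kappa))

def mixture_pdf(theta, weights, mus, kappas):
    """Density of a finite mixture of von Mises components."""
    return sum(w * von_mises_pdf(theta, m, k)
               for w, m, k in zip(weights, mus, kappas))

def responsibilities(theta, weights, mus, kappas):
    """Posterior probability that theta was generated by each component."""
    parts = [w * von_mises_pdf(theta, m, k)
             for w, m, k in zip(weights, mus, kappas)]
    total = sum(parts)
    return [p / total for p in parts]

def normalize(v):
    """Project a nonzero vector onto the unit hypersphere by dividing by its norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical two-component mixture (illustrative values only).
weights = [0.4, 0.6]
mus = [0.0, math.pi]
kappas = [2.0, 5.0]

# An angle near pi should be attributed mostly to the second component.
print(responsibilities(3.0, weights, mus, kappas))

# A toy term-count vector mapped onto the unit hypersphere.
print(normalize([3.0, 0.0, 4.0]))
```

In a full clustering procedure these memberships would typically be recomputed iteratively, for example inside an EM algorithm, while the mixture parameters are re-estimated. That step is omitted here.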