Generalized statistical methods for mixed exponential families
- Author(s): Levasseur, Cécile;
- et al.
This dissertation considers the problem of learning the underlying statistical structure of complex data sets for fitting a generative model, and for both supervised and unsupervised data-driven decision making purposes. Using properties of exponential family distributions, a new unified theoretical model called Generalized Linear Statistics is established. The complexity of data is generally a consequence of the existence of a large number of components and the fact that the components are often of mixed data types (i.e., some components might be continuous, with different underlying distributions, while other components might be discrete, such as categorical, count or Boolean). Such complex data sets are typical in drug discovery, health care, or fraud detection. The proposed statistical modeling approach is a generalization and amalgamation of techniques from classical linear statistics placed into a unified framework referred to as Generalized Linear Statistics (GLS). This framework includes techniques drawn from latent variable analysis as well as from the theory of Generalized Linear Models (GLMs), and is based on the use of exponential family distributions to model the various mixed types (continuous and discrete) of complex data sets. The methodology exploits the connection between data space and parameter space present in exponential family distributions and solves a nonlinear problem by using classical linear statistical tools applied to data that have been mapped into parameter space. One key aspect of the GLS framework is that often the natural parameter of the exponential family distributions is assumed to be constrained to a lower dimensional latent variable subspace, modeling the belief that the intrinsic dimensionality of the data is smaller than the dimensionality of the observation space. The framework is equivalent to a computationally tractable, mixed data-type hierarchical Bayes graphical model assumption with latent variables constrained to a low- dimensional parameter subspace. We demonstrate that exponential family Principal Component Analysis, Semi- Parametric exponential family Principal Component Analysis, and Bregman soft clustering are not separate unrelated algorithms, but different manifestations of model assumptions and parameter choices taken within this common GLS framework. Because of this insight, these algorithms are readily extended to deal with the important mixed data -type case. This framework has the critical advantage of allowing one to transfer high-dimensional mixed-type data components to low-dimensional common-type latent variables, which are then, in turn, used to perform regression or classification in a much simpler manner using well-known continuous-parameter classical linear techniques. Classification results on synthetic data and data sets from the University of California, Irvine machine learning repository are presented