Bayesian Modeling for Heterogeneous Multivariate Data
- Author(s): Lui, Arthur Lui Laureano
- Advisor(s): Lee, Juhee
- et al.
This dissertation comprises three projects that present Bayesian statistical methods for analyzing heterogeneous multivariate data, with application to marker expression data obtained from cytometry by time-of-flight (CyTOF). The first project presents a Bayesian feature allocation model (FAM) for identifying cell subpopulations from multiple samples of cell surface or intracellular marker expression levels measured by CyTOF. Cell subpopulations are characterized by differences in their marker expression patterns, and individual cells are clustered into the subpopulations according to the patterns of their observed expression levels. A finite Indian buffet process is used to model subpopulations as latent features, and a model-based method built on these latent features is used to construct cell clusters within each sample. Non-ignorable missing data due to technical artifacts in mass cytometry instruments are accounted for by defining a static missingship mechanism. The second project builds upon the first by introducing a repulsive FAM (rep-FAM), which restructures the probability distribution of a traditional FAM so that the identified features are more likely to be distinct from each other. A conventional FAM assigns positive probability to repeating a feature; the rep-FAM eliminates this problem and also increases the probability of larger differences between features. The rep-FAM thus yields clusters that are more biologically interpretable than those identified by a conventional FAM. The third project presents methods for assessing differences in distributions between two experimental conditions, in the context of CyTOF data. A zero-inflated mixture of log-skew-t distributions is used to model the multimodal, heavy-tailed, and often highly skewed distributions of marker expression levels. A distance metric is proposed to quantify the degree of difference between the distributions under the experimental conditions.
In each chapter, we explore the performance and limitations of our proposed methodologies through simulation studies and real data analyses.
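As a rough illustration of the finite Indian buffet process prior mentioned in the abstract, the following minimal sketch draws a binary feature matrix from the standard Beta-Bernoulli construction and counts repeated feature columns, which is the behavior the rep-FAM is designed to rule out. It is not the dissertation's implementation: the function names, the matrix dimensions, the value of `alpha`, and the `Beta(alpha / K, 1)` parameterization are all assumptions chosen for illustration.

```python
import random


def sample_finite_ibp(num_rows, num_features, alpha, rng):
    """Draw a binary matrix Z from a finite (Beta-Bernoulli) IBP prior.

    Assumed parameterization (not taken from the dissertation):
    v_k ~ Beta(alpha / K, 1) and Z[j][k] ~ Bernoulli(v_k).
    """
    # One inclusion probability per latent feature (column).
    probs = [rng.betavariate(alpha / num_features, 1.0)
             for _ in range(num_features)]
    # Each row (e.g. a marker) switches each feature on independently.
    return [[1 if rng.random() < v else 0 for v in probs]
            for _ in range(num_rows)]


def count_repeated_columns(z):
    """Return how many columns of z duplicate another column."""
    columns = list(zip(*z))
    return len(columns) - len(set(columns))


rng = random.Random(0)  # fixed seed so the draw is reproducible
Z = sample_finite_ibp(num_rows=6, num_features=4, alpha=2.0, rng=rng)
for row in Z:
    print(row)
print("repeated feature columns:", count_repeated_columns(Z))
```

Under this prior, nothing prevents two columns of `Z` from being identical, so a repeated-column count above zero is possible in any draw; a repulsive prior would assign such matrices zero probability.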