Large-scale Inference of Correlation between Complex Biological Traits
- Zhang, Zhenyu
- Advisor(s): Suchard, Marc A.
Abstract
Inferring dependencies between complex biological traits while accounting for evolutionary relationships among specimens is of great scientific interest, yet remains infeasible when trait and specimen counts grow large. I aim to develop a scalable Bayesian inference framework to assess correlation between complex traits along the evolutionary tree that relates the specimens and is informed by molecular sequences. To accommodate discrete and continuous traits, I posit a phylogenetic multivariate probit model that uses a latent variable framework. Posterior computation under this model requires integrating out many latent variables, or equivalently making many computationally expensive draws from a high-dimensional multivariate truncated normal distribution (MTN). To tackle this challenge, I propose an inference scheme that exploits 1) representative cutting-edge Markov chain Monte Carlo (MCMC) methods, including the bouncy particle sampler (BPS), the Markovian Zigzag sampler (ZZ), and Zigzag Hamiltonian Monte Carlo (Zigzag-HMC), that can simultaneously sample all truncated normal dimensions, and 2) novel dynamic programming strategies that reduce the cost of likelihood and gradient evaluations for all three samplers to linear in sample size. Compared to the previous best practices that employ multiple-try rejection sampling, my approach achieves an order-of-magnitude speedup, making it possible to tackle previously unworkable large-scale problems. In an application with 535 HIV-1 viruses and 24 traits that necessitates sampling from an 11,235-dimensional MTN, my method makes it possible to examine the conditional dependencies between 21 immune escape mutations and 3 virulence measurements. In a second application, I study the evolution of influenza H1N1 glycosylations using around 900 viruses. Lastly, I extend the phylogenetic probit model to incorporate categorical traits and demonstrate its use to investigate Aquilegia flower and pollinator coevolution.
In summary, the contribution of this dissertation is two-fold. First, I develop a state-of-the-art solution for a long-standing problem in Bayesian phylogenetics: learning correlation among complex biological traits with joint tree modeling. Second, further empirical and theoretical investigation of BPS, ZZ, and Zigzag-HMC yields insight into the differences and similarities between these recently developed MCMC samplers. As Zigzag-HMC outperforms the other two on MTNs, I also implement this approach in a standalone R package, aiming to provide a general, efficient tool for high-dimensional MTN simulation.
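To make the latent variable idea concrete, the following minimal Python sketch may help. It is an illustration under simplifying assumptions, not the dissertation's method and not the multiple-try rejection baseline mentioned above: each binary trait is treated as the sign of a latent normal variable, so conditioning on the observed signs leaves a normal distribution truncated to an orthant. The sampler is a naive accept/reject loop, the covariance is the identity (no tree structure), and the function names `draw_mvn`, `sample_latent_by_rejection`, and `identity` are hypothetical.

```python
import random


def draw_mvn(chol, rng):
    """Draw x = L z with z ~ N(0, I); chol is a lower-triangular list of lists."""
    d = len(chol)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [sum(chol[i][j] * z[j] for j in range(i + 1)) for i in range(d)]


def sample_latent_by_rejection(signs, chol, rng, max_tries=100_000):
    """Naive rejection sampler (illustrative assumption, not the thesis method).

    Repeatedly draws from the untruncated normal and accepts the first draw
    whose coordinates match the observed signs (+1 means x_i > 0).
    Returns the accepted draw and the number of tries it took.
    """
    for tries in range(1, max_tries + 1):
        x = draw_mvn(chol, rng)
        if all((xi > 0) == (s > 0) for xi, s in zip(x, signs)):
            return x, tries
    raise RuntimeError("no draw accepted within max_tries")


def identity(d):
    """Identity Cholesky factor: independent latent variables (toy assumption)."""
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


rng = random.Random(0)
for d in (2, 6, 10):
    signs = [1] * d  # every trait observed in its "positive" state
    costs = [sample_latent_by_rejection(signs, identity(d), rng)[1] for _ in range(20)]
    print(f"dimension {d}: average tries per accepted draw = {sum(costs) / len(costs):.1f}")
```

With independent latent variables, the chance that a single draw matches all d signs is 2^-d, so the average number of tries grows exponentially with dimension. This is why naive rejection is hopeless for an MTN with thousands of dimensions, and why samplers that update all truncated dimensions jointly are attractive.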