Efficient Non-parametric Methods for Phylogenomics Inference Using Tree Topology and Branch Length
Skip to main content
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations bannerUC San Diego

Efficient Non-parametric Methods for Phylogenomics Inference Using Tree Topology and Branch Length

Abstract

Computational phylogenetics has been dominated by model-based parametric inference, which is sensitive to model misspecification and has high computational burden. In this dissertation, I present efficient non-parametric methods to facilitate phylogenetic analyses. These methods connect tree topology with the molecular clock principle - as manifested by tree branch lengths - using discrete and numerical optimization. I conclude with an application on one of the largest phylogenomic analyses performed to date. Chapter 1 presents TreeShrink, a non-parametric algorithm to filter errors from phylogenetic trees. I propose an efficient algorithm to solve a new optimization problem that I define. Based on the set of the solutions to this problem, TreeShrink computes the impact of individual species on the tree diameter and transforms anomaly detection in tree space to outlier detection in Euclidean space. Our results on simulated and empirical data show effective outlier detection without making assumptions about root placement or branch length distributions. While TreeShrink focuses on branch length, Chapter 2 focuses on topology. It introduces tripVote, a polynomial time algorithm to complete a set of gene trees without a reference species tree. Each gene tree is imputed such that its total quartet distance to the other gene trees is minimized. I develop a quasi-linear time algorithm to solve this problem and show that tripVote is accurate on both simulated and empirical data. Chapter 3 presents MinVar, a method for tree rooting. MinVar roots a given tree at the point that minimizes the root-to-tip variance. I present a linear-time algorithm to find the MinVar point and prove important properties that make it consistent with the molecular clock principle. Empirically, MinVar is more accurate than other linear-time rooting methods. Chapter 4 introduces wLogDate, a method that computes divergent times by minimizing the weighted least squares of the log-transformed mutation rates from their mean value. The advantage of this log-transformed penalty function over those derived from a strict-clock model is demonstrated both theoretically and empirically. On simulated data, wLogDate is more accurate than the alternatives in most model conditions and is more than ten times faster than a state-of-the-art method that uses Bayesian MCMC. Chapter 5 introduces CatTime, a method that uses a categorical distribution to fit the unknown clock model and co-estimates the rate categories and branch lengths in time units using an EM-based algorithm. On simulated data of Angiosperms and HIV, the method is more accurate than the alternatives when the clock model contains local clocks or heterogeneous rates. The last Chapter describes an empirical analysis that infers a phylogenetic tree of 10,575 microbial species. This analysis used some of the aforementioned methods and motivated the development of others. The resulting dataset has been used as a reference library in many downstream analyses.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View