Scalable algorithms for detecting boundaries and relationships of species from phylogenetic data
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Abstract

Many disciplines in the life sciences rely, directly or indirectly, on evolutionary relationships among species, which has made species tree inference from sequence data one of the central problems in evolutionary biology. In the last decade, advances in sequencing technologies have drastically reduced the cost of DNA sequencing and made molecular sequence data ubiquitous. This data avalanche has impacted phylogenetics, and researchers face new computational challenges in handling the new data. The need for phylogeny inference and update methods that remain highly accurate on ultra-large datasets has increased, and researchers seek new methods that can enforce constraints from preceding studies when handling new data. Another, more inherent challenge in phylogenetics is the heterogeneity of evolutionary histories among different genes across species, populations, or even individuals. This discordance among histories has been modeled by the multi-species coalescent (MSC) model, which has been adopted by several tree inference tools. This dissertation discusses methods developed to account for the inherent heterogeneity of phylogenetic data, as modeled by the MSC model. Its focus is on overcoming the challenges described above and introducing scalable algorithms that can detect the boundaries and relationships of species from phylogenetic data. In Chapter 2, I introduce a new method for updating a species tree with new sequences, which helps make use of existing phylogenies when inferring new species trees. This method, called INSTRAL, can update a backbone species tree with one new species at a time or with several in parallel. Thus, while scalability is achieved, the relationships among the new species are not recovered, and post-processing is needed to obtain a fully resolved species tree with no ambiguity.
Chapter 3 introduces another method for updating phylogenies with multiple new species, one that also recovers the relationships among the new species. This method is in effect a constrained species tree inference method, as it creates a constraint-compatible species tree from the input constraint tree and the set of gene trees. Constraints can come in several forms, one of which is the monophyly of the individuals of a species in the species tree. Chapter 4 introduces a summary method for multi-individual data that infers a species tree under the constraint that the individuals of each species are monophyletic. However, these species boundaries are not always known a priori. Species delimitation is a challenging task in itself, and in Chapter 5, I describe a method for species delimitation based on gene tree topologies that scales to large datasets with thousands of genes. Finally, in Chapter 6, I describe a scalable quartet-based method that co-estimates gene trees and the species tree to infer the species tree more accurately.
