New methods for inferring, assessing, and using phylogenetic trees from genomic and microbiome data
Phylogenies are trees that show the evolutionary relationships among species, and reconstructing phylogenies from molecular data can be framed as an optimization problem. Recent advances in DNA sequencing have led to extensive application of phylogenetic inference to (meta)genomic data. However, the scale and complexity of the data have presented researchers with new algorithmic and statistical challenges, in particular in reducing noise and estimating statistical support. This dissertation addresses these challenges.
A significant challenge in using genomic data for phylogenetics (phylogenomics) is that evolutionary histories can be inconsistent across different parts of the genome, so phylogenomic methods need to account for these inconsistencies. One scalable solution is to use summary methods, in which a tree is first inferred for each gene and the gene trees are then summarized to build the species tree. Chapter 2 of this dissertation presents DISTIQUE, a scalable and accurate summary method for reconstructing species trees from gene trees.
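The two-step idea behind summary methods can be illustrated with a toy sketch. This is not DISTIQUE: it assumes each gene tree has already been reduced to an unrooted topology on four hypothetical taxa (A, B, C, D), and it swaps in a plain majority vote over those topologies as the "summary" step. The function names and the input encoding are illustrative assumptions.

```python
from collections import Counter


def quartet_topology(split):
    """Canonical form of an unrooted quartet topology given as ((a, b), (c, d))."""
    left, right = split
    return frozenset({frozenset(left), frozenset(right)})


def summarize_quartets(gene_tree_splits):
    """Return the most frequent quartet topology and its relative frequency."""
    counts = Counter(quartet_topology(s) for s in gene_tree_splits)
    topology, count = counts.most_common(1)[0]
    return topology, count / len(gene_tree_splits)


# Hypothetical input: one quartet topology per gene tree (step 1 is assumed done).
gene_trees = [
    (("A", "B"), ("C", "D")),
    (("B", "A"), ("D", "C")),  # same topology, written in a different order
    (("A", "C"), ("B", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "D"), ("B", "C")),
]

# Step 2: summarize the gene trees into a single estimate.
best, freq = summarize_quartets(gene_trees)
```

Real summary methods work on many taxa and combine information across all quartets or distances; the sketch only shows why discordant gene trees can still yield a single summarized estimate.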
Another major challenge in phylogenomics is interpreting inferred phylogenies, especially in the presence of noise and gene-tree inconsistencies. Biologists rely on measures of statistical support to interpret branches of the phylogeny. Chapter 3 introduces a highly scalable and reliable Bayesian measure of support, localPP, and Chapter 4 introduces a frequentist version of localPP for performing hypothesis testing.
When any summary method is used, the quality of the inferred species tree depends heavily on the quality of the gene trees. In Chapter 5, we identify one factor that reduces gene tree accuracy (gene fragmentation) and introduce a filtering strategy that effectively reduces error in both gene trees and species trees. Further, Chapter 6 introduces a visualization framework, DiscoVista, to assist biologists in interpreting potentially discordant phylogenetic results.
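A minimal sketch of what filtering fragmentary sequences might look like is given below. The criterion used here (fraction of non-gap sites in a gene alignment), the 0.5 threshold, the gap characters, and the function name are all assumptions made for illustration; the exact filtering rule developed in Chapter 5 is not specified in this abstract.

```python
def filter_fragments(alignment, min_fraction=0.5, gap_chars="-?"):
    """Drop sequences whose share of non-gap sites is below min_fraction.

    alignment: dict mapping taxon name to its aligned sequence for one gene.
    min_fraction and gap_chars are hypothetical defaults, not values from the source.
    """
    kept = {}
    for name, seq in alignment.items():
        if not seq:
            continue  # an empty sequence carries no signal
        non_gap = sum(1 for ch in seq if ch not in gap_chars)
        if non_gap / len(seq) >= min_fraction:
            kept[name] = seq
    return kept


# Hypothetical gene alignment with one fragmentary sequence.
gene_alignment = {
    "taxon1": "ACGTACGTAC",
    "taxon2": "ACGT------",  # 4 of 10 sites present, below the threshold
    "taxon3": "AC-TACGTA-",  # 8 of 10 sites present
}

filtered = filter_fragments(gene_alignment)
```

The filtered alignment would then be passed to gene tree inference, so that fragmentary sequences do not distort the gene trees fed to the summary method.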
The final chapter focuses on the use of phylogenies in microbiome studies, where the goal is to analyze genetic material from environmental samples and to infer associations between genotype and phenotypic properties of the samples. A main challenge in microbiome analyses is the combination of high variability across samples and small sample sizes. Chapter 7 introduces TADA, a new phylogeny-based method of data augmentation that improves the accuracy of classification methods applied to microbiome data.