Thanks to the advances in sequencing technologies in the last two decades, the set of available whole-genome sequences has been expanding rapidly. One of the challenges in phylogenetics is accurate large-scale phylogenetic inference based on whole-genome sequences. A related challenge is using incomplete genome-wide data in an assembly-free manner for accurate sample identification with reference to phylogeny. This dissertation proposes new scalable and accurate algorithms to address these two challenges. First, I present a family of scalable methods called TreeCluster for breaking a large set of sequences into evolutionary homogeneous clusters. Second, I present two algorithms for accurate phylogenetic placement of genomic sequences on ultra-large single-gene and whole-genome based trees. The first version, APPLES, scales linearly with the reference size while APPLES-2 scales sub-linearly thanks to a divide-and-conquer strategy based on the TreeCluster method. Third, I develop a solution for assembly-free sample phylogenetic placement for a particularly challenging case when the specimen is a mixture of two cohabiting species or a hybrid of two species. Fourth, I address one limitation of assembly-free methods---their reliance on simple models of sequence evolution---by developing a technique to compute evolutionary distances under a complex 4-parameter model called TK4. Finally, I introduce a divide-and-conquer workflow for incrementally growing and updating ultra-large phylogenies using many of the ingredients developed in other chapters. This workflow (uDance) is accurate in simulations and can build a 200,000-genome microbial tree-of-life based on 388 marker genes.
Cookie SettingseScholarship uses cookies to ensure you have the best experience on our website. You can manage which cookies you want us to use.Our Privacy Statement includes more details on the cookies we use and how we protect your privacy.