Phylogenetics has been widely adopted across biology. Yet a continuing difficulty in phylogenetics is modeling all the biological processes that shape evolution while maintaining computational scalability. My dissertation addresses several such problems, in each case developing scalable algorithms that advance biological realism. Much of the dissertation concerns species tree reconstruction in the face of discordance among the evolutionary histories of genes (gene trees), which arises from biological processes such as incomplete lineage sorting.
Past work had already developed statistically consistent methods such as ASTRAL for species tree reconstruction given gene trees. However, these methods failed to account for gene tree error (GTE). Contracting low-support branches was a potential solution, but ASTRAL was not efficient in handling polytomies. Here, I introduce ASTRAL-III, which drastically reduces the computational complexity of handling polytomies and improves robustness to GTE. Not satisfied with the need for a contraction threshold, I also introduce weighted ASTRAL, a method that down-weights error-prone gene tree branches and further improves accuracy. Furthermore, I propose a method called ASTERISK to infer the species tree directly from multiple sequence alignments (MSAs), forgoing the need to infer error-prone gene trees. Having dealt with gene tree errors, I turn to errors in MSAs, which can impact phylogenetic analyses. I introduce TAPER, a novel two-dimensional outlier detection algorithm that looks for errors in small species-specific stretches of MSAs. TAPER can reduce GTE by finding much of the error while removing very little data.
Another shortcoming of ASTRAL was that it failed to model gene duplication and loss (GDL). I present a new algorithm called ASTRAL-Pro to accommodate datasets with high GDL rates, showing that ASTRAL-Pro is more accurate than alternatives.
Finally, I turn to selective pressure, a process that phylogenetic methods often fail to model. To benchmark the performance of tools under selection, I develop DIMSIM, an efficient simulator for sequence evolution under selection. I apply DIMSIM to the B-cell affinity maturation process, which involves somatic hypermutation of B-cell sequences followed by selective pressure. My study reveals that phylogenetic reconstruction tools fail to capture key features of clonal tree expansion if applied naively but can be easily rescued by contracting short branches.
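
As an illustration of the branch contraction mentioned above, the following is a minimal Python sketch of collapsing short internal branches into polytomies. It is not taken from any of the tools named in this abstract: the `Node` class, the function names, and the threshold value are hypothetical, and adding the contracted branch length to the descendant branches (so that root-to-tip distances are preserved) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical rooted tree node; `length` is the branch above the node."""
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def contract_short_branches(node: Node, threshold: float) -> Node:
    """Return a copy of the tree in which every internal branch shorter
    than `threshold` is collapsed, attaching its children to its parent."""
    new_children: List[Node] = []
    for child in node.children:
        contracted = contract_short_branches(child, threshold)
        if not contracted.is_leaf() and contracted.length < threshold:
            # Collapse this branch: promote the grandchildren.
            # Assumption: the removed length is added to each of them.
            for grandchild in contracted.children:
                grandchild.length += contracted.length
                new_children.append(grandchild)
        else:
            new_children.append(contracted)
    return Node(node.name, node.length, new_children)


def to_newick(node: Node) -> str:
    """Serialize the tree as a Newick-style string without the final ';'."""
    if node.is_leaf():
        return f"{node.name}:{node.length:g}"
    inner = ",".join(to_newick(c) for c in node.children)
    return f"({inner}){node.name or ''}:{node.length:g}"


if __name__ == "__main__":
    # Toy tree: the branch above (A, B) is very short, the one above (C, D) is not.
    tree = Node(children=[
        Node(length=0.001, children=[Node("A", 1.0), Node("B", 1.0)]),
        Node(length=0.5, children=[Node("C", 1.0), Node("D", 1.0)]),
    ])
    print(to_newick(contract_short_branches(tree, threshold=0.01)))
```

In this toy tree only the branch above A and B falls below the threshold, so A and B become direct children of the root and the root turns into a polytomy, while the (C, D) clade is left intact. Terminal branches are never contracted in this sketch.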