Both linguistics and biology face scientific questions that require reconstructing phylogenies and ancestral sequences from a collection of modern descendants. In linguistics, these ancestral sequences are the words that appeared in the protolanguages from which modern languages evolved. Linguists painstakingly reconstruct these words by hand using knowledge of the relationships between languages and the plausibility of sound changes. In biology, analogous questions concern the DNA, RNA, or protein sequences of ancestral genes and genomes. By reconstructing ancestral sequences and the evolutionary paths between them, biologists can make inferences about the evolution of gene function and the nature of the environment in which those genes evolved.
In this work, we describe several probabilistic models designed to attack the main phylogenetic problems (tree inference, ancestral sequence reconstruction, and multiple sequence alignment). For each model, we discuss the issues of representation, inference, analysis, and empirical evaluation.
Among the contributions, we propose the first computational approach to diachronic phonology that scales to large phylogenies. Sound changes and markedness are taken into account using a flexible feature-based unsupervised learning framework. Using this model, we attack a 50-year-old open problem in linguistics regarding the role of functional load in language change. We also introduce three novel algorithms for inferring multiple sequence alignments, and a stochastic process allowing joint, accurate, and efficient inference of phylogenetic trees and multiple sequence alignments.
Finally, many of the tools developed to do inference over these models are applicable more broadly, creating a transfer of ideas from phylogenetics into machine learning. In particular, the variational framework used for multiple sequence alignment extends to a broad class of combinatorial inference problems.