The birth-death process has been used to study the evolution of a wide variety of biological entities from genes to species. Much recent work has turned to detecting changes in the patterns of lineage splitting by comparing data to birth-death models in which the parameters vary between lineages or over time. Here, I develop methods to investigate how the birth-death process varies under three very different circumstances: changes in the pattern of taxon diversification through time; the effect of whole genome duplications on the pattern of chromosome gain and loss; and changes in the pattern of gene gain and loss on branches of a taxon tree. For all three cases I apply my methods to some real data.
For the last fifteen years researchers have studied the distribution of branching times of a phylogeny of extant taxa in order to detect temporal changes in the process of diversification. Theoretical work on this subject has been based on different implementations of the birth-death process and has proceeded along three basic lines: the comparison of actual branching times to a birth-death process; the inference of the effects of different birth-death processes on the distribution of branching times; and the derivation of analytical results that describe various aspects of different birth-death processes. In chapter 2 I make contributions to all three lines of research for the reconstructed time variable birth-death process.
Previous work had shown how to calculate the distributions of the number of lineages and of the branching times for a reconstructed constant-rate birth-death process that started with one or two reconstructed lineages at some time or ended with some number of lineages in the present. In chapter 2 I expand that work to include any time-variable birth-death process that starts with any number of reconstructed lineages and/or ends with any number of reconstructed lineages at any time. I also introduce the discrete-time birth-death process, which serves as an efficient and accurate numerical solution to any time-variable birth-death process and allows for the analytical incorporation of sampling and mass extinctions. Furthermore, I show how to simulate random trees under any of these models.
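As a rough illustration of what simulating under a time-variable birth-death process involves, the following minimal sketch tracks the number of lineages through time using the standard thinning (rejection) technique. It is not the discrete-time method of chapter 2, and it simulates the full process rather than the reconstructed one, so extinct lineages are counted while they are alive. The function name, its parameters, and the requirement that the caller supply an upper bound `rate_max` on the summed per-lineage rates are all assumptions made for this example.

```python
import random


def simulate_lineage_counts(birth, death, rate_max, t_end, n0=1, rng=None):
    """Simulate lineage counts under time-dependent per-lineage rates.

    birth, death: functions of time returning per-lineage rates.
    rate_max: assumed upper bound on birth(t) + death(t) over [0, t_end].
    Returns a list of (time, n_lineages) pairs, starting with (0.0, n0).
    """
    rng = rng or random.Random()
    t, n = 0.0, n0
    history = [(t, n)]
    while n > 0:
        # Candidate events arrive at the bounding rate n * rate_max.
        t += rng.expovariate(n * rate_max)
        if t >= t_end:
            break
        b, d = birth(t), death(t)
        u = rng.random() * rate_max
        if u < b:
            n += 1          # accepted as a lineage split
        elif u < b + d:
            n -= 1          # accepted as a lineage extinction
        else:
            continue        # rejected candidate, nothing happens
        history.append((t, n))
    return history
```

For example, `simulate_lineage_counts(lambda t: 0.5 * (1 + t / 3), lambda t: 0.3, 1.5, 3.0)` simulates a process whose birth rate rises linearly over three time units while the death rate stays constant.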
In order to compare phylogenetic trees to these models, I use these methods to calculate two statistics that describe the fit of a set of branching times to any time-variable birth-death model: the maximum likelihood, which can be compared to the distribution of the maximum likelihood for a random sample of trees or to the maximum likelihood of other birth-death models using the Akaike Information Criterion; and the Kolmogorov-Smirnov test, which is based on the fact that the branching times should be independently and identically distributed under many time-variable birth-death models. I also demonstrate two new methods for visualizing the distribution of branching times: the lineage through time null plot uses a heat map to show the distribution of the number of lineages at different times; and the waiting time null plot does the same for waiting times between branching times. These plots can be used either to see how different time-variable birth-death processes affect these distributions or to compare a data set to any time-variable birth-death process. I use all these methods to analyze two data sets of reconstructed taxon branching times.
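The two statistics can be illustrated with a minimal sketch, assuming that the maximized log-likelihood of each model and the cumulative distribution function of a single branching time under the model are already available. The function names are hypothetical, and the model-specific quantities are not derived here.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a preferred model."""
    return 2 * n_params - 2 * log_likelihood


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic.

    sample: observed values, here assumed to be branching times.
    cdf: model cumulative distribution function of one value (assumed given).
    Returns the largest gap between the empirical and model CDFs.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

A model with a maximized log-likelihood of -10 and two free parameters has an AIC of 24; a competing model is preferred only if its AIC is lower. Where SciPy is available, `scipy.stats.kstest` computes the same statistic together with a p-value.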
The study of paleopolyploidies requires the comparison of multiple whole genome sequences. If researchers could identify the branch of a phylogeny on which a whole genome duplication occurred before sequencing the genomes of multiple taxa, then they could select taxa that would give them a better picture of that whole genome duplication. In chapter 3 I describe a likelihood model in which the number of chromosomes in a genome evolves according to a Markov process with three stochastic rates: a rate of chromosome duplication and a rate of chromosome loss that are proportional to the number of chromosomes in the genome; and a rate of whole genome duplication that is constant. I implemented software that calculates the maximum likelihood under this model for a phylogeny of taxa in which the chromosome counts are known. I compare the maximum likelihood of a model in which the genome duplication rate is free to vary to that of a model in which it is fixed at zero using the Akaike Information Criterion, in order to determine if a model with whole genome duplications is a good fit for the data. Once it has been determined that the data fit the model, I infer the phylogenetic position of paleopolyploidies by using this model to calculate the posterior probability that a whole genome duplication occurred on each branch of the taxon tree. I applied this model to a phylogeny of 125 molluscan taxa and inferred three places on that phylogeny where it is very likely that a whole genome duplication occurred: a single branch within the Hypsogastropoda; one of two branches at the base of the Stylommatophora; and one or two branches near the base of Cephalopoda.
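The three rates described above can be written as the rate matrix of a continuous-time Markov chain on chromosome counts. The sketch below is an illustration under stated assumptions, not the software described in chapter 3: the function name, the truncation of the state space at `max_n` chromosomes, and the rule that a genome never drops below one chromosome are all choices made for this example.

```python
def chromosome_rate_matrix(gain, loss, wgd, max_n):
    """Build the rate matrix Q for chromosome counts 1..max_n.

    gain, loss: per-chromosome rates of single-chromosome duplication and loss.
    wgd: constant rate of whole genome duplication (n -> 2n).
    Row and column index i corresponds to a genome with i + 1 chromosomes.
    Transitions that would leave the range 1..max_n are dropped (assumption).
    """
    q = [[0.0] * max_n for _ in range(max_n)]
    for n in range(1, max_n + 1):
        i = n - 1
        if n + 1 <= max_n:
            q[i][n] += gain * n        # n -> n + 1
        if n - 1 >= 1:
            q[i][n - 2] += loss * n    # n -> n - 1
        if 2 * n <= max_n:
            q[i][2 * n - 1] += wgd     # n -> 2n
        # Diagonal entry makes each row sum to zero.
        q[i][i] = -sum(q[i])
    return q
```

From one chromosome, a single duplication and a whole genome duplication both lead to two chromosomes, so their rates add in that entry. Transition probabilities along a branch of length t would follow from the matrix exponential of Q multiplied by t, which `scipy.linalg.expm` can compute where SciPy is available.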
Thanks to the wealth of readily available comparative genomic data, it has become apparent that gene family expansion and contraction is critical for the evolution of organisms. Several researchers have developed likelihood methods that use counts of genes in gene families from a number of taxa to deduce on which branches of the phylogenetic tree there has been an unusual amount of gene duplication or gene loss in that gene family. Gene family counts are readily available, but there is a great deal of information in the gene family tree that is unavailable when using gene counts alone. In chapter 4, I develop a method that uses the gene family tree to infer changes in the process of gene gain and loss on a taxonomic tree. This method relies on calculating the probability of a gene tree given a taxon tree and a set of birth-death parameters by which that gene tree evolves on the taxon tree. I use a reversible-jump MCMC to sample from the joint posterior distribution of a set of birth-death parameters and assignments of those parameters to the branches of a taxon tree given a gene tree and a taxon tree. Different assignments are compared using Bayes factors. I use simulations to show that this method has much more power than a method which relies only on counts of gene family members to determine if a gene family evolved by a different process on a pair of taxon branches, and whether that difference is a consequence of differences in the birth rate or the death rate.
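As a small illustration of comparing parameter assignments with Bayes factors, the sketch below estimates a Bayes factor from the output of a sampler that moves between models, such as a reversible-jump MCMC. It assumes that each posterior sample has already been labelled with the assignment it visited. The function name, the labels, and the default equal prior probabilities are hypothetical, and the sampler itself is not shown.

```python
from collections import Counter


def bayes_factor(sampled_models, model_a, model_b, prior_a=0.5, prior_b=0.5):
    """Estimate the Bayes factor of model_a over model_b.

    sampled_models: one model label per posterior sample.
    prior_a, prior_b: prior probabilities of the two models (assumed known).
    The posterior odds are estimated from visit frequencies and divided
    by the prior odds.
    """
    counts = Counter(sampled_models)
    if counts[model_a] == 0 or counts[model_b] == 0:
        raise ValueError("both models must be visited to estimate a ratio")
    posterior_odds = counts[model_a] / counts[model_b]
    prior_odds = prior_a / prior_b
    return posterior_odds / prior_odds
```

If a chain spends 80 of 100 samples in an assignment labelled "split" and 20 in one labelled "shared", with equal prior probabilities, the estimated Bayes factor in favor of "split" is 4. The estimate is unreliable when either model is visited only a handful of times.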
In section 4.5 I expand my method to include uncertainty in the gene tree topology by using a set of gene alignments as my data rather than the fully resolved gene tree. Under this implementation I calculate the probability of those sequences given the gene tree, in addition to the probability of the gene tree given the taxon tree. I modify the reversible-jump MCMC so that it now samples from the posterior distribution of the nucleotide evolution parameters and the gene trees, in addition to the birth-death parameters and their assignments to the branches of the taxon tree. I demonstrate the use of this method on two real gene families found in the Bilateria. I found that a clade of 46 protein tyrosine kinase genes from three taxa is characterized by an increase in the gene duplication rate on the branch leading to Caenorhabditis elegans. Furthermore, a separate analysis of all the posterior Hox genes from nine taxa implies that their evolution has been characterized by massive gene loss throughout the Bilateria, with a lower rate of turnover in the chordates and at the base of the deuterostomes than is found in the protostomes or in the echinoderms.