Models of the evolution of DNA sequences typically assume that each position of the sequence evolves independently of all others. This assumption is unrealistic in most cases and is made for simplicity, for computational tractability, or because the nature of the dependence is not well understood. Proteins and RNAs present instances in which the three-dimensional structure of the molecule is essential for function and introduces dependence among sites in clearly defined ways. Here I explore models that can account for dependence among sites, use them to explore the evolution of DNA sequences containing dependence both within a population and between species, and develop a new substitution model that can be used to make inferences about the strength of natural selection acting on these sequences.
In the first chapter I demonstrate the importance of accounting for dependent evolution among sites for phylogenetic inference. Using a realistic model of the evolution of proteins and RNAs based on known structures, I simulate the evolution of DNA sequences in which the evolution at each site can depend on many other positions in the sequence. Using these simulated data I show that phylogenetic methods that assume sites evolve independently are impaired in their ability to infer the true topology relating the species, and I quantify the error in this estimation as a function of the strength of the dependence, the tree length, the topology, and the specific type of molecular structure. This underscores the importance of accounting for such dependent evolution among sites in studies of molecular evolution.
In the second chapter I explore the dynamics of the substitution process within a population rather than between species. One of the central questions when accounting for epistatic interactions among sites is how two changes that are neutral when taken together can spread in a population when either change in isolation is deleterious. This process of compensatory evolution has been explored by population genetics theory for the case in which natural selection acting against the intermediate state is very strong. Here I explore the case in which natural selection against the intermediate states is moderate to weak, using forward-time population genetic simulations of the simplest possible case of two dependent sites. I show that when selection is weak the two substitutions can be made one at a time, that as selection increases the substitutions are made more frequently in tandem, and that these patterns depend on population size, mutation rate, and recombination.
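As an illustration only, the two-site setting described above can be sketched as a haploid Wright-Fisher simulation. This is not the simulation code used in the chapter. The fitness scheme (both intermediates pay the same cost `s`, with `s` below 1), the symmetric mutation model, the crude "sequential" versus "tandem" label based on whether the population was ever monomorphic for an intermediate, and all function and parameter names are assumptions made for the sketch.

```python
import random

# A genotype is a pair of alleles (site 1, site 2): 0 = ancestral, 1 = derived.
# (0, 0) and (1, 1) are the neutral end states; (0, 1) and (1, 0) are intermediates.

def fitness(genotype, s):
    """Assumed scheme: intermediates (alleles differ) have fitness 1 - s, end states 1."""
    return 1.0 - s if genotype[0] != genotype[1] else 1.0

def next_generation(pop, s, mu, r, rng):
    """One Wright-Fisher generation: selection, then recombination, then mutation."""
    weights = [fitness(g, s) for g in pop]
    n = len(pop)
    # Parents are sampled with replacement in proportion to fitness.
    first_parents = rng.choices(pop, weights=weights, k=n)
    second_parents = rng.choices(pop, weights=weights, k=n)
    offspring = []
    for p1, p2 in zip(first_parents, second_parents):
        # With probability r, the second site is inherited from the other parent.
        child = (p1[0], p2[1]) if rng.random() < r else p1
        # Each site flips state independently with probability mu (assumed symmetric).
        child = tuple(a ^ 1 if rng.random() < mu else a for a in child)
        offspring.append(child)
    return offspring

def simulate(n, s, mu, r, seed=0, max_gens=100000):
    """Run until the population is monomorphic for (1, 1) or max_gens is reached.

    Returns (generation, label). The label is "sequential" if the population was
    monomorphic for an intermediate at some earlier generation, "tandem" if not,
    and "unresolved" (with generation None) if (1, 1) never became monomorphic.
    """
    rng = random.Random(seed)
    pop = [(0, 0)] * n
    intermediate_fixed = False
    for gen in range(1, max_gens + 1):
        pop = next_generation(pop, s, mu, r, rng)
        first = pop[0]
        if all(g == first for g in pop):
            if first[0] != first[1]:
                intermediate_fixed = True
            elif first == (1, 1):
                return gen, "sequential" if intermediate_fixed else "tandem"
    return None, "unresolved"

if __name__ == "__main__":
    # Hypothetical parameter values, chosen only so the run finishes quickly.
    for s in (0.0, 0.05, 0.2):
        print(s, simulate(n=50, s=s, mu=0.005, r=0.0, seed=1, max_gens=20000))
```

Repeating `simulate` over many seeds for each value of `s`, `mu`, `r`, and `n` and tabulating the labels would give a rough picture of how often the two substitutions occur one at a time versus together under these assumptions.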
In the third chapter I use the insights about the dynamics of compensatory evolution within a population from the second chapter to reexamine the evolution of dependent sites between species. I develop a new substitution model for the analysis of RNA that accounts for the probability of the different pathways to compensatory substitution. This model is interpretable, in that its parameters have direct meaning with respect to the strength of natural selection acting against deleterious intermediate states. I implement this model in a Bayesian framework for parameter estimation, and demonstrate its utility for making inferences about historical selective pressures on RNA sequences using a 5S ribosomal RNA dataset. This represents the first probabilistic evolutionary model that both accounts for dependent evolution among sites and connects population genetic dynamics with substitution patterns between species.
Taken together, these studies reveal a great deal about the nature of the evolutionary process when sites are not independent. They explore these processes both within a population and between species, and then use insights from one to better inform the other, attempting to connect these two historically separate approaches to the study of evolution. The advances here are not limited to RNA and proteins, but are generally applicable to any instance in which epistatic interactions can be found, from speciation genetics to the evolution of functional morphology.