Proteins evolve through a process of divergence from common ancestors and subsequent selection, and so the evolutionary relationships shared by any set of proteins will affect the relationships among their sequences and structures. Comparative modeling, a strategy to predict protein structures, takes advantage of these evolutionary relationships and uses a template protein's structure to predict the structure of an evolutionarily related target protein.
Evolutionary relationships between proteins can be complex and are often difficult to deduce. Comparative modeling typically relies on a single template, even though a given protein usually has many related proteins that could serve as templates. Additionally, proteins can diverge through evolution in several different ways. Gene duplication events produce paralogs, whereas speciation events produce orthologs, and paralogs are generally less likely than orthologs to maintain the same function. The work in this dissertation focuses on how these complexities affect the sequence-structure relationships between proteins and on our ability to leverage those relationships to predict protein structure.
In the first study (Chapter 2), I examine the effects of orthology and paralogy on the relationship between sequence and structure in pairs of proteins. Using established methods for quantifying sequence-structure relationships, I compared those relationships between orthologous and paralogous pairs. I found that, at the same intermediate levels of sequence identity, orthologs are more structurally similar than paralogs. These results indicate that incorporating knowledge of orthology into comparative modeling could improve model accuracy.
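Comparing orthologs and paralogs at the same level of sequence identity requires a single, fixed definition of identity for every pair. The sketch below is one minimal way to compute percent identity from a pairwise alignment. It is an illustration only, not the method used in Chapter 2: the function name percent_identity, the "-" gap character, and the choice to divide by the number of columns in which both sequences have a residue are all assumptions, and other denominators (for example, the length of the shorter sequence) are also in common use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where both sequences have a residue.

    Hypothetical helper for illustration. Both inputs must be rows of the
    same pairwise alignment, so they must have equal length.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0  # columns with a residue in both sequences
    matches = 0   # compared columns where the residues are identical

    for res_a, res_b in zip(aligned_a, aligned_b):
        # Assumption: columns with a gap in either sequence are ignored.
        if res_a == gap or res_b == gap:
            continue
        compared += 1
        if res_a.upper() == res_b.upper():
            matches += 1

    # No comparable columns: report zero identity rather than dividing by zero.
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment, not real data.
    print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```

Pairs could then be grouped into identity bins so that the structural similarity of orthologs and paralogs is compared within the same bin. The bin boundaries would be a further assumption.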
In the second study (Chapter 3), I examine new ways of using multiple templates in comparative protein structure modeling. I established Modeller's baseline performance when modeling with multiple templates, compared that baseline with the performance obtained using a new method of weighting templates, and established an upper bound on the improvement possible from changing how template weights are determined. I found that weighting the contributions from multiple templates by statistical potentials, rather than by sequence similarity, did not provide significant improvements. These results further suggest that Modeller's current weighting scheme already performs close to optimally.