3D Motifs as Signatures of Protein Function and Evolution
The ability to predict a protein's function from its structure is becoming more important as international structural genomics projects release structures of proteins with no known function at an increasing pace. The function of a protein is frequently determined by relatively small regions of its overall structure. This dissertation investigates signature 3D motifs, small subsets of a protein's residues that capture the critical structural determinants of function shared by an entire group of proteins. First, in an investigation of randomly selected 3D motifs, I show that motifs built from important functional residues are better than any other motifs at assigning proteins to a superfamily with a common functional mechanism. Next, I develop a genetic algorithm, named GASPS, that chooses a motif based on its ability to identify a group of proteins. I demonstrate its effectiveness on four divergent superfamilies and on a convergent group of serine proteases. Again, the best motifs, this time chosen by GASPS, contain known functional residues. Chapter 3 investigates the use of a geometrical statistical model to predict the expected number of random matches to a motif. This simple geometrical model performs very well overall, but it under-predicts matches to motifs that arise from general physical and chemical characteristics of proteins, such as disulfide bridges and hydrophobic clusters. The model is therefore rejected for use in GASPS in favor of the original empirical method. Finally, I report a broad survey of signature 3D motifs, generated by applying GASPS to all available functionally similar and homologous groups of proteins. Motifs are mostly restricted to homologous groups, and a better motif is more likely in homologous and isofunctional groups.
I report on general trends in structural conservation and find that catalytic, ligand-binding, disulfide, and stabilized charged residues are over-represented among conserved motifs. Additionally, I find that glycine appears to be the most frequently conserved residue and is especially important in ligand-binding sites. This collection of motifs is useful for identifying the function of uncharacterized proteins, as well as for describing trends in protein evolution.