Integrating protein similarity networks and orthogonal information for understanding protein origins and function
- Author(s): Barber, Alan Edgel
- Advisor(s): Babbitt, Patricia C
- et al.
Biology's entrance into the genomic age has brought dramatic changes. Biologists once carried out painstaking, low-throughput experiments, but now often rely on massive high-throughput experimental centers and 'big data'. In modern biology, the quantity of data scientists can generate vastly outstrips their ability to analyze it and understand its full meaning. One of the most pressing current challenges is therefore to create methods that can manage, organize and visualize massive datasets, with the goal of assisting biologists in creating and testing hypotheses.
The computational solution presented in this dissertation is the protein similarity network (PSN), together with its implementation and usage. A PSN is constructed from an all-by-all pairwise comparison of a protein entity or feature, and the resulting network can then be visualized. These networks help show proteins of interest within their context, whether that context is sequence, structure or function, and help in creating hypotheses about how the data of interest relate to the much larger whole.
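The all-by-all construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Pythoscape's API: the function names, the toy sequences, the threshold value and the k-mer Jaccard score are all assumptions made for the example, and the Jaccard score merely stands in for whatever pairwise comparison (for example, an alignment-based score) a real PSN would use.

```python
from itertools import combinations


def kmer_set(sequence, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_similarity(seq_a, seq_b, k=3):
    """Toy similarity score: shared k-mers divided by all distinct k-mers."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def build_similarity_network(sequences, threshold=0.3, k=3):
    """Compare every pair of proteins and keep edges that meet the threshold.

    sequences: dict mapping a protein name to its amino acid sequence.
    Returns (nodes, edges), where edges maps (name_a, name_b) to a score.
    """
    nodes = sorted(sequences)
    edges = {}
    # combinations() visits each unordered pair exactly once (all-by-all).
    for name_a, name_b in combinations(nodes, 2):
        score = jaccard_similarity(sequences[name_a], sequences[name_b], k)
        if score >= threshold:
            edges[(name_a, name_b)] = score
    return nodes, edges


# Hypothetical toy sequences, invented purely for illustration.
toy_sequences = {
    "protA": "MKTAYIAKQR",
    "protB": "MKTAYIAKQL",
    "protC": "GGSWWPLHHN",
}
nodes, edges = build_similarity_network(toy_sequences, threshold=0.3)
```

In this sketch, proteins become nodes and an edge is drawn only when the pairwise score meets the chosen threshold, so raising the threshold breaks the network into tighter clusters. The number of comparisons grows quadratically with the number of proteins, which is one reason dedicated tooling is useful for large superfamilies.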
First, Pythoscape is presented, a novel software framework for the creation, modification and output of large PSNs. The architecture of the framework is described, along with an example that uses the glutathione transferase superfamily to show the power of the framework in investigating the sequence and structure relationships of large protein superfamilies.
Second, an application of Pythoscape to the alkaline phosphatase superfamily is presented. PSNs are used to generate evolutionary hypotheses for this large protein superfamily. These networks, in conjunction with phylogenetic trees, are used to propose an evolutionary model that can annotate protein function more accurately and that also demonstrates the complexity of evolution in large, mechanistically diverse enzyme superfamilies.
Finally, an application of Pythoscape to the kinase superfamily is presented. We use PSNs to study how members of this superfamily are targeted by caspases, proteases that are activated during apoptosis. This preliminary research demonstrates that sequence similarity and function do not always track each other, and that other, orthogonal sources of information may be necessary for accurate annotation.