Exploring the Protein Universe from General Principles
This dissertation is concerned with the construction and validation of an organizational framework for processing large protein sequence datasets. The framework relies on the accurate clustering of input sequences into functionally similar families. We demonstrate that the output quality of existing protein clustering techniques can be improved by running a simple edge weight selection heuristic prior to clustering. Once clustering is complete, we organize the data topologically by treating each cluster as a node in a network and searching for the union of minimum spanning trees that reconnects the clusters to each other. When organized in this way, the topological relationships between neighboring clusters exhibit properties similar to evolutionary relationships computed from phylogenetic models. We demonstrate how these topological relationships may be used to algorithmically identify the functionally significant residues within the sequences of the organized dataset. This predictive capacity of the organizational framework serves as a quantitative metric for validating the framework's biological significance.
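To make the reconnection step concrete, the following is a minimal sketch of one way to compute the union of all minimum spanning trees over a network of clusters. It is an illustration under stated assumptions, not the dissertation's implementation: the function name `union_of_msts`, the integer cluster identifiers, and the representation of inter-cluster relationships as `(u, v, weight)` tuples (lower weight meaning more closely related) are all hypothetical. The sketch relies on a standard property: an edge belongs to at least one minimum spanning tree exactly when its endpoints are not already connected by edges of strictly smaller weight.

```python
from collections import defaultdict


def union_of_msts(n_nodes, edges):
    """Return every edge that appears in at least one minimum spanning tree.

    n_nodes: number of clusters, identified as 0 .. n_nodes - 1 (assumed).
    edges:   iterable of (u, v, weight) tuples (assumed representation).
    Ties are detected by exact weight equality.
    """
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Group edges by weight so that ties are handled together.
    by_weight = defaultdict(list)
    for u, v, w in edges:
        by_weight[w].append((u, v))

    kept = []
    for w in sorted(by_weight):
        # Test every edge of this weight against the components formed
        # by strictly lighter edges, before merging any of them.
        candidates = [(u, v) for u, v in by_weight[w] if find(u) != find(v)]
        kept.extend((u, v, w) for u, v in candidates)
        # Only now merge the components joined by this weight class.
        for u, v in candidates:
            root_u, root_v = find(u), find(v)
            if root_u != root_v:
                parent[root_u] = root_v
    return kept
```

For example, with four clusters in which clusters 0, 1, and 2 are mutually equidistant, all three tied edges are retained, whereas a single minimum spanning tree would keep only two of them. Retaining tied edges avoids an arbitrary choice among equally supported connections between clusters.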