Open Access Publications from the University of California


UC San Francisco Electronic Theses and Dissertations

Exploring the Protein Universe from General Principles


This dissertation constructs and validates an organizational framework for processing large protein sequence datasets. The framework relies on accurately clustering the input sequences into functionally similar families. We demonstrate that the output quality of existing protein clustering techniques can be improved by running a simple edge weight selection heuristic before clustering. Once clustering is complete, we organize the data topologically by treating each cluster as a node in a network and searching for the union of minimum spanning trees that reconnects the clusters to one another. Organized in this way, the topological relationships between neighboring clusters exhibit properties similar to evolutionary relationships computed from phylogenetic models. We demonstrate how these topological relationships can be used to algorithmically identify the functionally significant residues within the sequences of the organized dataset. This predictive capacity of the organizational framework serves as a quantitative metric for validating the framework's biological significance.
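As a rough illustration of the "union of minimum spanning trees" step, the sketch below collects every edge that appears in at least one minimum spanning tree of a weighted cluster network. It is not the dissertation's implementation: the function name `union_of_msts`, the integer cluster labels, and the edge weights (read here as hypothetical inter-cluster dissimilarities) are all assumptions made for the example. It uses a Kruskal-style pass in which edges of equal weight are examined together, so that ties are kept rather than broken arbitrarily.

```python
from itertools import groupby


def union_of_msts(num_nodes, edges):
    """Return every edge that lies in at least one minimum spanning tree.

    num_nodes: number of clusters, labeled 0 .. num_nodes - 1 (assumed labeling).
    edges: iterable of (weight, u, v) tuples; weights are hypothetical
           dissimilarities between clusters u and v.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kept = []
    # Process edges in groups of equal weight, lightest first.
    for _, group in groupby(sorted(edges), key=lambda e: e[0]):
        group = list(group)
        # An edge is in some MST exactly when strictly lighter edges
        # have not already connected its two endpoints.
        candidates = [(w, u, v) for w, u, v in group if find(u) != find(v)]
        kept.extend(candidates)
        # Merge components only after the whole group has been tested.
        for _, u, v in candidates:
            parent[find(u)] = find(v)
    return kept


# Four hypothetical clusters; the three weight-1.0 edges form a tie.
example_edges = [
    (1.0, 0, 1), (1.0, 1, 2), (1.0, 0, 2),
    (2.0, 2, 3), (3.0, 0, 3),
]
network = union_of_msts(4, example_edges)
```

In this toy input all three tied edges among clusters 0, 1, and 2 are retained, since each belongs to some minimum spanning tree, while the heavier edge between clusters 0 and 3 is dropped because cluster 3 is already reachable through a lighter edge. Keeping ties is what distinguishes the union from a single, arbitrarily chosen spanning tree.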
