The number of known protein sequences is growing faster than the number of curated protein functions. To help bridge this gap, bioinformatics scientists have created automated methods for the prediction of protein function. Recently, the focus has been on integrating numerous data sources, and critical evaluations of these methods show that the integrative approach improves predictive performance. However, a basic BLAST-based method is still a top contender.
Computational biologists often use two complementary approaches that usually infer function more accurately than a BLAST-based method. The first, analysis of sequence similarity networks, can dissect protein functions in a superfamily and infer the function of individual proteins. Briefly, a computational biologist will create a network of proteins in sequence space, which typically shows clusters of similar proteins. She will then highlight the few proteins that have experimental functional annotations, and paint the network according to other functional features that are broadly available, such as residues in key positions in an alignment. These data are used to identify proteins where a functional change may have occurred, which can then be used to delineate protein families or other protein groups that share a specific function or functional characteristic. However, molecular functional annotation data are very scarce, and there are not enough of them to draw functional boundaries with high confidence.
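The clustering step described above can be illustrated with a minimal sketch. It assumes pairwise similarity scores are already available (for example, from an all-against-all sequence comparison), and it uses a single hard score threshold and connected components as the clustering rule. The protein names, scores, threshold, and annotations are hypothetical, and the function name `cluster_by_similarity` is introduced here for illustration only; this is not the procedure used by any particular tool.

```python
from collections import defaultdict


def cluster_by_similarity(proteins, scores, threshold):
    """Group proteins into connected components of the thresholded network.

    proteins:  iterable of protein identifiers (network nodes)
    scores:    dict mapping (protein_a, protein_b) -> similarity score
    threshold: an edge is kept only if its score is >= threshold
    """
    # Union-find structure: every protein starts in its own cluster.
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    # Merge the clusters of every pair that passes the threshold.
    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for p in proteins:
        clusters[find(p)].add(p)
    # Sort members and clusters so the output is deterministic.
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical toy data: five proteins, four scored pairs, two annotations.
proteins = ["P1", "P2", "P3", "P4", "P5"]
scores = {
    ("P1", "P2"): 120.0,
    ("P2", "P3"): 95.0,
    ("P3", "P4"): 30.0,   # weak link, dropped at threshold 50
    ("P4", "P5"): 110.0,
}
annotations = {"P1": "kinase", "P5": "phosphatase"}

clusters = cluster_by_similarity(proteins, scores, threshold=50.0)

# "Highlight" the experimentally annotated members of each cluster.
labels_per_cluster = [
    sorted({annotations[p] for p in cluster if p in annotations})
    for cluster in clusters
]
```

In this toy example, lowering the threshold below 30 merges all five proteins into one cluster that contains both annotations, which illustrates why scarce labels make functional boundaries hard to place with confidence.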
The second method, analysis of genomic context, is often done in conjunction with sequence similarity network analysis. This approach uses data about the genome neighbors of a protein, or more generally, any functional association data, such as protein-protein interaction data, to predict a protein's molecular function. This technique has been used to refine functional boundaries during sequence similarity network analysis, as well as to generate hypotheses when no close homolog has been characterized.
In this dissertation, I describe Effusion, our attempt to automate sequence similarity network analysis and improve on the current methods for the prediction of protein function. Effusion modernizes the classical BLAST-based approach while avoiding pitfalls common to state-of-the-art methods. It uses a sequence similarity network to add context for homology transfer, a probabilistic model to account for the uncertainty in labels and function propagation, and the structure of the Gene Ontology to best utilize sparse input labels and make consistent output predictions. Effusion's model makes it practical to integrate rare experimental data with the abundant primary sequence and sequence similarity data. Our model allows for inference with general-purpose, state-of-the-art inference algorithms, makes use of all experimental annotation data, has parameters specific to each Gene Ontology term, and adds data-derived pseudocounts to predict rare terms.
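One common way to make predictions consistent with the structure of the Gene Ontology is to require that a term never scores higher than any of its ancestors. The sketch below illustrates that idea only; it is a simple post-processing rule and is not a description of Effusion's probabilistic model. The term names, the toy hierarchy, the scores, and the function names `ancestors` and `make_consistent` are all hypothetical.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` in a directed acyclic graph.

    parents: dict mapping a term to the list of its direct parent terms.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def make_consistent(scores, parents):
    """Raise each ancestor's score to at least the score of its descendants.

    After this step, a prediction for a specific term always implies an
    equally confident prediction for every more general term above it.
    """
    result = dict(scores)
    for term, score in scores.items():
        for anc in ancestors(term, parents):
            result[anc] = max(result.get(anc, 0.0), score)
    return result


# Hypothetical toy hierarchy (placeholder names, not real ontology identifiers).
parents = {
    "kinase_activity": ["catalytic_activity"],
    "catalytic_activity": ["molecular_function"],
    "binding": ["molecular_function"],
}

# Hypothetical raw scores for one protein; note the specific term
# scores higher than its parent, which is inconsistent.
raw_scores = {
    "kinase_activity": 0.8,
    "catalytic_activity": 0.5,
    "binding": 0.3,
}

consistent_scores = make_consistent(raw_scores, parents)
```

The same ancestor traversal can also be used in the other direction of the argument: a single experimental annotation to a specific term counts as evidence for all of its ancestors, which is one way sparse input labels can be stretched further.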
Effusion GCA extends Effusion by integrating the chief components necessary for automating genomic context analysis. It performs its analysis over a combined sequence similarity and functional association network, uses a model of protein function that includes a representation of each protein's biological process, performs simultaneous inference on multiple aspects of protein function, and propagates functional information only where it is appropriate.
We assessed our methods using a critical evaluation method and metrics. The results show that Effusion outperforms standard prediction methods, the most similar prediction methods, and state-of-the-art prediction methods. Effusion GCA does not perform as well as Effusion in aggregate, but offers several other insights. We conclude that these methods represent significant progress in the field of protein function prediction and clearly suggest avenues for further advances.