The Description, Prediction and Misprediction of Protein Function
- Author(s): Schnoes, Alexandra Maria
- Advisor(s): Babbitt, Patricia C
- et al.
Understanding protein function is a key biological question that is the focus of much current research. This question has become increasingly complicated in the post-genomic era with the influx of millions of new sequences. We are no longer able to experimentally characterize even a small fraction of these protein sequences; nevertheless, this sequence data is a treasure trove of new, biologically important information. Currently, most sequences are annotated with protein functions predicted through computational methodologies. Computational prediction poses a number of issues, however. The concept of 'protein function' is nebulous in practice and can be defined in innumerable ways depending on the context. Methods for computational prediction all contain some level of inherent error, leading to an unknown level of functional misannotation in sequence databases. The common practice of annotating functions through annotation transfer (the transfer of an annotation from a 'known' sequence to an 'unknown' sequence, typically determined by sequence similarity) is potentially compounding the misannotation problem by introducing new errors and propagating errors already in the sequence databases. The thesis work presented here provides an in-depth analysis of the issue of protein misannotation. The levels of misannotation for 37 different enzyme families are examined in four public sequence databases (GenBank NR, UniProtKB/TrEMBL, KEGG and UniProtKB/Swiss-Prot) and shown to be significantly high. The types of misannotation found are detailed to provide some insight into the possible causes. Additionally, the level of misannotation is shown to be increasing over time. In conclusion, methodologies to ameliorate the misannotation problem are discussed, and several proposals are made for quickly finding new misannotations and communicating these findings to the scientific community.