Advances in computational mass spectrometry : phosphoproteomics and proteogenomics
- Author(s): Payne, Samuel Harris
- et al.
The proteome is a dynamic group of proteins, interacting with and modifying each other in response to the environment. Tandem mass spectrometry has become the most convenient and high-throughput means of assaying the proteome. Modern instruments are capable of generating data for tens of thousands of peptides from thousands of proteins in a single experiment. In this work we present two important applications of proteomics: phosphoproteomics and proteogenomics. Protein signaling is dominated by reversible phosphorylation. Understanding which proteins are phosphorylated, when, where, and by whom is key to understanding most cellular signaling. A variety of obstacles make assaying phosphopeptides with tandem mass spectrometry a difficult task. First, phosphorylation is reversible and transitory. Therefore, although many proteins can be phosphorylated, very few are phosphorylated at any given time. Moreover, the phosphorylation event may be sub-stoichiometric. Thus only a small fraction of the peptides in a proteomic sample are phosphorylated. Experimental mass spectrometrists have overcome this with the adoption of phosphopeptide enrichment protocols. A sample containing perhaps 1% phosphopeptides can be purified to over 90% phosphopeptides. However, even with a high concentration of phosphorylated peptides, phosphoproteomics suffers from a second challenge: poor spectral quality. Spectra generated by phosphopeptides have low information content and are difficult to interpret. We present an approach for learning the features of phosphopeptide spectra and modeling these features in a Bayesian network. This probability model, when applied to the scoring function of Inspect, achieves a dramatic increase in sensitivity versus other peptide identification software.
The second field of study presented is proteogenomics. The task of annotating the genome for protein-coding genes is difficult and requires substantial effort. Yet this is arguably the most important outcome of the genomic era. Most annotation pipelines utilize nucleotide-centric information, such as cDNA or homology to known genes, to refine their computational predictions. Unfortunately, error rates are still suspected to be high, both in terms of genes which are mispredicted and genes which are wholly missing from the annotation. We present our work on utilizing peptides obtained from mass spectrometry to reannotate the genome. We collect a large corpus of MS/MS spectra from Arabidopsis thaliana and annotate spectra from 18,024 peptides which are not currently in the proteome. Using these peptides we present gene models for 778 genes missing from the current annotation, and refine or correct an additional 695 loci, showing that proteogenomics can dramatically improve the quality of a genome annotation.