Protein Identification via Assembly of Tandem Mass Spectra
- Author(s): Guthals, Adrian Lewis
- et al.
High-throughput proteomics is made possible by the combination of modern mass spectrometry instruments, capable of generating many millions of tandem mass (MS² or MS/MS) spectra per day, and increasingly sophisticated software for their automated identification. Despite the growing collections of identified spectra and the routine generation of MS/MS data from related peptides, the mainstream approach to peptide identification is still the nearly two-decade-old practice of matching one MS/MS spectrum at a time against a database of protein sequences. These traditional approaches fail to identify spectra from unknown proteins, such as antibodies or proteins from organisms with unsequenced genomes. Furthermore, attempts to identify MS/MS spectra against large databases (e.g., the human microbiome or a 6-frame translation of the human genome) face a search space that is 10-100 times larger than the human proteome, where it becomes increasingly challenging to separate true from false peptide matches. First, we describe techniques that use networks of spectra from related peptides to rigorously compute the joint spectral probability of multiple spectra being matched to peptides with overlapping sequences, thus improving peptide identification by 30-62% against large search spaces. We then introduce methods that dramatically improve de novo sequencing of unknown proteins by using novel spectral network assembly algorithms and incorporating alternative MS/MS acquisition protocols. Finally, we describe an interesting end-goal biological problem for which the described advances in de novo sequencing can usher in a new era of therapeutic drug discovery.