Efficient and accurate bioinformatics algorithms for peptide mass spectrometry
- Author(s): Tanner, Stephen Will
- et al.
Peptide tandem mass spectrometry has emerged as a key technology to detect and measure proteins in biological systems. A core problem is the interpretation of tandem mass spectra. The resulting spectrum annotations are then used to study post-translational modifications, disease biomarkers, protein-protein interactions, and subcellular localization. Technological breakthroughs have led to the generation of ever-increasing volumes of data. Experiments generating tens of millions of spectra are routine, and analyzing them effectively requires efficient algorithms. Filters using sequence tags, as implemented in the InsPecT software toolkit, allow spectra to be rapidly searched against a large proteomics database. The MS-Alignment algorithm addresses the still more challenging problem of interpreting mass spectra in the presence of unanticipated modifications. A key consideration is the efficient handling of large data volumes without the need for manual intervention.
In any high-throughput biological experiment, calculation of a false discovery rate is essential. The use of a decoy database of shuffled proteins is emerging as a key method for measuring false discovery rates. In addition, a decoy database allows a direct comparison of the quality of results from different search parameters, instrument settings, or software tools. We adopt a principled approach to correcting or filtering spurious annotations and experimental artifacts. A key idea is to focus on the error rate not at the level of individual spectra, but at the level of distinct peptides or modification sites. Results at this higher level can be made more accurate by integrating data across mass spectra.
Additional research is presented on the analysis of RNA microarrays. Here the goal is the identification of gene sets, such as members of a pathway, which are up- or down-regulated. The fundamental data are differential transcription levels, as measured by a t statistic. As with mass spectra, leveraging separate measurements (expression levels across many genes) improves accuracy. Computing the false positive rate in a principled way, with an appropriate null model, is vital for computing valid p-values.
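As a rough illustration of the decoy-database idea, the sketch below estimates a false discovery rate at the spectrum level and at the distinct-peptide level. It assumes a target and decoy database of equal size and uses the common estimate of decoy hits divided by target hits above a score threshold; this is one widely used convention and not necessarily the estimator used in the thesis. The `Match` type, the function names, the peptide sequences, and the scores are all hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """One spectrum annotation (hypothetical record layout)."""
    peptide: str
    score: float
    is_decoy: bool  # True if the peptide came from the shuffled decoy database


def best_per_peptide(matches: List[Match]) -> List[Match]:
    """Collapse spectrum-level matches to one best-scoring match per peptide."""
    best = {}
    for m in matches:
        current = best.get(m.peptide)
        if current is None or m.score > current.score:
            best[m.peptide] = m
    return list(best.values())


def decoy_fdr(matches: List[Match], threshold: float) -> float:
    """Estimate FDR as (decoy hits) / (target hits) at or above the threshold.

    Assumes the decoy database is the same size as the target database.
    """
    accepted = [m for m in matches if m.score >= threshold]
    decoys = sum(m.is_decoy for m in accepted)
    targets = len(accepted) - decoys
    return decoys / targets if targets else 0.0


# Invented example data: two spectra support the same target peptide.
matches = [
    Match("PEPTIDEK", 9.1, False),
    Match("PEPTIDEK", 7.4, False),
    Match("LSSPATLNSR", 8.2, False),
    Match("KEDITPEP", 6.0, True),
    Match("AAGHTR", 5.5, False),
    Match("RTHGAA", 3.0, True),
]

spectrum_fdr = decoy_fdr(matches, threshold=5.0)
peptide_fdr = decoy_fdr(best_per_peptide(matches), threshold=5.0)
print(f"spectrum-level FDR: {spectrum_fdr:.3f}")
print(f"peptide-level FDR:  {peptide_fdr:.3f}")
```

In this toy data the peptide-level estimate is higher than the spectrum-level one, because correct peptides tend to be supported by several spectra while spurious hits are usually seen only once. That is one reason the error rate is worth reporting at the level of distinct peptides.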