Algorithms for Retention Time Alignment and Reference Construction of Mass Spectrometry Data
eScholarship
Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations


No data is associated with this publication.
Abstract

Mass spectrometry (MS) has been the main technology for high-throughput proteomics studies. However, even with recent advances, the reliability of MS for quantitative proteomics measurements remains a concern, as it is affected by factors such as differences in experimental settings and laboratory environments. Retention time (RT) alignment is therefore an indispensable step in large-scale proteomics, since it makes MS runs comparable. Many alignment methods have been proposed, but they usually do not scale to large proteomics datasets, which can contain thousands or millions of runs. Some methods are fast but are complicated or not well suited to aligning a broad range of tissues and samples; for example, they require the user to select an appropriate individual run as the reference to which all other runs are aligned.

In this dissertation, we propose a scalable approach to retention time alignment that addresses these issues. First, we leverage the large number of MS runs and samples available in public proteomics data repositories to build reference RT values for high-frequency precursors. The reference RTs of these precursors serve as anchor points that establish base RT correspondences and guide RT alignment. Second, we propose an ultra-fast RT alignment that uses these constructed reference RTs to adjust the retention times of peptides in any input run to the reference time scale. Our alignment scales to millions of MS runs and reduces retention time variation by a factor of 3.6 on average, and by up to a factor of 6.8 when aligning distant runs in the evaluation datasets. Its variation reduction is 4.9-65.9% better than that of the baseline alignments. We also present reference construction and extension methods, which provide a comprehensive record of aligned features and reference RT ranges for all detections of all identified precursors. This gives instant access to important statistics, e.g. precursor stability, overlap, and feature expansion and compression, for a comprehensive interpretation of large-scale proteomics data.
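
The two steps described in the abstract, building reference RTs for frequently observed precursors and then mapping a run onto the reference time scale through those anchors, can be illustrated with a minimal sketch. This is not the dissertation's algorithm, whose details are not given here. It assumes that each run is a dictionary from precursor identifier to observed RT, that the reference RT is the median across runs, that "high-frequency" means a precursor is seen in at least a chosen fraction of runs, and that alignment is piecewise-linear interpolation between anchors. The function names and the `min_fraction` parameter are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import median


def build_reference_rts(runs, min_fraction=0.5):
    """Median RT per precursor, keeping only precursors seen in at least
    min_fraction of the runs (an assumed definition of "high-frequency")."""
    observed = defaultdict(list)
    for run in runs:
        for precursor, rt in run.items():
            observed[precursor].append(rt)
    min_runs = min_fraction * len(runs)
    return {p: median(rts) for p, rts in observed.items() if len(rts) >= min_runs}


def fit_alignment(run, reference):
    """Sorted (observed RT, reference RT) anchor pairs for one run."""
    anchors = sorted((rt, reference[p]) for p, rt in run.items() if p in reference)
    if len(anchors) < 2:
        raise ValueError("need at least two anchor precursors to align a run")
    return anchors


def align_rt(rt, anchors):
    """Map an observed RT onto the reference scale by piecewise-linear
    interpolation; the first and last segments are extended for extrapolation."""
    xs = [x for x, _ in anchors]
    # Index of the right-hand anchor of the segment, clamped to a valid segment.
    i = min(max(bisect_right(xs, rt), 1), len(anchors) - 1)
    (x0, y0), (x1, y1) = anchors[i - 1], anchors[i]
    if x1 == x0:  # two anchors with the same observed RT: average their targets
        return (y0 + y1) / 2
    return y0 + (rt - x0) * (y1 - y0) / (x1 - x0)


# Toy data: three runs with a constant RT shift between them (made-up values).
runs = [
    {"PEPA/2": 10.0, "PEPB/2": 20.0, "PEPC/3": 30.0},
    {"PEPA/2": 12.0, "PEPB/2": 22.0, "PEPC/3": 32.0, "RARE/2": 40.0},
    {"PEPA/2": 11.0, "PEPB/2": 21.0, "PEPC/3": 31.0},
]
reference = build_reference_rts(runs)
anchors = fit_alignment(runs[1], reference)
aligned = {p: align_rt(rt, anchors) for p, rt in runs[1].items()}
```

Because the reference is built once and each run is aligned independently against it, no run has to be chosen as the reference and the cost per run does not depend on the number of other runs. In this sketch, a precursor that is not itself an anchor (here `RARE/2`) is still mapped onto the reference scale through the nearest anchor segment.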

Main Content

This item is under embargo until April 3, 2026.