Statistical Algorithms for High-throughput Biological Data
- Author(s): Jeong, Kyowon
- et al.
Recent advances in high-throughput technologies, such as tandem mass spectrometry (MS/MS) and next-generation sequencing (NGS), have enabled the acquisition of huge amounts of biological data containing whole genome/proteome scale information. However, because of the size and complexity of such data, their interpretation has become the bottleneck for further biological applications; many of the necessary computational algorithms and standardized statistical methods are still lacking. The development of efficient statistical algorithms has therefore become essential to analyze and access massive biological data. In this dissertation, statistical algorithms for peptide identification via MS/MS spectra and for somatic mutation profiling via NGS read data are presented.
Peptide/protein identification via mass spectrometry is an important task in proteomics studies. The two most widely used approaches are database search and de novo peptide sequencing. We first present UniNovo, a universal de novo peptide sequencing algorithm that works well for various types of spectra from different experimental protocols and MS instrument configurations. Next, we introduce MS-GappedDictionary, an algorithm that uses de novo sequences generated from tandem mass spectra to enable fast and sensitive searches of huge proteome databases, which have been prohibitively time-consuming with existing approaches. Lastly, we present a statistical method to validate the accuracy of false discovery rate (FDR) estimation in database searches and suggest a standard method for more accurate estimation of FDRs.
The latter part of this dissertation focuses on somatic mutation profiling via NGS read data. The goal of somatic mutation profiling is to identify somatic mutations, that is, genetic alterations that occur after conception. Since somatic mutations can (but do not always) cause cancer or other diseases, their identification is crucial for downstream disease studies. However, sensitive identification of somatic mutations is a hard task because they are extremely rare events (1-10 occurrences per megabase, i.e., per one million base pairs). We introduce a novel algorithm for identifying somatic mutations that incorporates the possible contamination of biological samples into the model. Using both simulated and experimental datasets, we demonstrate that our algorithm has higher sensitivity than other state-of-the-art algorithms.