Base-Calling of High-Throughput Sequencing Data Using a Random Effects Mixture Model
- Author(s): Cacho, Ashley
- Advisor(s): Cui, Xinping; Yao, Weixin; et al.
The emergence of high-throughput sequencing (HTS) technology has greatly influenced research in the biological sciences, including clinical applications such as understanding disease etiology and pharmacogenomics. One widely used sequencing platform is Illumina, which uses a novel sequencing-by-synthesis method involving chemical and optical imaging processes. Base-calling is the conversion of the fluorescence intensity measurements produced by image processing into nucleotide bases. The complex nature of sequencing-by-synthesis generates biases that affect the accuracy of the sequenced DNA, and these errors may directly influence downstream analyses such as genome assembly and variant detection. Many recently published base-calling methods transform the intensity data to reduce or eliminate these biases. Because such transformations can discard information, there is a need to model the original intensity data directly so that the information inherent in the data is retained. Our novel method, REMix, is based on a random effects mixture model and aims to capture the sequencing process while using the original data provided by the sequencing machine. Results on real data demonstrate that REMix achieves the best balance of performance across the validation metrics considered.