Leveraging Similar Regions to Improve Genome Data Processing
Though DNA sequencing has improved dramatically over the past decade, variant calling, the process of reconstructing a patient’s genome from the reads that sequencers produce, remains a difficult problem, largely because of the genome’s redundant structure. In this thesis, we describe SiRen, our algorithm for characterizing the genome’s structure in a way that makes sense from the perspective of the reads themselves. We use the term similar regions to refer to the areas of redundancy that SiRen identifies. We then confirm that variant calling accuracy is low within the similar regions. Finally, we show that the structure of the similar regions provides a platform for repairing alignment errors, which in turn leads to significantly improved variant calling accuracy.