Nanopore sequencing technology represents a significant leap forward in the ability to read long fragments of DNA, up to 4 million bases, far beyond traditional short-read sequencing methods, which read only a few hundred bases. Despite its potential, nanopore sequencing is challenged by high error rates (5% to 15%). In this dissertation, we present a comprehensive examination of computational approaches that address this challenge and enhance the utility of nanopore sequencing in genomic analysis, using an underlying physics-based model of nanopore sequencers to guide our methods.
First, we present a mathematical model of the “nanopore channel”, which takes a DNA sequence as input and outputs the current variations observed in a nanopore sequencer. This model accounts for impairments such as inter-symbol interference, insertions and deletions, channel fading, and random responses. The model also provides insights into the error profiles of the nanopore sequencer, which can be used to develop algorithms for downstream applications. We further study bounds on the information extraction capacity of nanopore sequencers; these bounds provide benchmarks for existing base-calling algorithms and guidelines for designing improved nanopores.
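As a rough illustration of the kind of channel described above, the following toy simulator maps a DNA sequence to a list of current samples. It is a minimal sketch under stated assumptions, not the model developed in this dissertation: the function name `simulate_channel`, the per-base weights, the k-mer length, and all probabilities and noise parameters are hypothetical, and each impairment is represented in the simplest way available (a k-mer window for inter-symbol interference, skipped k-mers and random dwell times for deletions and insertions, a random gain for fading, and additive Gaussian noise).

```python
import random

# Hypothetical per-base contributions to the current (made-up values).
BASE_WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def simulate_channel(seq, k=3, p_del=0.05, max_dwell=4,
                     fade_sd=0.1, noise_sd=0.2, seed=0):
    """Toy nanopore channel: DNA string in, list of current samples out."""
    rng = random.Random(seed)  # local generator keeps runs reproducible
    samples = []
    for i in range(len(seq) - k + 1):
        # Deletion: the k-mer passes through without producing any sample.
        if rng.random() < p_del:
            continue
        # Inter-symbol interference: the level depends on all k bases
        # in the window, not on a single base.
        level = sum(BASE_WEIGHT[b] for b in seq[i:i + k]) / k
        # Fading: a random multiplicative gain drawn once per k-mer.
        gain = rng.gauss(1.0, fade_sd)
        # Random dwell time: a k-mer may emit several samples, which a
        # decoder can mistake for inserted symbols.
        dwell = rng.randint(1, max_dwell)
        for _ in range(dwell):
            # Additive Gaussian measurement noise on every sample.
            samples.append(gain * level + rng.gauss(0.0, noise_sd))
    return samples


if __name__ == "__main__":
    print(simulate_channel("ACGTTGCAAC")[:5])
```

Switching off every impairment (`p_del=0`, `max_dwell=1`, `fade_sd=0`, `noise_sd=0`) reduces the output to one noiseless level per k-mer, which makes it easy to see what each impairment adds.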
Our first main algorithmic work introduces QAlign, a preprocessing tool that improves the accuracy and efficiency of long-read aligners by converting nucleotide reads into discretized current levels. This transformation captures the error characteristics of nanopore sequencers identified by the channel model, raising read-to-reference alignment rates for nanopore reads from around 80% to 90%, improving overlap quality for read-to-read alignments, and significantly improving read-to-transcriptome alignment rates across multiple datasets.
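The idea of converting nucleotide reads into discretized current levels can be sketched as follows. This is an illustrative sketch only and is not QAlign's implementation: the k-mer-to-current table is a made-up stand-in for a real pore model, and the names `toy_pore_model`, `level_thresholds`, and `quantize_read`, the k-mer length, and the number of levels are all assumptions introduced here.

```python
from itertools import product

K = 3  # hypothetical k-mer length, kept small for illustration
BASE_WEIGHT = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}  # made-up values


def toy_pore_model(k=K):
    """Hypothetical k-mer -> mean current table (not a real pore model)."""
    table = {}
    for kmer in product("ACGT", repeat=k):
        # Position-weighted sum so that different k-mers get different levels.
        table["".join(kmer)] = sum(
            BASE_WEIGHT[b] * (i + 1) for i, b in enumerate(kmer)
        )
    return table


def level_thresholds(table, n_levels):
    """Cut points that split the table's currents into n_levels bins."""
    values = sorted(table.values())
    return [values[len(values) * i // n_levels] for i in range(1, n_levels)]


def quantize_read(read, table, thresholds, k=K):
    """Map each k-mer of the read to a discrete current level (0-based)."""
    levels = []
    for i in range(len(read) - k + 1):
        current = table[read[i:i + k]]
        # The level is the number of thresholds at or below the current.
        levels.append(sum(current >= t for t in thresholds))
    return levels


if __name__ == "__main__":
    model = toy_pore_model()
    cuts = level_thresholds(model, n_levels=3)
    print(quantize_read("ACGTTGCA", model, cuts))
```

In this sketch, two k-mers with similar modeled currents map to the same level, so a base-calling error that confuses them leaves the quantized sequence unchanged. The resulting level sequence could then be passed to an aligner in place of the nucleotide sequence.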
Our second main algorithmic work focuses on the detection of structural variants (SVs) using nanopore-sequenced reads. We present HQAlign, an aligner that leverages the physics of nanopore sequencing together with SV-specific modifications to enhance alignment accuracy. HQAlign demonstrates a 4% to 6% improvement in detecting complementary SVs compared with the minimap2 aligner, along with substantial improvements in breakpoint accuracy and in overall read-to-reference alignment rates compared with both QAlign and minimap2.
The final algorithmic work addresses the challenge of identifying heterozygous variants from highly erroneous nanopore read data, as a step toward algorithms for diploid genome assembly. We propose an algorithm that identifies heterozygous variants with a recall of 90% and a precision of 70%, facilitating the reconstruction of diploid genomes without additional reference information or preliminary draft assemblies.
Collectively, these studies advance the understanding and application of nanopore sequencing technology, offering novel computational methods to mitigate high error rates and improve genomic analyses, including alignment, structural variant detection, and diploid genome assembly.