eScholarship
Open Access Publications from the University of California

UC Davis

UC Davis Electronic Theses and Dissertations

Efficient Mapping-Free Methods for Discovery and Genotyping of Structural Variations

Abstract

Structural variants (SVs) account for a large share of sequence variability across genomes and play an important role in human evolution and disease. Despite massive efforts over the years, the discovery of SVs in individuals remains challenging due to the highly repetitive nature of the human genome and the existence of complex SVs. The dominant mapping-based framework for SV discovery has several drawbacks, including dependence on resource-intensive mapping algorithms and an increased possibility of error in repetitive regions of the genome due to ambiguous read mappings. As a result, new computational methods are needed that can genotype different types of SVs in both short- and long-read data with high accuracy.

In this thesis, we first propose, in Chapter 2, an ultra-efficient mapping-free approach for genotyping common structural variations from short Illumina reads. Our method, Nebula, generates databases of k-mers for catalogs of common SVs and counts these k-mers in unmapped samples to predict SV genotypes using a likelihood model of the k-mer counts. Nebula is the first method known to us that is capable of mapping-free SV genotyping directly from raw FASTQ files. We show that Nebula is not only an order of magnitude faster than mapping-based approaches for genotyping SVs, but also has accuracy comparable to state-of-the-art approaches. Furthermore, Nebula is a generic framework that is not limited to specific types of SVs.
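
To make the idea of k-mer-count genotyping concrete, the following is a minimal sketch, not Nebula's actual implementation. It assumes a Poisson model in which a k-mer unique to the SV allele is expected to appear about `depth` times for a homozygous carrier, `depth / 2` times for a heterozygous carrier, and close to zero times (a small hypothetical error rate `eps`) otherwise. The function names, the Poisson choice, and the parameters are illustrative assumptions, and reverse complements are ignored for brevity.

```python
import math
from collections import Counter


def count_kmers(reads, kmers, k):
    """Count occurrences of the selected k-mers in raw reads, without mapping."""
    wanted = set(kmers)
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in wanted:
                counts[kmer] += 1
    return counts


def poisson_logpmf(x, lam):
    """Log-probability of observing count x under a Poisson with mean lam."""
    return x * math.log(lam) - lam - math.lgamma(x + 1)


def genotype(counts, kmers, depth, eps=0.1):
    """Pick the genotype whose expected k-mer count best explains the data.

    Assumption: every k-mer in `kmers` occurs only on the SV allele.
    """
    expected = {"0/0": eps, "0/1": depth / 2, "1/1": depth}
    loglik = {
        g: sum(poisson_logpmf(counts.get(km, 0), lam) for km in kmers)
        for g, lam in expected.items()
    }
    return max(loglik, key=loglik.get)


# Toy usage: one SV-specific 4-mer, sequencing depth of 10.
sv_kmers = ["ACGT"]
reads = ["ACGTAC"] * 5  # the k-mer is seen 5 times
counts = count_kmers(reads, sv_kmers, k=4)
call = genotype(counts, sv_kmers, depth=10)
```

With five observations at depth 10, the heterozygous genotype has the highest likelihood under this toy model. A real genotyper would also need to handle k-mers shared with the reference allele, sequencing bias, and both strands.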

Next, in Chapter 3, we introduce the concept of substring-free sample-specific strings (SFS) as an effective tool for comparative variant discovery between pairs of accurate long-read sequencing samples (e.g., PacBio HiFi). The SFS are sequences that are specific to a genome (or, equivalently, its sequencing reads) with respect to another genome and that do not occur as substrings of one another. We then introduce the Ping-Pong algorithm for theoretically and practically efficient extraction of SFS between a pair of target and reference samples by building an FMD index of the reference sample and querying the reads of the target sample against this index. Ping-Pong is a mapping-free method and is therefore not hindered by the shortcomings of the reference genome and mapping algorithms. We show that Ping-Pong is capable of accurately finding SFS representing nearly all variation (>98%) reported across pairs or trios of WGS samples using PacBio HiFi data.
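
The SFS definition can be illustrated with a brute-force sketch on toy strings. This is not the Ping-Pong algorithm, which relies on an FMD index for efficiency; it uses plain substring search, treats the reference as a single string, and ignores reverse complements. The function name `sfs_bruteforce` is hypothetical. A substring of the read is reported when it is absent from the reference but every shorter substring of it is present, which guarantees that no reported string contains another.

```python
def sfs_bruteforce(read, reference):
    """Return the minimal substrings of `read` that do not occur in `reference`."""
    result = set()
    n = len(read)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = read[i:j]
            if s in reference:
                continue  # still shared with the reference, keep extending
            # s is the shortest specific string starting at i, so s[:-1]
            # occurs in the reference. It is minimal if s[1:] occurs too.
            if s[1:] in reference:
                result.add(s)
            break  # longer strings starting at i all contain s
    return result


# Toy usage: the read differs from the reference by one substitution (T -> G).
reference = "ACGTACGT"
read = "ACGGACGT"
specific = sfs_bruteforce(read, reference)  # {"GG", "GA"}
```

Here the two reported strings flank the substituted base, which hints at how such strings can act as local signatures of variation. The quadratic search is only suitable for illustration.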

Finally, in Chapter 4, we introduce SVDSS, a novel hybrid method for discovery of SVs from PacBio HiFi reads that combines the SFS concept with partial-order alignment (POA) and local assembly to yield highly accurate SV predictions. With experiments on three human samples, we show that SVDSS outperforms state-of-the-art methods for SV discovery on long-read data and achieves significant improvements in recall and precision, particularly when discovering SVs in repetitive and traditionally difficult regions of the genome.
