UC San Diego
Genome Assembly of Long Error-Prone Reads Using De Bruijn Graphs and Repeat Graphs
- Author(s): Yuan, Jeffrey
- Advisor(s): Pevzner, Pavel
- Ren, Bing
- et al.
Genome assembly is the problem of reconstructing genomes from DNA sequence reads. Even the best assemblies are often fragmented due to the presence of repetitive regions in the genome. Long single-molecule sequencing (SMS) reads can improve the contiguity of these assemblies, but they still fail to resolve long repetitive regions. Furthermore, the high error rate of SMS reads poses additional difficulties for assembly, raising the question of whether the popular de Bruijn graph (DBG) approach to genome assembly can be applied to SMS reads.
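To illustrate why read errors complicate the DBG approach, the following is a minimal sketch of the generic textbook de Bruijn graph construction. It is not the A-Bruijn graph or the repeat graph described in this dissertation. The function names, the toy sequence, and the choice of k = 4 are all assumptions made for illustration. The sketch shows that a single substitution in a read can introduce up to k k-mers that do not occur in the genome.

```python
from collections import defaultdict


def kmers(reads, k):
    """Return the set of all k-mers occurring in the given reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for kmer in kmers(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Toy example (hypothetical sequence): one error-free read and one read
# carrying a single substitution (C -> G at 0-based position 6).
k = 4
clean_read = "ACGTTGCAAGCT"
noisy_read = "ACGTTGGAAGCT"

clean_kmers = kmers([clean_read], k)
noisy_kmers = kmers([noisy_read], k)

# k-mers that appear only because of the sequencing error
spurious = noisy_kmers - clean_kmers
print(sorted(spurious))

# The graph built from both reads contains an extra branch around the error.
graph = de_bruijn_edges([clean_read, noisy_read], k)
print(sorted(graph["TTG"]))
```

With the high error rates of SMS reads, most k-mers in a read are affected in this way for typical values of k, which is why a direct application of the construction above is problematic.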
First, I present ABruijn, the first genome assembler for SMS reads that follows the DBG approach. By modifying the DBG into an A-Bruijn graph, ABruijn is able to produce very polished assemblies for simple genomes such as E. coli and S. cerevisiae. However, ABruijn has some difficulties with processing very repetitive regions and very large genomes.
To address ABruijn’s shortcomings, I helped to develop Flye, a DBG-based assembler for SMS reads that can be applied to large mammalian genomes such as the human genome. Flye features a much more efficient method for resolving highly repetitive regions and also generates a repeat graph, which offers a compact representation of all of the repeats in a genome. Flye further performs steps to resolve those repeats and improve the quality of the assembly, resulting in a more contiguous assembly of the human genome compared to other state-of-the-art assemblers.
Finally, I present diploidFlye, a haplotype-aware extension of Flye that is able to phase the contigs for assemblies of diploid organisms. diploidFlye takes advantage of the repeat graph generated by Flye to efficiently identify heterozygous variants and generate haplocontigs (haplotype-specific contigs) from the reads.
Overall, this dissertation presents several novel algorithms that improve the de novo genome assembly of long SMS reads. It establishes the efficacy of the DBG approach even for error-prone SMS reads and develops Flye, a state-of-the-art assembler with many novel features for improving the overall assembly.