Algorithms and Data Structures for de novo Sequence Assembly
- Author(s): Alhakami, Hind
- Advisor(s): Lonardi, Stefano
- et al.
Despite the prodigious throughput of the sequencing instruments currently on the market, the assembly problem remains very challenging, mainly due to the repetitive content of large genomes, uneven sequencing coverage, and the presence of (non-uniform) sequencing errors and chimeric reads. The third generation of sequencing technology such as Pacific Biosciences and Oxford Nanopore offers very long reads at a higher cost per base, but sequencing error rate is much higher. As a consequence, the final assembly is very rarely entirely finished, with one solid sequence per chromosome. Instead, the typical output is an unordered/unoriented set of contiguous regions called contigs. We examine two different but related problems in this study; merging multiple assemblies produced using different assemblers/parameters, and stitching assembled BACs to create a genome-wide assembly.
The contribution of this dissertation is twofold. First, compact encoding of finite sets of strings is a classic problem. The manipulation of large sets requires compact data structures that allow for efficient set operations. We defined sequence decision diagrams (SeqDDs), which can encode arbitrary finite sets of strings over an alphabet.
Second, reassembly of existing overlapping contigs with the intent to produce a higher quality genome-wide assembly. Second, merge multiple assemblies to produce a higher quality consensus is a compelling problem. We conducted a comparative study of state of the art assembly reconciliation tools, with the intent to use them in assembling a set of approximately four thosands Vigna unguiculata (cowpea) assembled BACs. To accomplish this task, we developed Colored-Positioned de bruijn graph, a variant of the classic de bruijn graph to stitch overlapped assemblies.
In this Dissertation we studied and developed data structures and algorithms to merge overlapping assemblies. In particular: (1) Introduced sequence decision diagrams (SeqDDs) to enable compact encoding of finite sets of strings that allow for efficient set operations, among which detecting overlaps. (2) carried a comparative study of state of the art assembly reconciliation tools. and (3) developed tools to cluster overlapped BACs and assemble said clusters. Our assembler implements colored-positioned de bruijn graph, an augmented variant of the classic de bruijn graph, defined in this study.