Efficient Methods for Analysis of Ultra-Deep Sequencing Data
Thanks to continuous improvements in sequencing technologies, life scientists can now easily sequence DNA at depths of coverage in excess of 1,000x, especially for smaller genomes such as those of viruses, bacteria, or BAC/YAC clones. As “ultra-deep” sequencing becomes more common, it is expected to create new algorithmic challenges in the analysis pipeline.
In this dissertation, I explore the effect of ultra-deep sequencing data in two domains: (i) the problem of decoding reads to bacterial artificial chromosome (BAC) clones and (ii) the problem of de novo assembly of BAC clones. Using real ultra-deep sequencing data, I show that when the depth of sequencing increases beyond a certain threshold, sequencing errors make these two problems progressively harder (instead of easier, as one would expect with error-free data), and as a consequence the quality of the solution degrades as more data are added.
For the first problem, I propose an effective solution based on “divide and conquer”: the method ‘slices’ a large dataset into smaller samples of optimal size, decodes each slice independently, and then merges the results. For the second problem, I show for the first time that modern de novo assemblers cannot take advantage of ultra-deep sequencing data. I then introduce a new divide-and-conquer approach to the problem of de novo genome assembly in the presence of ultra-deep sequencing data.
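The slice-decode-merge strategy above can be illustrated with a minimal sketch. Everything here is hypothetical scaffolding: `decode_slice` stands in for the actual per-slice decoder, and the read-naming scheme is invented purely for the example.

```python
# Hypothetical sketch of the "divide and conquer" decoding strategy:
# partition an ultra-deep read set into smaller slices, decode each
# slice independently, then merge (union) the per-slice results.
import random

def decode_slice(reads):
    # Placeholder for the real per-slice decoder: here a read named
    # "BAC<k>:read<i>" is trivially "decoded" to clone "BAC<k>".
    return {read.split(":")[0] for read in reads}

def slice_and_decode(reads, slice_size, seed=0):
    reads = list(reads)
    random.Random(seed).shuffle(reads)  # randomize before slicing
    assignments = set()
    for i in range(0, len(reads), slice_size):
        assignments |= decode_slice(reads[i:i + slice_size])
    return assignments

reads = [f"BAC{i % 3}:read{i}" for i in range(12)]
print(sorted(slice_and_decode(reads, slice_size=4)))
# prints ['BAC0', 'BAC1', 'BAC2']
```

The key design point is that each slice is small enough to decode reliably despite sequencing errors, while the final union recovers the assignments supported across slices.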
Finally, I report on a novel computational protocol to discover high-quality SNPs for the cowpea genome. I show how knowledge of the approximate SNP order can be used to order and merge BAC clones and whole-genome shotgun (WGS) contigs.