The problem of obtaining the full genomic sequence of an organism has been solved either via a global brute-force approach (called whole-genome shotgun) or by a divide-and-conquer strategy (called clone-by-clone). Both approaches have advantages and disadvantages in terms of cost, manual labor, and the ability to deal with sequencing errors and highly repetitive regions of the genome. With the advent of second-generation sequencing instruments, the whole-genome shotgun approach has become the preferred choice. The clone-by-clone strategy is, however, still very relevant for large, complex genomes. In fact, several research groups and international consortia have produced clone libraries and physical maps for many economically or ecologically important organisms and are now in a position to proceed with sequencing.
We recently proposed a BAC-by-BAC sequencing protocol that combines combinatorial pooling design and second-generation sequencing technology to efficiently approach de novo selective genome sequencing [30]. We showed that combinatorial pooling is a cost-effective and practical alternative to exhaustive DNA barcoding when preparing sequencing libraries for hundreds or thousands of DNA samples, in this case gene-bearing minimum-tiling-path BAC clones. The novelty of the protocol hinges on the computational ability to efficiently compare hundreds of millions of short reads and assign them to the correct BAC clones (decoding) so that the assembly can be carried out clone-by-clone. In this thesis, we address the problem of decoding and error-correcting pooled sequencing data obtained from such a protocol. Experimental results on simulated data for the rice genome as well as on real data for a gene-rich subset of the barley genome show that our decoding and error correction methods are very accurate, and the resulting BAC assemblies have high quality.
While our method cannot provide the level of completeness that one would achieve with a comprehensive whole-genome sequencing project, we show that it is quite successful in reconstructing the gene sequences within BACs. In the case of plants such as barley, this level of sequence knowledge is sufficient to support critical end-point objectives such as map-based cloning and marker-assisted breeding.