Bonus Organisms in High-Throughput Eukaryotic Whole-Genome Shorgun Assembly
The DOE Joint Genome Institute has sequenced over 50 eukaryotic genomes, ranging in size from 15 MB to 1.6 GB, over a wide range of organism types. In the course of doing so, it has become clear that a substantial fraction of these data sets contains bonus organisms, usually prokaryotes, in addition to the desired genome. While some of these additional organisms are extraneous contamination, they are sometimes symbionts, and so can be of biological interest.Therefore, it is desirable to assemble the bonus organisms along with the main genome. This transforms the problem into one of metagenomic assembly, which is considerably more challenging than traditional whole-genome shotgun (WGS) assembly. The different organisms will usually be present at different sequence depths, which is difficult to handle in most WGS assemblers. In addition, with multiple distinct genomes present, chimerism can produce cross-organism combinations. Finally, there is no guarantee that only a single bonus organism will be present. For example, one JGI project contained at least two different prokaryotic contaminants, plus a 145 KB plasmid of unknown origin.We have developed techniques to routinely identify and handle such bonus organisms in a high-throughput sequencing environment. Approaches include screening and partitioning the unassembled data, and iterative subassemblies. These methods are applicable not only to bonus organisms, but also to desired components such as organelles. These procedures have the additional benefit of identifying, and allowing for the removal of, cloning artifacts such as E.coli and spurious vector inclusions.