De novo genome and transcriptome assemblies of next-generation sequencing (NGS) data are important for many genomics applications in unsequenced organisms. Both assembly types are challenging owing to (i) the large amount of data to process, (ii) sequencing errors and (iii) the complexity of transcriptomes and genomes. Transcriptome complexity results from alternative splicing events and from the variable and incomplete representation of transcripts in RNA-Seq libraries, whereas repetitive regions complicate genome assembly. De novo assemblies usually yield thousands of short and incomplete transcripts (transfrags) or genomic sequences (contigs or scaffolds) while requiring a large amount of processing time and memory. However, with decreasing NGS costs, reference genomes of many species have recently become available and can be used to guide and improve de novo assemblies. Here, we introduce two reference-assisted algorithms, BRANCH and AlignGraph. BRANCH improves transcriptome assemblies guided by genomic contigs from the same species or reference genes from a closely related species. AlignGraph improves genome assemblies with the help of a closely related reference genome. In addition, we introduce the short read clustering algorithm SEED, which is useful as a preprocessing tool for de novo assemblies because it reduces their time and memory requirements.
SEED joins sequences into clusters whose members can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the resulting hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When used as a preprocessing tool on genome/transcriptome assembly data, SEED reduced the time and memory requirements of the Velvet/Oases assembler, for the datasets used in this study, by 60-85% and 21-41%, respectively. In addition, the resulting assemblies contained longer contigs than assemblies of non-preprocessed data, as indicated by 12-27% larger N50 values. Compared to other clustering tools, SEED generated clusters of NGS data that were most similar to the true cluster results, and it did so 2- to 10-fold faster. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
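The seed-then-verify idea behind this kind of clustering can be illustrated with a minimal Python sketch. It is not SEED's implementation: the block layout, the greedy choice of the first unassigned read as cluster center, and the names `blocks`, `hamming` and `cluster_reads` are assumptions made for illustration, and overhanging residues are ignored. The sketch relies on the pigeonhole principle: if two equal-length reads differ by at most m mismatches and are split into m + 1 blocks, at least one block is identical, so hashing on blocks finds every valid neighbor.

```python
from collections import defaultdict


def blocks(seq, n_blocks):
    """Split seq into n_blocks contiguous blocks, tagged with their block index."""
    size = len(seq) // n_blocks
    out = []
    for i in range(n_blocks):
        # The last block absorbs any remainder.
        end = (i + 1) * size if i < n_blocks - 1 else len(seq)
        out.append((i, seq[i * size:end]))
    return out


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering sketch (hypothetical; not SEED's actual procedure)."""
    n_blocks = max_mismatches + 1
    index = defaultdict(list)  # (block index, block content) -> read ids
    for rid, seq in enumerate(reads):
        for key in blocks(seq, n_blocks):
            index[key].append(rid)

    assigned = [False] * len(reads)
    clusters = []
    for rid, seq in enumerate(reads):
        if assigned[rid]:
            continue
        assigned[rid] = True  # this read becomes the cluster center
        members = [rid]
        # Candidates share at least one exact block with the center.
        candidates = {c for key in blocks(seq, n_blocks) for c in index[key]}
        for cand in sorted(candidates):
            if assigned[cand] or len(reads[cand]) != len(seq):
                continue
            # Verification step: accept only true neighbors.
            if hamming(seq, reads[cand]) <= max_mismatches:
                assigned[cand] = True
                members.append(cand)
        clusters.append(members)
    return clusters


reads = [
    "ACGTACGTACGTACGT",
    "ACGTACGTACGTACGA",  # 1 mismatch to read 0
    "ACGAACGTTCGTACGA",  # 3 mismatches to read 0
    "TTTTTTTTTTTTTTTT",  # unrelated
]
print(cluster_reads(reads))
```

Hashing restricts the expensive pairwise comparison to reads that already share a block, which is why approaches of this kind can scale roughly linearly with the number of reads.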
BRANCH takes as input assembled RNA reads (transfrags), genomic sequences (e.g. contigs) and the RNA reads themselves. It uses a customized version of BLAT to align the transfrags and RNA reads to the genomic sequences. After identifying exons from the alignments, it defines a directed acyclic graph and maps the transfrags to paths on the graph. It then joins and extends the transfrags by solving a combinatorial optimization problem called the Minimum weight Minimum Path Cover with given Paths. In performance tests on real data from Caenorhabditis elegans and Saccharomyces cerevisiae, assisted by genomic contigs from the same species, BRANCH improved the sensitivity and precision of transfrags generated by Velvet/Oases or Trinity by 5.1-56.7% and 0.3-10.5%, respectively. These improvements added 3.8-74.1% complete transcripts and 3.8-8.3% proteins to the initial assembly. Similar improvements were achieved when BRANCH processed a transcriptome assembly from a more complex organism (mouse) guided by genomic sequences from a related species (rat).
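To give a feel for path covers on an exon graph, the following Python sketch solves the textbook unweighted, vertex-disjoint minimum path cover of a directed acyclic graph by bipartite matching. This is a deliberately simpler relative of the problem named above: it has no weights and no given paths, and it is not BRANCH's algorithm. The function name `min_path_cover` and the toy exon graph are hypothetical.

```python
def min_path_cover(n, edges):
    """Minimum vertex-disjoint path cover of a DAG with nodes 0..n-1.

    Textbook reduction: each path edge u->v is a matched pair in a bipartite
    graph, so the number of paths equals n minus the maximum matching size.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)

    match_to = [-1] * n  # match_to[v] == u means edge u->v is used by a path

    def try_augment(u, seen):
        # Standard augmenting-path search (Kuhn's algorithm).
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_to[v] == -1 or try_augment(match_to[v], seen):
                match_to[v] = u
                return True
        return False

    for u in range(n):
        try_augment(u, set())

    # Turn matched edges into successor links and walk each path from its start.
    succ = [-1] * n
    for v, u in enumerate(match_to):
        if u != -1:
            succ[u] = v
    paths = []
    for start in range(n):
        if match_to[start] == -1:  # no predecessor, so a path starts here
            path = [start]
            while succ[path[-1]] != -1:
                path.append(succ[path[-1]])
            paths.append(path)
    return paths


# Toy exon graph: exons 1 and 2 are alternative, both flanked by exons 0 and 3.
exon_edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(min_path_cover(5, exon_edges))
```

In a real transcriptome setting, isoforms share exons, so paths are generally allowed to overlap, and constraints such as weights and pre-mapped transfrag paths change the optimization. The sketch only shows the basic idea of explaining a graph with as few paths as possible.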
AlignGraph is an algorithm for extending and joining de novo assembled contigs or scaffolds guided by closely related reference genomes. It aligns paired-end (PE) reads and pre-assembled contigs or scaffolds to a close reference. From the obtained alignments, it builds a novel data structure, called the paired-end multi-positional de Bruijn graph. The incorporated positional information from the alignments and PE reads allows us to extend the initial assemblies, while avoiding incorrect extensions and early terminations. In our performance tests, AlignGraph was able to substantially improve the contigs and scaffolds from several assemblers. For instance, 28.7-62.3% of the contigs of Arabidopsis thaliana and human could be extended, resulting in improvements of common assembly metrics, such as an increase of the N50 of the extendable contigs by 89.9-94.5% and 80.3-165.8%, respectively. In another test, AlignGraph was able to improve the assembly of a published genome (Arabidopsis strain Landsberg) by increasing the N50 of its extendable scaffolds by 86.6%. These results demonstrate AlignGraph's efficiency in improving genome assemblies by taking advantage of closely related references.
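The benefit of attaching positional information to a de Bruijn graph can be sketched in a few lines of Python. This is a simplified illustration, not AlignGraph's data structure: it uses only a single alignment position per read, ignores paired-end information, and assumes gap-free alignments. The names `positional_kmers` and `build_positional_graph` and the tolerance parameter `delta` are hypothetical. Nodes are (k-mer, reference position) pairs, so the same k-mer occurring at two distant positions, as in a repeat, stays as two separate nodes instead of collapsing into one.

```python
from collections import defaultdict


def positional_kmers(read, ref_pos, k):
    """Yield (k-mer, reference position) for a read aligned gap-free at ref_pos."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], ref_pos + i


def build_positional_graph(aligned_reads, k, delta):
    """Build a toy positional de Bruijn graph (illustrative assumption only).

    Occurrences of the same k-mer are merged into one node only if their
    reference positions differ by at most delta.
    """
    positions = defaultdict(list)  # k-mer -> representative positions
    edges = set()

    def node_for(kmer, pos):
        for rep in positions[kmer]:
            if abs(rep - pos) <= delta:
                return (kmer, rep)  # reuse the nearby node
        positions[kmer].append(pos)
        return (kmer, pos)

    for read, ref_pos in aligned_reads:
        prev = None
        for kmer, pos in positional_kmers(read, ref_pos, k):
            cur = node_for(kmer, pos)
            if prev is not None:
                edges.add((prev, cur))
            prev = cur

    nodes = {(kmer, p) for kmer, ps in positions.items() for p in ps}
    return nodes, edges


aligned = [
    ("ACGTA", 0),    # first copy of the repeat ACGT
    ("CGTAT", 1),    # overlaps the first read
    ("ACGTC", 100),  # second copy of the repeat, far away
]
nodes, edges = build_positional_graph(aligned, k=3, delta=5)
print(len(nodes), len(edges))
```

In a plain de Bruijn graph the k-mers ACG and CGT from both repeat copies would share nodes, creating a branch that can cause incorrect extensions or early termination. Keeping them apart by position leaves two unbranched paths.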