Article
Peer Reviewed

Higher classification sensitivity of short metagenomic reads with CLARK-S.

UC Riverside Previously Published Works (2016)

The growing number of metagenomic studies in medicine and environmental sciences is creating increasing demands on the computational infrastructure designed to analyze these very large datasets. Often, the construction of ultra-fast and precise taxonomic classifiers can compromise on their sensitivity (i.e. the number of reads correctly classified). Here we introduce CLARK-S, a new software tool that can classify short reads with high precision, high sensitivity and high speed.

Availability and implementation

CLARK-S is freely available at http://clark.cs.ucr.edu/

Contact

stelo@cs.ucr.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Thesis
Peer Reviewed

Computing the Microbiome: Faster, More Accurate and More Efficient Methods for the Analysis of Metagenomes

Ounit, Rachid
Advisor(s): Lonardi, Stefano

UC Riverside Electronic Theses and Dissertations (2017)

Metagenomics is revolutionizing microbial ecology and has unlocked unprecedented opportunities in many domains of Life Science. For instance, metagenomics has allowed the discovery of new forms of life in unexplored habitats (e.g., in the marine environment). In medicine, metagenomics is allowing doctors to diagnose and help patients that have diseases related to imbalances in their microbial communities (e.g., gastrointestinal microbiota). In public health, metagenomics is becoming an invaluable instrument for pathogen surveillance and to monitor outbreaks in epidemic areas.

As sequencing technologies have considerably improved in speed and cost over the past decade, the number of reference sequences in public databases has grown exponentially. As a consequence, faster, more accurate and more efficient computational methods are needed for analyzing these large datasets. The research presented in this dissertation focuses on (i) how to build faster, more accurate and more efficient sequence classification methods to determine the microbial composition of metagenomic samples and (ii) how to infer and recover the microbial composition of a sample in a large network of connected samples (e.g., in the context of city-scale biosurveillance).

Our classification system is composed of a family of tools, namely CLARK, CLARK-l and CLARK-S, which are currently used by several research teams worldwide for metagenomics and genomics analysis. While CLARK performs sequence classification with high accuracy and unprecedented speed, CLARK-S achieves the same precision and a much higher accuracy than CLARK, at the cost of a slightly slower speed.

Creative Commons 'BY-NC' version 4.0 license

Article
Peer Reviewed

BRAT-nova: fast and accurate mapping of bisulfite-treated reads

UC Riverside Previously Published Works (2016)

In response to increasing amounts of sequencing data, faster and faster aligners need to become available. Here, we introduce BRAT-nova, a completely rewritten and improved implementation of the mapping tool BRAT-BW for bisulfite-treated reads (BS-Seq). BRAT-nova is very fast and accurate. On the human genome, BRAT-nova is 2-7 times faster than state-of-the-art aligners, while maintaining the same percentage of uniquely mapped reads and space usage. On synthetic reads, BRAT-nova is 2-8 times faster than state-of-the-art aligners while maintaining similar mapping accuracy, methylation call accuracy, methylation level accuracy and space efficiency.

Availability and implementation

The software is available in the public domain at http://compbio.cs.ucr.edu/brat/

Contact

elenah@cs.ucr.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

Article
Peer Reviewed

A Metagenomic Analysis of Environmental and Clinical Samples Using a Secure Hybrid Cloud Solution.

UC Riverside Previously Published Works (2019)

The number and types of studies about the human microbiome, metagenomics and personalized medicine, and clinical genomics are increasing at an unprecedented rate, leading to computational challenges. For example, the analysis of patient/clinical samples requires methods capable of (i) accurately detecting pathogenic organisms, (ii) running with high speed to allow short response-time and diagnosis, and (iii) scaling to ever-growing databases of reference genomes. While cloud computing has the potential to offer low-cost solutions to these needs, serious concerns regarding the protection of genomic data exist due to the lack of control and security in remote genomic databases. We present a novel metagenomic analysis system called "Virgile" that is capable of performing privacy-preserving queries on databases hosted in outsourced servers (e.g., public or cloud-based). This method takes as input the sequenced data produced by any modern sequencing instrument (e.g., Illumina, PacBio, Oxford Nanopore) and outputs the microbial profile using a database of whole genome sequences (e.g., the RefSeq database from NCBI). The algorithm for the microbial profile aims to estimate without bias the abundance of microorganisms present using a genome-centric approach. Results: Using an extensive set of 65 simulated datasets, negative and positive controls, real clinical samples, and mock communities, we show that Virgile identifies and estimates the abundance of organisms present in environmental or clinical samples with high accuracy compared to state-of-the-art and popular methods available, including MetaPhlAn2 and KrakenUniq. Running at high speed, Virgile can also be run on a standard 8 GB RAM laptop. Virgile is a novel privacy-preserving abundance estimation algorithm that can efficiently and rapidly discern the abundance and taxonomic identification of organisms present in a metagenomic sample, including those from medical environments. To the best of our knowledge, Virgile is the only metagenome analysis system leveraging cloud computing in a secure manner.

Article
Peer Reviewed

CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers

UC Riverside Previously Published Works (2015)

Background

The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce.

Results

We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions.

Conclusions

CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/ .
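The title's notion of "discriminative k-mers" can be illustrated with a toy sketch. It assumes that a discriminative k-mer is one occurring in exactly one target sequence, and that a read is assigned to the target collecting the most such hits. The function names, the tie handling and the tiny sequences are hypothetical and chosen for illustration only; this is not CLARK's actual implementation or interface.

```python
from collections import Counter


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_discriminative_index(targets, k):
    """Map each k-mer that occurs in exactly one target to that target.

    `targets` is a dict {target_name: reference_sequence} (hypothetical input format).
    """
    owner = {}
    shared = set()  # k-mers seen in more than one target
    for name, seq in targets.items():
        for kmer in kmers(seq, k):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != name:
                # Seen in another target: no longer discriminative.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = name
    return owner


def classify(read, index, k):
    """Assign a read to the target with the most discriminative k-mer hits.

    Returns None when there are no hits or the top two targets tie
    (an assumption of this sketch).
    """
    hits = Counter(
        index[read[i:i + k]]
        for i in range(len(read) - k + 1)
        if read[i:i + k] in index
    )
    if not hits:
        return None
    (best, best_count), *rest = hits.most_common(2)
    if rest and rest[0][1] == best_count:
        return None
    return best


# Toy usage with made-up sequences.
targets = {"A": "ACGTACGTTT", "B": "ACGTGGGGCC"}
index = build_discriminative_index(targets, k=4)
print(classify("CGTTT", index, k=4))
print(classify("GGGGCC", index, k=4))
```

In this toy setting the k-mer ACGT occurs in both targets, so it is dropped from the index and a read consisting only of it stays unclassified.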

Article
Peer Reviewed

rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison

UC Riverside Previously Published Works (2016)

Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
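The idea of a binary pattern of match and don't-care positions can be sketched as follows. This is a minimal illustration, assuming a pattern written as a string of '1' (match) and '0' (don't-care) characters; the function names and example sequences are hypothetical and are not taken from rasbhari or Spaced Words.

```python
def spaced_words(seq, pattern):
    """Return the spaced words of seq under a binary pattern.

    '1' marks a match position (the character is kept),
    '0' marks a don't-care position (the character is skipped).
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    keep = [i for i, c in enumerate(pattern) if c == "1"]
    span = len(pattern)
    return [
        "".join(seq[start + i] for i in keep)
        for start in range(len(seq) - span + 1)
    ]


def shared_spaced_words(a, b, pattern):
    """Return the spaced words that two sequences have in common."""
    return set(spaced_words(a, pattern)) & set(spaced_words(b, pattern))


# Two made-up sequences that differ at their second position.
a, b = "ACGT", "ATGT"
print(shared_spaced_words(a, b, "101"))  # spaced pattern with one don't-care position
print(shared_spaced_words(a, b, "111"))  # contiguous 3-mers for comparison
```

With the spaced pattern, the word starting at the first position still matches because the mismatch falls on the don't-care position, whereas no contiguous 3-mer is shared. Which patterns work best is the optimization problem the abstract describes; this sketch only shows how a single pattern is applied.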

Article
Peer Reviewed

Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

UC Riverside Previously Published Works (2017)

Background

One of the main challenges in metagenomics is the identification of microorganisms in clinical and environmental samples. While an extensive and heterogeneous set of computational tools is available to classify microorganisms using whole-genome shotgun sequencing data, comprehensive comparisons of these methods are limited.

Results

In this study, we use the largest-to-date set of laboratory-generated and simulated controls across 846 species to evaluate the performance of 11 metagenomic classifiers. Tools were characterized on the basis of their ability to identify taxa at the genus, species, and strain levels, quantify relative abundances of taxa, and classify individual reads to the species level. Strikingly, the number of species identified by the 11 tools can differ by over three orders of magnitude on the same datasets. Various strategies can ameliorate taxonomic misclassification, including abundance filtering, ensemble approaches, and tool intersection. Nevertheless, these strategies were often insufficient to completely eliminate false positives from environmental samples, which are especially important where they concern medically relevant species. Overall, pairing tools with different classification strategies (k-mer, alignment, marker) can combine their respective advantages.

Conclusions

This study provides positive and negative controls, titrated standards, and a guide for selecting tools for metagenomic analyses by comparing ranges of precision, accuracy, and recall. We show that proper experimental design and analysis parameters can reduce false positives, provide greater resolution of species in complex metagenomic samples, and improve the interpretation of results.
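Two of the strategies named in the Results, abundance filtering and tool intersection, can be sketched in a few lines. This is a hypothetical illustration, assuming each tool's output has been reduced to a dict of taxon name to read count; the threshold, the function names and the counts are invented for the example and do not come from the study.

```python
def abundance_filter(profile, min_fraction):
    """Keep only taxa whose share of all assigned reads is at least min_fraction."""
    total = sum(profile.values())
    if total == 0:
        return {}
    return {
        taxon: count
        for taxon, count in profile.items()
        if count / total >= min_fraction
    }


def intersect_tools(profiles):
    """Return the taxa reported by every tool in the list."""
    taxon_sets = [set(profile) for profile in profiles]
    return set.intersection(*taxon_sets) if taxon_sets else set()


# Made-up read counts from two hypothetical tools run on the same sample.
kmer_tool = {"E. coli": 900, "S. aureus": 95, "B. subtilis": 5}
marker_tool = {"E. coli": 40, "S. aureus": 10, "P. putida": 1}

filtered = abundance_filter(kmer_tool, min_fraction=0.01)
print(sorted(filtered))
print(sorted(intersect_tools([filtered, marker_tool])))
```

Both steps trade recall for precision: a low-abundance taxon that is truly present is discarded along with the false positives, which is consistent with the abstract's remark that such strategies do not fully eliminate errors.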

Article
Peer Reviewed

The genome of cowpea (Vigna unguiculata [L.] Walp.)

UC Riverside Previously Published Works (2019)

ABSTRACT

Cowpea (Vigna unguiculata [L.] Walp.) is a major crop for worldwide food and nutritional security, especially in sub-Saharan Africa, that is resilient to hot and drought-prone environments. A high-quality assembly of the single-haplotype inbred genome of cowpea IT97K-499-35 was developed by exploiting the synergies between single molecule real-time sequencing, optical and genetic mapping, and a novel assembly reconciliation algorithm. A total of 519 Mb is included in the assembled sequences. Nearly half of the assembled sequence is composed of repetitive elements, which are enriched within recombination-poor pericentromeric regions. A comparative analysis of these elements suggests that genome size differences between Vigna species are mainly attributable to changes in the amount of Gypsy retrotransposons. Conversely, genes are more abundant in more distal, high-recombination regions of the chromosomes; there appears to be more duplication of genes within the NBS-LRR and the SAUR-like auxin superfamilies compared to other warm-season legumes that have been sequenced. A surprising outcome of this study is the identification of a chromosomal inversion of 4.2 Mb among landraces and cultivars, which includes a gene that has been associated in other plants with interactions with the parasitic weed Striga gesnerioides. The genome sequence also facilitated the identification of a putative syntelog for multiple organ gigantism in legumes. A new numbering system has been adopted for cowpea chromosomes based on synteny with common bean (Phaseolus vulgaris).

Article
Peer Reviewed

The genome of cowpea (Vigna unguiculata [L.] Walp.)

UC Riverside Previously Published Works (2019)

Cowpea (Vigna unguiculata [L.] Walp.) is a major crop for worldwide food and nutritional security, especially in sub-Saharan Africa, that is resilient to hot and drought-prone environments. An assembly of the single-haplotype inbred genome of cowpea IT97K-499-35 was developed by exploiting the synergies between single-molecule real-time sequencing, optical and genetic mapping, and an assembly reconciliation algorithm. A total of 519 Mb is included in the assembled sequences. Nearly half of the assembled sequence is composed of repetitive elements, which are enriched within recombination-poor pericentromeric regions. A comparative analysis of these elements suggests that genome size differences between Vigna species are mainly attributable to changes in the amount of Gypsy retrotransposons. Conversely, genes are more abundant in more distal, high-recombination regions of the chromosomes; there appears to be more duplication of genes within the NBS-LRR and the SAUR-like auxin superfamilies compared with other warm-season legumes that have been sequenced. A surprising outcome is the identification of an inversion of 4.2 Mb among landraces and cultivars, which includes a gene that has been associated in other plants with interactions with the parasitic weed Striga gesnerioides. The genome sequence facilitated the identification of a putative syntelog for multiple organ gigantism in legumes. A revised numbering system has been adopted for cowpea chromosomes based on synteny with common bean (Phaseolus vulgaris). An estimate of nuclear genome size of 640.6 Mbp based on cytometry is presented.

Article
Peer Reviewed

Correction to: Comprehensive benchmarking and ensemble approaches for metagenomic classifiers

UC Riverside Previously Published Works (2019)

Following publication of the original article [1], the authors would like to highlight the following two corrections.
