Article
Peer Reviewed

Hybrid Clustering of Long and Short-read for Improved Metagenome Assembly

Joint Genome Institute (2021)

ABSTRACT

Next-generation sequencing has enabled metagenomics, the study of the genomes of microorganisms sampled directly from the environment without cultivation. We previously developed a proof-of-concept, scalable metagenome clustering algorithm based on Apache Spark to cluster sequence reads according to their species of origin. To overcome its tendency to under-cluster short-read sequences, in this study we developed a new, two-step Label Propagation Algorithm (LPA) that first forms clusters of long reads and then recruits short reads to these clusters. Compared to alternative label propagation strategies, this hybrid clustering algorithm (hybrid-LPA) yields significantly larger read clusters without compromising cluster purity. We show that adding a clustering step before assembly leads to improved metagenome assemblies, predicting more complete genomes from a synthetic metagenome dataset and more complete gene clusters from a real-world metagenome dataset. These results suggest that hybrid-LPA is a promising alternative to current metagenome assembly practice, offering benefits in both scalability and accuracy on large metagenome datasets.
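The two-step idea described above (cluster long reads first, then recruit short reads into those clusters) can be illustrated with a minimal, single-machine sketch. This is not the authors' Spark implementation: the read names, the overlap edges, the helper names `build_graph`, `propagate`, and `hybrid_lpa`, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_graph(edges):
    """Build a symmetric adjacency map from (read_a, read_b) overlap pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def propagate(graph, labels, active, max_iters=20):
    """Asynchronous label propagation restricted to the `active` nodes.

    Each active node adopts the most frequent label among its labeled
    neighbors (ties go to the smallest label so the run is deterministic).
    Nodes outside `active` keep their labels, i.e. they are frozen.
    """
    for _ in range(max_iters):
        changed = False
        for node in sorted(active):
            counts = Counter(labels[n] for n in graph.get(node, ()) if n in labels)
            if not counts:
                continue  # no labeled neighbor yet; the node stays as it is
            best = max(counts.values())
            new_label = min(lab for lab, c in counts.items() if c == best)
            if labels.get(node) != new_label:
                labels[node] = new_label
                changed = True
        if not changed:
            break
    return labels


def hybrid_lpa(graph, long_reads, short_reads):
    """Two-step sketch: cluster long reads, then recruit short reads."""
    long_set = set(long_reads)
    # Step 1: label propagation on the long-read-only subgraph.
    long_graph = {r: {n for n in graph.get(r, ()) if n in long_set} for r in long_set}
    labels = {r: r for r in long_set}  # every long read starts as its own cluster
    propagate(long_graph, labels, long_set)
    # Step 2: short reads start unlabeled and inherit labels from their
    # neighbors; long-read labels are frozen because they are not active.
    propagate(graph, labels, set(short_reads))
    return labels


# Hypothetical toy input: two long-read clusters and five short reads.
edges = [("L1", "L2"), ("L3", "L4"),
         ("s1", "L1"), ("s2", "s1"), ("s3", "L3"), ("s4", "s3")]
graph = build_graph(edges)
labels = hybrid_lpa(graph, ["L1", "L2", "L3", "L4"], ["s1", "s2", "s3", "s4", "s5"])
for read in sorted(labels):
    print(read, "->", labels[read])
```

In this toy input, short read `s2` overlaps no long read directly but is still recruited through `s1`, while the isolated read `s5` remains unassigned. Freezing the long-read labels in the second step is one simple way to keep short reads from merging clusters that the long reads kept apart.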

Availability and implementation

https://bitbucket.org/zhong_wang/hybridlpa/src/master/ .

Contact

zhongwang@lbl.gov

Article
Peer Reviewed

SpaRC: scalable sequence clustering using Apache Spark

LBL Publications (2019)

Motivation

Whole-genome-shotgun-based next-generation transcriptomics and metagenomics studies often generate 100-1000 GB of sequence data derived from tens of thousands of different genes or microbial species. Assembling these data sets requires tradeoffs between scalability and accuracy: current assembly methods optimized for scalability often sacrifice accuracy, and vice versa. An ideal solution would both scale and produce optimal accuracy for individual genes or genomes.

Results

Here we describe an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomes and metagenomes from both short and long read sequencing technologies. It achieves near-linear scalability with input data size and number of compute nodes. SpaRC can run on both cloud computing and HPC environments without modification while delivering similar performance. Our results demonstrate that SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar large-scale sequence data analysis problems.
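The abstract does not spell out how reads are partitioned, so the following is only a generic illustration of read clustering, not SpaRC's method or its Spark code. It assumes that two reads belong together when they share at least `min_shared` k-mers and groups them with a union-find structure; the function names, the parameters `k` and `min_shared`, and the toy reads are all hypothetical, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all length-k substrings of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=5, min_shared=2):
    """Group reads that share at least `min_shared` k-mers.

    `reads` maps read name -> sequence. Returns read name -> cluster
    representative (the smallest read name in the cluster).
    """
    parent = {name: name for name in reads}

    def find(x):
        # Path-halving find for the union-find forest.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # Inverted index: k-mer -> names of the reads that contain it.
    index = defaultdict(set)
    for name, seq in reads.items():
        for km in kmers(seq, k):
            index[km].add(name)

    # Count shared k-mers for every pair of reads that co-occur in the index.
    shared = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1

    # Merge the pairs that pass the threshold.
    for (a, b), count in shared.items():
        if count >= min_shared:
            union(a, b)
    return {name: find(name) for name in reads}


# Hypothetical toy reads drawn from two unrelated sequences.
reads = {
    "r1": "ACGTACGGTCA",
    "r2": "GTACGGTCATT",
    "r3": "TTTTGGGGCCCC",
    "r4": "TGGGGCCCCAA",
}
clusters = cluster_reads(reads)
for name in sorted(clusters):
    print(name, "->", clusters[name])
```

The inverted k-mer index avoids comparing every pair of reads directly. In a distributed setting, the index-building and pair-counting steps are the kind of work that would map onto a group-by-key operation.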

Availability and implementation

https://bitbucket.org/berkeleylab/jgi-sparc.

Article
Peer Reviewed

SCALABLE PARALLEL NUMERICAL METHODS AND SOFTWARE TOOLS FOR MATERIAL DESIGN

UC San Diego Previously Published Works (1995)

A new method for solving the local spin density approximation to the electronic Schrödinger equation is presented. The method is based on an efficient, parallel, adaptive multigrid eigenvalue solver. It is shown that adaptivity is both necessary and sufficient to accurately solve the eigenvalue problem near the singularities at the atomic centers. While preliminary, these results suggest that direct real-space methods may provide a much-needed means of efficiently computing the forces in complex materials.
