Open Access Publications from the University of California

UC San Diego

UC San Diego Electronic Theses and Dissertations

Benchmarking and Acceleration of Machine Learning and Analytics Pipelines for Large Microbiome Datasets


Within the past decade, the number of publicly available microbiome sequencing samples has increased dramatically. Consequently, bottlenecks have arisen in common analysis steps, such as processing the sequencing data and characterizing the content of the microbial communities. Over the same period, new tools have been developed for steps such as alignment and dimensionality reduction that scale better or handle the additional complexity of high-dimensional data; however, their behavior on microbiome data had not previously been characterized. In this dissertation, we accelerate the analysis of microbiomes by introducing new methods and benchmarking alternatives. Additionally, we compare the results of novel methodology to existing best practices on gold-standard datasets to determine whether the methods adequately address the specific challenges of microbiome data.

In the first part of this work, Chapter 1 reviews many aspects of microbiome data that necessitate the use of microbiome-specific techniques for analyzing collections of microbial communities. Chapter 2 then introduces SFPhD, a novel approach for calculating phylogenetic alpha diversity that leverages the characteristics of microbiome data to speed up and reduce the memory requirements of a costly single-sample characterization.

In the second part of the work, we apply recently developed tools for machine learning and sequencing preprocessing to demonstrate their potential, respectively, for elucidating complex relationships in microbial data and for reducing the lead time for supporting clinical applications of metagenomic sequencing. Chapter 3 demonstrates how Uniform Manifold Approximation and Projection (UMAP) provides more succinct representations of data than Principal Coordinates Analysis (PCoA), the long-time standard method of microbial ecology. Importantly, UMAP provides different guarantees about the preservation of local and global geometry in its representation, so careful consideration should be given to its application. In Chapter 4, we show that the popular metagenomic preprocessing pipeline of Atropos for adapter trimming and Bowtie2 for host filtering can be replaced by a substantially faster combination of Fastp and Minimap2, respectively. Furthermore, we determine that the outputs of the new pipeline are comparable to those of the original pipeline.
