eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Novel Applications of Machine Learning and Statistics for Genome-resolved Metagenomic Data

Abstract

By sequencing environmental DNA and reconstructing microbial genomes, we can obtain insight into the previously hidden microbial world. This approach, known as genome-resolved metagenomics, has been used to study microorganisms in a variety of environments. Past genome-resolved metagenomics studies typically had small sample sizes, so few statistical methods were applied to the resulting data. Instead, analyses focused on aspects that did not require statistical methods, such as identifying the metabolic pathways encoded by the genomes and the phylogenetic relationships between organisms. In recent years, however, decreased sequencing costs and greater availability of computational resources have enabled scientists to sequence and process hundreds of samples for a single study. This dissertation demonstrates the application of several statistical and machine learning methods to the interpretation and strategic analysis of data from high-throughput genome-resolved metagenomic studies. By combining new methods with existing ones, this work illustrates the potential benefits that quantitative methods of analysis can offer to the field of genome-resolved metagenomics.

The first chapter of this dissertation serves as an example of a traditional genome-resolved metagenomics study, using primarily manual methods of analysis after the main steps of the data processing pipeline (including assembly, binning, and annotation) are complete. The manual methods of analysis applied in this small-scale study enable us to understand what microbes are present in a particular bioreactor community, and what metabolic functions these microbes are capable of. This contrasts with the much more data-intensive studies in the latter chapters, in which manual analyses would not be an efficient use of the data.

The second and third chapters, both focused on very large-scale data from the premature infant gut microbiome, illustrate the use of statistical methods for deciphering relationships in complex systems. This includes machine learning techniques applied to metagenome-assembled genomes to make predictions that may be useful in determining optimal care for a patient, as well as more basic statistical methods that allow us to better understand the gut microbiome and how it is influenced by external factors.

The fourth chapter focuses on the development of a new method that takes the hierarchical structure of genome-resolved metagenomic data into account. With genes in pathways, pathways in genomes, and genomes in communities of microorganisms, traditional ways of comparing samples fail to fully elucidate the biological systems because they do not account for all levels of the hierarchy. To address this problem, the new concept described here allows for the inclusion of both functional and phylogenetic data, making fuller use of the breadth of information available in genome-resolved metagenomic data. The combination of quantitative approaches with genome-resolved metagenomics may lead to a more robust understanding of microbial communities.
