eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Strain-resolved metagenomic analysis of the premature infant microbiome and other natural microbial communities

Abstract

Microorganisms are critical to immune system development and physiology, yet the factors that drive initial colonization and human microbiome assembly are largely unknown. Early molecular approaches to study the microbiology of human body sites relied upon the direct amplification and sequencing of 16S rRNA genes, but this only produces a coarse catalog of the organisms present. In contrast, genome-resolved metagenomics involves recovery of genomes directly from whole community DNA, enabling prediction of biosynthetic capacities and providing the ability to differentiate the capabilities of closely related strains. However, genome-resolved metagenomics is computationally challenging and validated methods for many types of analyses are lacking. In this thesis, custom genome-resolved metagenomic methods were developed to determine the structure of microbial communities, monitor the activities of bacteria in situ, and track their evolution. The research focused primarily on the colonization and development of microbial communities that live in, and on, premature infants. Discovered patterns of fine-scale bacterial diversity, evolution, and functional potential shed light on early microbiome assembly, and highlight factors that contribute to necrotizing enterocolitis, one of the most common diseases of premature infants. Transmission pathways and reservoirs of bacteria and microbial eukaryotes that can cause nosocomial infections were identified.

Due to the sensitivity of metagenomic methods, foreign DNA sequences can be detected in metagenomic datasets. A case study was conducted to trace sequences introduced into metagenomes, and the specific physical source of the contaminant was identified. Through very detailed genome sequence comparisons it was possible to measure the in situ evolution rate of the reagent contaminant over a three-year period, enabling a nearly correct estimate of when the contaminant was introduced into the reagent production facility. The research established methods for genome-resolved metagenomics-based microbial forensics and strain tracking and showed that their application is possible even in extremely complex environments like soil.

High-resolution analyses are needed to determine if microbes in different environments are the result of strain transfer events. The necessary genome-wide comparisons cannot be performed with 16S rRNA sequencing, the most common method for study of the human microbiome, leaving basic questions related to strain-level diversity and body-site specificity unanswered. To address these questions, strain-tracking analyses were applied to metagenomes derived from samples of the mouth, skin, and gut microbiomes of premature infants. The results highlight the extreme lack of body-site diversity during very early colonization of premature infants in the neonatal intensive care unit. Surprisingly, identical bacterial genomes for organisms such as Escherichia coli, Klebsiella pneumoniae and Citrobacter koseri were found in the mouth, skin, and gut microbiomes. Differential genome coverage was used to measure their bacterial population replication rates in situ. In all cases, the replication rates of the same bacterial populations were faster in the mouth and on the skin than in the gut, despite the fact that these bacteria are traditionally considered gut colonists. Finally, strain-level analysis of polymorphic sites across the C. koseri genome was used to define 10 subpopulations, implying initial colonization of premature infants by multiple individual cells with distinct genotypes.
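The idea behind coverage-based replication estimates can be sketched as follows. In a growing population, genome regions near the origin of replication are sequenced more deeply than regions near the terminus, so the spread of coverage across a genome reflects how actively the population is replicating. This is a minimal illustration and not necessarily the method used in the thesis: the function name `replication_index`, the per-window coverage input, and the choice of a least-squares fit on log2 coverage are all assumptions made for the example.

```python
import math

def replication_index(window_coverage):
    """Estimate an origin-to-terminus coverage ratio (hypothetical helper).

    Sorts the windows by coverage, fits a least-squares line to
    log2(coverage) against rank, and returns the ratio of the fitted
    value at the highest rank to the fitted value at the lowest rank.
    """
    cov = sorted(c for c in window_coverage if c > 0)
    n = len(cov)
    if n < 2:
        raise ValueError("need at least two windows with non-zero coverage")
    ys = [math.log2(c) for c in cov]
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(range(n), ys))
    slope = sxy / sxx
    # The fitted line rises by slope * (n - 1) log2 units from end to end.
    return 2 ** (slope * (n - 1))
```

A value near 1 means coverage is flat across the genome (little replication), while larger values suggest that more cells are mid-replication. Comparing this value for the same population in different samples is the kind of comparison the paragraph above describes.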

Methods for rapid, accurate and reproducible “de-replication”, the process of grouping recovered genomes together based on similarity and choosing the best representative genome from each group, are needed for genome-resolved metagenomic analyses. This task requires a number of steps, including mass pairwise genome comparison, evaluation of completeness and contamination, and generation of explanatory figures to visualize and validate the de-replication process. An open-source program, “dRep”, was developed and validated. The method achieves results very similar to those of naive pairwise clustering algorithms, but with an order of magnitude speed increase due to use of a biphasic algorithm. Importantly, individual samples in sample sets can now be assembled independently and the genomes effectively de-replicated, reducing the incidence of chimeric sequences and improving genome recovery over results for co-assemblies.
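A two-phase de-replication of this kind can be sketched as follows. The sketch assumes that the first phase uses a cheap approximate comparison to form coarse clusters and that the costly comparison is then run only within each coarse cluster, which is where the speed-up would come from. Everything here is a toy stand-in rather than dRep's actual implementation: the k-mer Jaccard and positional identity functions, the thresholds, and the single `quality` score per genome are assumptions for illustration.

```python
from itertools import combinations

def kmer_set(seq, k=4):
    """All k-mers in a sequence (toy stand-in for a fast sketch)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Cheap similarity between two k-mer sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def identity(a, b):
    """Toy 'expensive' comparison: fraction of matching positions."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def single_linkage(names, similar):
    """Group names into connected components of the 'similar' relation."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if similar(a, b):
            parent[find(a)] = find(b)
    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())

def dereplicate(genomes, quality, coarse=0.5, fine=0.9):
    """Return one representative per fine cluster (hypothetical thresholds)."""
    sketches = {n: kmer_set(s) for n, s in genomes.items()}
    # Phase 1: coarse clusters from the cheap comparison.
    primary = single_linkage(
        list(genomes),
        lambda a, b: jaccard(sketches[a], sketches[b]) >= coarse,
    )
    reps = []
    for cluster in primary:
        # Phase 2: costly comparison only within each coarse cluster.
        secondary = single_linkage(
            cluster,
            lambda a, b: identity(genomes[a], genomes[b]) >= fine,
        )
        for group in secondary:
            reps.append(max(group, key=lambda n: quality[n]))
    return sorted(reps)
```

With three made-up sequences, two of which differ at a single position, `dereplicate` keeps the higher-quality genome of the similar pair plus the unrelated one. The costly comparison is never run between the unrelated genome and the others, because the cheap comparison already put them in different coarse clusters.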

Delineation of bacteria as belonging to the same versus different species is a longstanding problem in microbiology. Fundamental questions related to the existence of species and how species should be differentiated remain, and the answers have both practical and evolutionary implications. A large public dataset comprising >5,000 genomes acquired directly from metagenomes was analyzed. In conjunction, genome-based metrics that could be used to define bacterial species boundaries were evaluated. A distinct gap in the distribution of average nucleotide identity (ANI) values at 95% ANI exists, supporting the existence of discrete species. ANI was compared with metrics of selection for non-synonymous versus synonymous substitutions and for homologous recombination to identify processes that could lead to species clusters. The 95% ANI value corresponds approximately with the genetic distance beyond which homologous recombination drops to near zero. The findings implicate sequence divergence-based breakdown in homologous recombination as the evolutionary force responsible for bacterial speciation. Fifty genes were evaluated to provide a practical means to define species content when genomes are not recovered from metagenomes for most community members. Although 16S rRNA gene sequences cannot be used for this purpose, the nucleotide sequences of several ribosomal proteins were found to be reasonable proxies for the relevant genome ANI value.
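As a small illustration of what a gap in a distribution of pairwise ANI values means in practice, the sketch below finds the widest empty interval in a list of values and applies a 95% cutoff to a pair. The function names are hypothetical, and any input values used with it here are made up for illustration, not taken from the thesis.

```python
def largest_gap(ani_values):
    """Return (low, high) bounding the widest empty interval in the values."""
    vals = sorted(ani_values)
    if len(vals) < 2:
        raise ValueError("need at least two values")
    # Compare each value with its successor and keep the widest step.
    low, high = max(zip(vals, vals[1:]), key=lambda p: p[1] - p[0])
    return low, high

def same_species(ani, threshold=95.0):
    """Apply an ANI cutoff (95% by default) to one genome pair."""
    return ani >= threshold
```

For made-up pairwise values such as 99.1, 98.4, 97.2, 96.5, 88.0, 86.3, 84.9 and 83.1, the widest interval runs from 88.0 to 96.5 and so contains the 95% cutoff, which is the pattern the paragraph above reports for the real data.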

Microbial eukaryotes are particularly understudied in the human microbiome, yet they are considered to be emerging health threats. Genomes from microbial eukaryotes can be reconstructed from human microbiome metagenomic datasets. The results are greatly improved through use of a recently developed machine learning algorithm, EukRep, which can identify eukaryotic DNA based on the sequences alone. EukRep was used to scan thousands of metagenomes from the premature infant gut and hospital room environments, and fourteen novel eukaryotic genomes were reconstructed. Two of these, for a Diptera (fly) and a Rhabditida (worm), were novel at the class level. Importantly from the perspective of tracking nosocomial agents, genomes from the same eukaryotic species were recovered from both infants and hospital room environments. Population heterogeneity and zygosity were lower in genomes recovered from the hospital room than in those recovered from premature infant samples, which could reflect years of inbreeding or strong selection imposed by room conditions. Together, this work indicates that the hospital room, especially the sink, may be a reservoir of infant-colonizing fungal strains.

Necrotizing enterocolitis (NEC) involves extreme bowel inflammation and necrosis and has a mortality rate of around 30%. Various lines of evidence point to the human microbiome as a central factor in disease development, yet no consistent microbial signal for NEC onset has been identified. We performed large-scale genome-resolved metagenomic analyses of thousands of prospectively collected premature infant fecal samples and used a machine learning classifier to identify signals that predict imminent development of NEC. Significant associations were found with the abundance of Klebsiella, bacteria encoding fimbriae, specific types of secondary metabolite clusters, and the overall in situ growth rate of bacteria. NEC development could be promoted by metabolic imbalances related to the rampant growth of particular bacterial strains and/or the stimulation of human TLR4 receptors by Klebsiella and fimbriae. Together, the results identify potential biomarkers for early detection of NEC and possible targets for microbiome-based therapeutics and probiotics.
