Bioinformatic Approaches for New Insights into Old Marine Metagenomic Data Sets
Sampling from marine environments followed by en masse high-throughput nucleic acid sequencing (metagenomics) will transform our understanding of marine microbial communities and their impacts on Earth's biogeochemistry. Interpreting a marine metagenomic (MMG) data set within its environmental context, and extrapolating to planetary scale, first requires answering two simpler questions about a sample: "who is there?" and "what are they doing?" Often, however, the majority of sequenced genomic fragments (reads) that make up an MMG data set cannot be assigned taxonomic or functional annotation by searching reference sequence databases. This stems partly from the bias of reference databases toward microbes amenable to study in the lab, the "culturable 1%." In addition, the complexity of MMG samples (thousands of populations per sample, and quadrillions to quintillions of nucleotides) means that the reads severely underrepresent each sample. Moreover, the shortness of reads makes annotation difficult. There are now tens of thousands of MMG data sets, each with many thousands to billions of mostly unannotated reads.
Learning from this massive resource of raw data will require new bioinformatics approaches. This dissertation explores approaches that pool different MMG data sets to improve annotation and to discover widespread marine microbes. First, we hypothesized that pooling MMG data sets and assembling the reads into longer sequences (contigs) would increase species and functional annotation of the reads. This proved true for the forty-two real MMG data sets we investigated. For simulated MMG data sets, pooled contigs rarely mixed reads from different species. This supports the view that pooled contigs, though a consensus of reads from different populations, are biologically interpretable, and that their annotation may be transferred to the constituent reads.
Second, given the high computational cost of assembly and the huge number of MMG data sets from which to select for pooling, we hypothesized that ranking data sets would make pooled assembly more efficient. This hypothesis was also supported. Ranking data sets by k-mer profile similarity resulted in pooled assembly rates on a par with those obtained by ranking on phylogenetic profile similarity. In practical terms, this means one can exploit pooled assembly to increase annotation without first having to annotate data sets individually to create phylogenetic profiles.
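The idea of ranking candidate data sets by k-mer profile similarity can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the dissertation: the function names, the choice of k = 4, the use of normalized k-mer frequencies as the profile, and cosine similarity as the comparison measure.

```python
from collections import Counter
from math import sqrt


def kmer_profile(reads, k=4):
    """Return normalized k-mer frequencies for a collection of reads.

    Assumption: k-mers containing characters other than A, C, G, T are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (dicts)."""
    dot = sum(v * q.get(kmer, 0.0) for kmer, v in p.items())
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def rank_by_kmer_similarity(target_reads, candidates, k=4):
    """Rank candidate data sets by profile similarity to a target data set.

    `candidates` maps a data set name to its list of reads (hypothetical layout).
    Returns (name, similarity) pairs, most similar first.
    """
    target = kmer_profile(target_reads, k)
    scored = [
        (name, cosine_similarity(target, kmer_profile(reads, k)))
        for name, reads in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, the highest-ranked candidates would be the first ones pooled with the target data set for assembly. No reference database is consulted at any point, which is what makes this kind of ranking possible before any annotation has been done.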
Third, we found that pooling MMG data sets can enable the discovery of ubiquitous and abundant marine microbial species, and the partial characterization of their genomes, without the need to culture them. This was accomplished by "geographic profiling" of public, pooled-assembled MMG data. Three novel species were predicted in multiple MMG data sets and substantiated with orthogonal lines of evidence. Experimental work corroborated the predicted sequence fragments for each of the species, and analyses of these fragments supported the conclusion that they likely represent abundant, ubiquitous novel species.