The design and evaluation of methods for describing the diversity of microbial life in environmental samples is a critical step towards understanding life on earth and towards making prudent interventions in a wide variety of microbe-driven systems.
Microbes in the environment, including bacteria, archaea, viruses, and single-celled eukaryotes, are primary drivers of numerous geological and atmospheric processes, such as carbon fixation and sequestration, nutrient cycling, soil formation, and even cloud formation. Cyanobacteria in the surface waters of the ocean are estimated to be responsible for half of the primary production on Earth. Microbes living in and on the human body are intimately involved in health and disease, even when they are not explicitly pathogenic; for instance, the gut is teeming with bacteria that are essential for digestion, but anomalies in this microbial community may contribute to disorders such as Crohn's disease. Environmental bacteria are critically important to climate change, agriculture, and public health, so understanding them has immediate practical importance, in addition to satisfying our scientific curiosity.
Environmental microbiology has long been limited by the fact that over 99% of bacteria found in the environment cannot yet be cultured, because the conditions required for growth have not yet been determined. In many cases, bacteria live in interdependent communities of species, making the growth conditions extremely complex and difficult to recreate, even if they could be determined. Thus, it is not possible to perform experiments on these organisms in the lab, or to acquire sufficient DNA to sequence their genomes in isolation. These limitations can be sidestepped through the use of culture-independent surveying techniques. With the availability of ever-cheaper DNA sequencing, methods that involve direct sequencing of DNA from environmental samples have now gained prominence, and are producing a deluge of data. However, the computational methods needed to make sense of these data are still in their infancy.
I evaluated methodological choices required for two kinds of culture-independent environmental sequencing techniques: taxonomic surveys using the 16S rRNA gene, and surveys of both taxonomy and function through shotgun sequencing. In both cases my goals were to increase the effectiveness of future studies in extracting biologically relevant information from environmental sequence datasets, and especially to head off misinterpretations of such datasets due to methodological errors that have been overlooked to date.
Microbial community composition using the 16S ribosomal RNA sequence
PCR amplification and sequencing of the gene for the 16S ribosomal RNA subunit directly from environmental samples is a long-standing method of measuring species richness and relative abundance. I demonstrated that the use of sequencing reads that are much shorter than the gene itself (as has recently become economical and thus popular) has the potential to introduce substantial error in such studies. However, I also established, through exhaustive computational experiments, that a judicious choice of PCR and sequencing primers can avoid these errors. In particular, I found that the region following primer E517F provides the maximum available taxonomic information in diverse environments, and that sequencing more than 100 nt provides little added value, a fact that justifies the use of next-generation sequencing technologies that are limited to a short read length. Notably, I obtained the same result for supervised classification of sequences into known taxa and for unsupervised clustering of similar sequences into potentially unknown taxa. These are very different problems, so the congruence of results confirms that the region following E517F is indeed more informative than other regions.
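As a rough illustration of the kind of computational experiment described above, the following Python sketch extracts the short read that would follow a primer in each reference gene and counts how many distinct reads remain at a given read length. It is not the procedure used in this work: the function names, the exact-match primer search, and the toy sequences and primer are all assumptions made for illustration, and the real E517F sequence is deliberately not reproduced here.

```python
def read_after_primer(gene, primer, read_length=100):
    """Return up to read_length bases immediately downstream of the primer.

    Assumption: the primer matches exactly; real primers may contain
    degenerate bases and tolerate mismatches. Returns None if not found.
    """
    gene = gene.upper()
    primer = primer.upper()
    start = gene.find(primer)
    if start == -1:
        return None
    begin = start + len(primer)
    return gene[begin:begin + read_length]


def count_distinct_reads(genes, primer, read_length):
    """Count distinct simulated reads across a collection of gene sequences.

    Genes in which the primer is not found are skipped.
    """
    reads = set()
    for gene in genes:
        read = read_after_primer(gene, primer, read_length)
        if read is not None:
            reads.add(read)
    return len(reads)


# Toy sequences and a hypothetical primer, for illustration only.
toy_genes = [
    "AAAACCGGTTACGTACGT",
    "TTTTCCGGTTACGTTTTT",
    "GGGGCCGGTTACGTACGA",
]
toy_primer = "CCGGTT"

short = count_distinct_reads(toy_genes, toy_primer, read_length=4)
longer = count_distinct_reads(toy_genes, toy_primer, read_length=8)
print(short, longer)
```

In this toy case the three genes are indistinguishable from 4-nt reads but fully resolved by 8-nt reads. Repeating such a count across primers and read lengths is one simple way to ask where additional read length stops adding resolution.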
Microbial species identification from environmental shotgun sequencing
The second culture-independent sequencing approach I addressed, known as "metagenomics" or "environmental genomics", does not target any specific gene but rather samples DNA sequences from the entire pool in an environment through shotgun sequencing. These data allow assessment of the range of metabolic functions present in a mixture of potentially many thousands of microbial species. A foundational problem in metagenomics is the assignment of sequences to known taxa, and the clustering of sequences into potentially unknown taxa. The surprising finding that sequence composition (i.e., statistical descriptions of the distribution of short words) can be discriminative of species identity has led to a wide range of proposed methods for both the supervised and the unsupervised variants of this "binning" problem, but the validation procedures applied to them have been both inconsistent and unrealistic. It has thus not been clear which method is best, or what performance can be expected in classifying real data. I reimplemented nearly all of the methods in the literature as special cases of a more general framework, allowing me to compare them on a common footing designed to mirror real circumstances.
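To make the notion of sequence composition concrete, the following Python sketch computes the normalized frequencies of short words (k-mers) in a sequence and assigns a fragment to whichever reference has the closest profile. This is only a minimal illustration of supervised composition-based binning, not a reimplementation of any method evaluated here; the word length, the Euclidean distance, and all names and toy sequences are assumptions.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Return normalized frequencies of all 4**k DNA words of length k.

    Words containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def euclidean(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_nearest(fragment, references, k=4):
    """Assign a fragment to the reference with the most similar profile.

    references maps a taxon label to a representative sequence.
    """
    profile = kmer_profile(fragment, k)
    return min(
        references,
        key=lambda name: euclidean(profile, kmer_profile(references[name], k)),
    )


# Toy reference sequences with deliberately extreme composition.
toy_references = {
    "at_rich": "ATATATTTAAATATAT" * 5,
    "gc_rich": "GCGCGGCCGCGCGGGC" * 5,
}
print(assign_to_nearest("ATTTAATATAAT", toy_references, k=2))
```

Real genomes differ far more subtly than these toy sequences, which is why the choice of word length, statistical model, and validation procedure matters.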
Infrastructure for large-scale reproducible computational research
Each of the above projects relies on large-scale simulations, which require careful coordination of thousands of compute jobs and management of their inputs and outputs. This can be particularly daunting in the face of frequent updates to both input datasets and analysis programs, each of which requires recomputation of dependent results. To manage these computations, I developed Verdant (the "Versioned Data Analysis Tool"), a system for describing, sharing, and executing computational workflows on a cluster that guarantees reproducibility of results. It provides a means of ensuring that a set of computational results is up to date with respect to the inputs and thus internally consistent. It also provides a means of sharing inputs, intermediate results, and final outputs in a manner that facilitates collaboration while avoiding redundant computation.
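The general idea of keeping results up to date while avoiding redundant computation can be sketched in Python as follows. This is not Verdant's actual design or interface, which is not described here; it is a generic illustration in which each result is keyed by a hash of the step name, the program version, and the contents of its inputs, so that a change to any of these triggers recomputation. All class and function names are hypothetical.

```python
import hashlib


def data_fingerprint(data):
    """Hash the raw bytes of an input dataset."""
    return hashlib.sha256(data).hexdigest()


def step_fingerprint(step_name, program_version, input_fingerprints):
    """Hash everything a result depends on: step, program version, inputs."""
    h = hashlib.sha256()
    for part in [step_name, program_version, *input_fingerprints]:
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so adjacent parts cannot run together
    return h.hexdigest()


class Workflow:
    """Hypothetical in-memory cache of step results keyed by fingerprint."""

    def __init__(self):
        self.cache = {}
        self.executions = 0  # how many times a step actually ran

    def run(self, step_name, program_version, func, inputs):
        """Run func(*inputs) unless an up-to-date result already exists."""
        key = step_fingerprint(
            step_name, program_version, [data_fingerprint(x) for x in inputs]
        )
        if key not in self.cache:
            self.executions += 1
            self.cache[key] = func(*inputs)
        return self.cache[key]


def concat(a, b):
    return a + b


wf = Workflow()
wf.run("concat", "v1", concat, [b"x", b"y"])  # computed
wf.run("concat", "v1", concat, [b"x", b"y"])  # reused from cache
wf.run("concat", "v1", concat, [b"x", b"z"])  # input changed: recomputed
wf.run("concat", "v2", concat, [b"x", b"z"])  # program changed: recomputed
print(wf.executions)
```

Because the cache key depends only on content and version, collaborators who share a cache of this kind would reuse each other's intermediate results automatically. A cluster-scale system would additionally need persistent storage and job scheduling, which this sketch omits.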