Deep metagenomic sequencing has the potential to illuminate the intra-species genomic variation of abundant microbial species. In this thesis, I develop a new tool, MIDAS (Metagenomic Intra-species Diversity Analysis System), for rapidly and automatically quantifying species abundance, single nucleotide polymorphisms (SNPs), and gene copy number variants (CNVs) from metagenomes. To illustrate the utility of this approach, I re-analyze three public datasets with MIDAS. First, I re-analyze stool metagenomes from 98 mother-infant pairs and use rare SNPs to track strain transmission. I find that early colonizers are likely transmitted from the mother, whereas late colonizers are likely transmitted from the environment. Second, I re-analyze >300 stool metagenomes from healthy adults and use SNPs to identify examples of both strain co-existence and strain co-exclusion. Third, I re-analyze 198 globally distributed marine metagenomes and use gene copy number variants to show that many species have population structure that correlates with geographic location. Strain-level genetic variants clearly reveal extensive structure and dynamics that are obscured when metagenomes are analyzed at coarser taxonomic resolution.
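The idea of tracking transmission with rare SNPs can be pictured as a marker-allele sharing score: alleles seen in a putative donor but absent from unrelated individuals act as strain markers, and the fraction of those markers recovered in a recipient suggests whether the strain was shared. The following is a minimal sketch under stated assumptions, not the MIDAS implementation; the function names, the representation of each sample as a set of (site, allele) pairs, and the toy data are all hypothetical.

```python
def marker_alleles(focal, reference_panel):
    """Return alleles present in `focal` but absent from every panel sample.

    `focal` is a set of (site, allele) pairs; `reference_panel` is a list of
    such sets from unrelated individuals (hypothetical layout).
    """
    seen_elsewhere = set().union(*reference_panel) if reference_panel else set()
    return focal - seen_elsewhere


def sharing_fraction(donor, recipient, reference_panel):
    """Fraction of the donor's marker alleles that also occur in the recipient.

    Returns None when the donor has no marker alleles, since the score is
    undefined in that case.
    """
    markers = marker_alleles(donor, reference_panel)
    if not markers:
        return None
    return len(markers & recipient) / len(markers)


# Toy data (invented for illustration only).
mother_a = {(101, "T"), (250, "G"), (300, "C")}
infant_a = {(101, "T"), (250, "G"), (300, "A")}
infant_b = {(101, "C"), (250, "A"), (300, "A")}
panel = [{(101, "C"), (250, "A"), (300, "C")}]

related = sharing_fraction(mother_a, infant_a, panel)
unrelated = sharing_fraction(mother_a, infant_b, panel)
```

In this toy case the mother's two marker alleles both appear in her own infant and neither appears in the unrelated infant. With real data, a high score in a related pair against a low background score among unrelated pairs would be the pattern consistent with transmission.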
High-throughput sequencing has firmly established itself as the leading method for assaying the structure and functional capacity of microbial communities. With this deluge of data, care must be taken to account for technical and biological artifacts in order to produce robust candidate biomarkers. Of particular interest is the use of mixed-effects models and nonlinear models to assess key differences between healthy and diseased individuals that arise over time. In my thesis work, I analyzed data from a longitudinal study of inflammatory bowel disease in mice with the aim of uncovering biological features predictive of abnormal microbiome development in the context of chronic inflammation. My analysis uncovered multiple taxa and gene families that have differential temporal trajectories, as well as a few gene families that stratify the diseased and wild-type subjects early on. This investigation led to a follow-up study of the underrepresented microbial genomes present in lab mice, to expand our knowledge of the model animal’s microbiome. Since the majority of microbiome studies aimed at future clinical impact are carried out in mice, it is important to know what separates human microbiomes from those of mice, in order to limit hypotheses that are not transferable. We found that even a modest single-cell sequencing effort leads to an appreciable gain in phylogenetic diversity and significantly improves the recruitment of short reads from unrelated mouse metagenomes. Overall, I have shown that robust findings are possible even with a limited set of subjects if one leverages a nuanced statistical modeling approach and undertakes targeted acquisition of new data.
Gene regulation can contribute to phenotypic divergence across species and cell types. By comparing regulatory regions between cell types and between species, we can gain an understanding of how sequence changes affect gene regulation and ultimately organismal phenotypes and disease. Using computational methods, I quantified motif enrichment between sets of enhancers in order to characterize functional differences. I was able to identify transcription factors that showed a significant difference in the number of motifs enriched in homologous mouse and human cardiomyocyte enhancers. I also identified differentially enriched transcription factor motifs in embryonic stem cells and differentiated cardiomyocytes. The same methods were applied to a third dataset in order to detect differences between binding sites that were unique to mutant SOX2 and binding sites that were shared between wildtype and mutant SOX2. I found significant depletion of the OCT4:SOX2 motif in mutant SOX2 binding sites. In addition, I used a comparative genomics approach to identify regions that evolved rapidly in the bat ancestor but are highly conserved in other vertebrates. I discovered 166 bat accelerated regions (BARs) that overlap epigenetic marks in developing mouse limbs and validated their function in limb development. Of particular note was an enhancer near the HoxD cluster that shows forelimb-specific expression in bats compared to mice.
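Motif enrichment between two sets of enhancers is often framed as a contingency-table test: count how many sequences in each set contain a given motif and ask whether the foreground count is higher than expected by chance. The sketch below uses a one-sided hypergeometric (Fisher-style) test. This is an assumed, generic formulation and not necessarily the statistic used in the thesis; the function name and all counts are hypothetical.

```python
from math import comb


def enrichment_p_value(fg_hits, fg_total, bg_hits, bg_total):
    """One-sided hypergeometric p-value for motif enrichment.

    Probability of observing at least `fg_hits` motif-containing sequences in
    a foreground of size `fg_total`, given the pooled number of hits across
    foreground and background.
    """
    total = fg_total + bg_total
    hits = fg_hits + bg_hits
    denom = comb(total, fg_total)
    upper = min(fg_total, hits)
    tail = sum(
        comb(hits, k) * comb(total - hits, fg_total - k)
        for k in range(fg_hits, upper + 1)
    )
    return tail / denom


# Invented counts: a motif found in 40 of 200 foreground enhancers
# versus 20 of 400 background enhancers.
p = enrichment_p_value(fg_hits=40, fg_total=200, bg_hits=20, bg_total=400)
```

Depletion can be tested the same way by swapping the roles of the two sets. When many motifs are tested, the resulting p-values would normally need a multiple-testing correction.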
Congenital heart defects (CHD) occur in nearly one percent of live births each year and are the leading cause of defect-associated infant mortality. In spite of the growing size of disease cohorts, the molecular underpinnings of most cases remain unexplained. Given the high recurrence rate of CHD in families, we expect much of the cause to be found within patient genomes, but extensive genetic heterogeneity limits our ability to statistically confirm risk loci. Previously validated causal mutations occur in a wide range of genes that encode proteins in signaling and migration, chromatin remodelers that induce lineage specification, and transcription factors regulating the expression of these genes. In order to identify cryptic risk loci, my thesis has focused on creating novel computational approaches to overcome statistical challenges and broaden our understanding of the mechanisms that can lead to CHD. By integrating protein-protein interaction networks of cardiac transcription factors with whole exome sequencing, I showed that interactors are enriched for rare and de novo mutations in CHD patients. I developed a variant prioritization scheme for de novo variants, which identified a GLYR1 mutation that destabilizes its interaction with cardiac transcription factor GATA4. I describe GCOD, a novel algorithm that uses probabilistic modeling to identify sets of genes predicted to interact in the etiology of CHD, including a novel genetic interaction between GATA6 and POR. Finally, in addition to coding mutations, I aimed to assess whether disruption of chromatin organization contributes to disease by characterizing three CHD patient variants that I predicted would alter the regulatory landscape of heart-relevant genes. My work has increased our repertoire of known and suspected disease loci in CHD and related developmental co-morbidities, and provided evidence of oligogenic combinations and disrupted genome folding as mechanisms in CHD.