As DNA sequencing data becomes more prevalent, population genetics and evolutionary genomics are becoming increasingly data-driven, requiring the development of new tools to work with large datasets. In my doctoral research, I develop bioinformatic tools to study two important contributors to evolution: admixture and symbiosis. In my first chapter, I apply a coalescent theory model to Drosophila melanogaster populations and show evidence for African and European admixture in sub-Saharan populations of this species. In subsequent chapters, I develop a data mining approach to classify bacterial symbiont infections in public sequencing databases. Moreover, I investigate bacterial density within a wide range of host species and find evidence for symbiont-induced host genome evolution.
A population pedigree is a graph that captures the totality of the family and genetic histories within a population. While pedigrees contain a wealth of useful information for genomic studies, assembling one is often tedious, time-consuming, and fraught with error. A combination of highly multiplexed low-coverage sequencing, genotype imputation, and relationship inference software makes it feasible to develop a pedigree cheaply and efficiently. By applying this approach to an experimental admixed Drosophila melanogaster population, we developed a dataset that contains genome-wide variants for thousands of individuals in our population. We were also able to confidently identify over one thousand parent-offspring relationships from almost four thousand sequenced samples. However, we were not able to construct a complete pedigree due to overestimates of relatedness resulting from our population’s mixed ancestry. Implementing software that accounts for population structure could rectify this issue and provide more accurate relationship inference within our population.
The COVID-19 pandemic of 2020 was one of the first major global public health crises in the post-genomic era, inspiring truly unprecedented levels of viral genome sequencing. In the realm of phylogenetics, or the reconstruction of ancestral relationships between extant sequences, essentially no software existed capable of handling the full dataset in a timely and effective manner. Because phylogenetics is critical for the identification and tracking of major variants, particularly the famous Variants of Concern (VOC), scalable tools were desperately needed. I, along with several collaborators, developed an efficient toolkit for the construction, manipulation, and analysis of massive phylogenetic trees. Our core data structure, the mutation annotated tree (MAT), is capable of storing millions of SARS-CoV-2 genomes in less than a gigabyte of data. My key contribution was the development of matUtils, a C++ library and command line toolkit to manipulate these highly compact data files. I additionally developed BTE, a highly efficient API making our phylogenetics software available in a Python environment. I subsequently developed analytical approaches that combine these new tools with the availability and massive scale of the SARS-CoV-2 data. Among these is scalable phylogeographic inference, through the daily-updated website Cluster-Tracker. Cluster-Tracker uses a simple heuristic I developed to efficiently identify and present local transmission clusters for public health track-and-trace efforts. I also designed an approach to the identification of novel SARS-CoV-2 strains and integrated it with the popular Pango lineage system. Altogether, this dissertation presents a body of work contributing substantially to an effective global public health response to the SARS-CoV-2 pandemic.
Transfer RNAs (tRNAs) are essential components of translation across all domains of life. The importance of this function is reflected in the strength of their conservation at the genome level, as well as their presence in hundreds of copies across each eukaryotic genome. Their strong conservation and high copy number at the genome level, in conjunction with their extensive post-transcriptional modifications and extreme variation in transcriptional activity by locus, make tRNA genes an enticing but as yet understudied model gene family. The requirement of tRNA transcripts in exceptionally large quantities causes tRNA loci to experience among the highest rates of transcription in the genome. Consequently, transcription-associated mutagenesis (TAM) and natural selection leave distinct genomic signatures at highly transcribed tRNA loci, such that tRNA genes are strongly conserved despite elevated mutation rates, and their immediate flanking regions are among the most variable sites in the genome. Here, I characterize the relationship between expression, mutation, and selection at tRNA loci in detail by using population genetics, comparative genomics, epigenetics, and transcriptomic data. I then use these findings to engineer a random-forest model to predict tRNA gene transcriptional activity using only DNA data. In the second half of this dissertation, I use the comparative genomics skills developed in the first part to help develop a novel phylogenetics toolkit. I identify the effects of sequencing errors on large SARS-CoV-2 phylogenies at global and local scales, demonstrate a novel method to quickly add samples to phylogenies, and explore recombination events in SARS-CoV-2 data, finding an excess in the region surrounding the Spike protein. In this dissertation, I use publicly available DNA, RNA, and epigenetic data to develop novel bioinformatic analysis methods.
Together, the conclusions drawn in this dissertation for both tRNA biology and SARS-CoV-2 answer fundamental evolutionary questions.