Enhancer evolution in the Drosophila montium subgroup
Enhancers drive spatiotemporal patterns of gene expression and play critical roles in development, disease, and evolution. Decades of research have yielded key insights, but many questions remain unanswered. A hallmark of enhancer evolution is functional conservation in the presence of extensive sequence divergence. However, identifying important mutational events between divergent sequences has been challenging. To overcome this challenge, I adopted a comparative genomic approach: sequence and assemble dozens of closely related species, and study enhancer evolution at the earliest stages of divergence. Such a data set provides an unprecedented opportunity to identify key changes and events (along with their context) before they are obscured by additional mutations. I started by sequencing and assembling 23 genomes from the Drosophila montium subgroup, a large group of closely related species. I also aligned each montium assembly to the extensively annotated D. melanogaster genome. The average scaffold NG50 is 76 kb, but it varies widely (19 to 400 kb) depending on repeat content and heterozygosity levels. Despite large differences in contiguity, all montium assemblies contain high percentages of known genes and enhancers, demonstrating their suitability for this comparative genomic approach. To support my subsequent analyses, I also reconstructed the montium subgroup phylogeny using 20 Bicoid-dependent enhancers.
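For readers unfamiliar with the contiguity statistic quoted above, the scaffold NG50 is the length L such that scaffolds of length L or longer together cover at least half of the estimated genome size (as opposed to N50, which uses half of the assembly length). The following is a minimal sketch of that standard definition, not the pipeline used for these assemblies; the function name `ng50` and the toy inputs are assumptions made for illustration.

```python
def ng50(scaffold_lengths, genome_size):
    """Return the scaffold NG50 for one assembly.

    scaffold_lengths: iterable of scaffold lengths (any consistent unit).
    genome_size: estimated genome size in the same unit (an external estimate,
                 not the assembly length).
    Returns 0 if the assembly covers less than half of the genome size.
    """
    target = genome_size / 2
    running_total = 0
    # Walk scaffolds from longest to shortest until half the genome is covered.
    for length in sorted(scaffold_lengths, reverse=True):
        running_total += length
        if running_total >= target:
            return length
    return 0


# Hypothetical toy assembly: four scaffolds and an assumed genome size of 1000.
toy_ng50 = ng50([100, 400, 200, 300], genome_size=1000)
```

Because NG50 is anchored to the genome size estimate rather than the assembly length, it allows contiguity to be compared across assemblies of different total length.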
Next, I leveraged this new genomic resource to study enhancer evolution across 24 montium species and D. melanogaster. I started with the extensively characterized eve stripe 2 enhancer, and showed how patterns of (apparent) conservation and variation could be used to direct targeted mutagenesis experiments and to inform models of enhancer grammar. To study binding site turnover on a large scale, I investigated hundreds of ChIP peaks for the transcription factors Bicoid, Krüppel, and Zelda. I treated groups of orthologous binding site scores as continuous traits, reconstructed ancestral scores at each node of the species tree, and then calculated score changes along each branch of the tree. For all three factors, binding sites were more likely to be gained along branches of the tree that also lost a binding site. This was true for both conserved and non-conserved sites, and most differences were statistically significant. However, I observed similar patterns when I repeated the analyses using shuffled matrices, leaving me unable to conclude that these were meaningful changes in transcription factor binding. Future analyses will focus on mitigating the effects of several confounding factors, including non-functional montium sequences, the forced gradualism of the Brownian motion model, and ancestral character estimation with a single species tree in the presence of widespread incomplete lineage sorting and/or introgression.
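The branch-level step of the analysis described above can be sketched as follows. This is a simplified illustration, not the analysis code: it assumes ancestral scores have already been reconstructed for every internal node, and the tree encoding (`parent_of`), the node names, the scores, and the gain/loss `threshold` are all hypothetical.

```python
def branch_changes(parent_of, scores):
    """Score change (child minus parent) for each site along each branch.

    parent_of: dict mapping each non-root node to its parent node.
    scores: dict mapping every node to a list of binding site scores,
            one per group of orthologous sites, in a fixed order.
    A branch is identified by its child node.
    """
    return {
        child: [c - p for c, p in zip(scores[child], scores[parent])]
        for child, parent in parent_of.items()
    }


def classify_branches(changes, threshold):
    """For each branch, report (any site gained, any site lost).

    A site counts as gained if its score rose by at least `threshold`
    along the branch, and as lost if it fell by at least `threshold`.
    """
    return {
        branch: (
            any(delta >= threshold for delta in deltas),
            any(delta <= -threshold for delta in deltas),
        )
        for branch, deltas in changes.items()
    }


# Hypothetical three-species tree: ((A, B)N1, C)root, with two sites per node.
parent_of = {"A": "N1", "B": "N1", "N1": "root", "C": "root"}
scores = {
    "root": [5.0, 1.0],
    "N1": [5.0, 1.0],
    "A": [1.0, 6.0],  # first site weakened, second site strengthened
    "B": [5.0, 1.0],
    "C": [5.0, 2.0],
}

changes = branch_changes(parent_of, scores)
summary = classify_branches(changes, threshold=3.0)
# Branches on which a gain and a loss co-occur.
both = [branch for branch, (gained, lost) in summary.items() if gained and lost]
```

Tallying branches by their (gained, lost) flags yields the contingency counts needed to ask whether gains are enriched on branches that also carry a loss. Repeating the same tally with scores computed from shuffled matrices provides the control comparison mentioned above.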
Finally, in collaboration with Carolyn Elya and Michael Eisen, I worked on assembling the genome of the Drosophila-manipulating fungus Entomophthora muscae ‘Berkeley’. This is an excellent system with which to study the mechanistic basis of parasite-induced manipulations. Infected flies exhibit a suite of behavioral changes, including summit disease, proboscis extension/attachment, and raised/spread wings. Compared to most previously sequenced fungi, the genome is extremely large and repetitive. The total scaffold length is 1.24 Gb, but the haploid genome size might be around 650 Mb. Polyploidy appears to be common among related entomopathogenic fungi, so estimating the haploid genome size in the absence of additional experimental data is challenging. At least 85% of the genome consists of repeats. In fact, the genome is so repeat-rich that aligning any pair of scaffolds produces characteristic X-alignments, where the forward strand of the first scaffold also aligns to the reverse complement of the second scaffold. The assembly appears to be missing many known fungal genes, but the significance of this is unclear. For genes that are present, the genome often appears to contain two distinct haplotypes. In many cases these haplotypes were assembled independently on different scaffolds, but many were also collapsed into single sequences. The alignment of PacBio long reads to the assembly suggests that it contains numerous misassemblies. This was probably unavoidable given the genome’s dense repeat structure. Future efforts will focus on improving the assembly. Going forward, the E. muscae ‘Berkeley’ genome will support our efforts to understand the molecular basis of fungal-induced behavioral manipulations in D. melanogaster.