The regulation of gene expression is thought to play a critical role in the development of life's complexity and has become one of biology's most intensely studied areas of research. In a small set of model systems, this research has yielded a catalog of components and the mechanisms by which they work together to activate and repress expression. A variety of genomic approaches hold the promise of both generalizing these mechanisms and, through the use of statistical models, generating new insights. However, this new wealth of genomic information has provided more raw data than new understanding, due in part to the failure of these statistical models to account for the inherent complexity of biological information.
Here I demonstrate three approaches, spanning the analysis of binding sites, promoter regions, and developmental enhancers, for creating more biologically informed statistical models, gaining insight from them, and, most importantly, showing why they are needed.
First, I show how measuring the evolutionary properties of a transcription factor's binding sites can help distinguish those sites from other sequences. Making that distinction typically requires interpreting a score by means of a p-value, but, contrary to common usage, I find that the optimal p-value threshold can differ greatly between transcription factors. Second, I develop a graphical model that can describe and exploit trends in the positioning of transcription factor binding sites within promoters. Binding sites are short and degenerate, and by themselves do not carry enough information for the organism to recognize promoters. However, I show that these positional trends can greatly increase the information available for recognition, and that they can be applied to the bioinformatic promoter recognition problem. Third, I use evolutionary simulations to construct a null model for the relative positioning and conservation of binding sites within developmental enhancers. I use this model to show that much of the evidence cited for overlapping and clustered sites being functional necessities of enhancer organization can be reproduced as an artifact of constraint on binding site composition alone. Finally, I discuss progress towards testing spatially scrambled enhancers generated from these models in transgenic Drosophila embryos.
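The step of interpreting a binding site score through a p-value can be illustrated with a minimal sketch. It assumes a position weight matrix scored as log-odds against a uniform background, and it computes the p-value by exhaustively enumerating all sequences of the motif's length; neither choice is specified in this abstract. The motif `TOY_FREQS` and the function names are hypothetical and serve only as an illustration.

```python
import itertools
import math

BASES = "ACGT"

def log_odds_matrix(freqs, background=0.25):
    """Convert per-position base frequencies to log2-odds scores
    against a uniform background (an assumption of this sketch)."""
    return [{b: math.log2(col[b] / background) for b in BASES} for col in freqs]

def score(pwm, seq):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(col[base] for col, base in zip(pwm, seq))

def p_value(pwm, observed):
    """Fraction of all k-mers scoring at least `observed`,
    i.e. the chance of such a score under the uniform background.
    A small tolerance guards against floating-point rounding."""
    k = len(pwm)
    kmers = itertools.product(BASES, repeat=k)
    hits = sum(1 for kmer in kmers if score(pwm, kmer) >= observed - 1e-9)
    return hits / 4 ** k

# Hypothetical 4-bp motif with consensus AGCT (not from the source).
TOY_FREQS = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
pwm = log_odds_matrix(TOY_FREQS)

for site in ("AGCT", "AGCA"):
    s = score(pwm, site)
    print(site, round(s, 3), p_value(pwm, s))
```

Because the mapping from score to p-value depends on the motif's length and degeneracy, a fixed p-value cutoff corresponds to different levels of stringency for different factors. Exhaustive enumeration is only practical for short motifs; longer ones would need a dynamic-programming or sampling approach.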