Coordinating the expression of more than 20,000 human genes in the correct spatial and temporal patterns is a complex process. Surprisingly, humans and flies differ only modestly in gene number; the difference in complexity between these organisms may instead be explained by the regulatory processes that control gene expression. Part of this regulation is encoded in DNA as cis-regulatory sequences, or motifs, which are bound by trans-acting factors on DNA (transcriptional regulation) or on mRNA (post-transcriptional regulation). Tight control of gene expression is important because misregulation of genes can give rise to disease. Understanding these regulatory pathways may therefore improve human health by guiding the development of therapeutics.
As the number of sequenced genomes and the volume of high-throughput experimental data grow, bioinformatics methods have proven useful for identifying cis-regulatory sequences and trans-acting factors. This dissertation focuses on developing, validating, and applying computational methodologies to experimental data from whole-genome expression profiles and chromatin immunoprecipitation on chip (ChIP-chip) in order to understand regulatory pathways in mammalian species. As part of this work, we developed a novel computational algorithm, CompMoby, which uses species-specific and evolutionary conservation information to predict motifs from sets of co-regulated genes without prior knowledge of the trans-acting factors. Some of the putative motifs identified by CompMoby were validated experimentally using molecular biology techniques such as reporter constructs and quantitative RT-PCR, while others were supported by the published literature.
The major findings from this work include: successful application of CompMoby to high-throughput experimental data in various mammalian systems; identification of novel cis-regulatory sites active in mouse and human embryonic stem cells; characterization of the transcription factor NF-Y, which is required for embryonic stem cell proliferation; identification of the glucocorticoid and androgen receptor consensus motifs and putative co-factors, along with the conservation of the binding sites and their surrounding regions; and successful application of CompMoby to identify human tissue-specific miRNA targets. The main contribution of this work is a computational algorithm for the systematic identification of cis-regulatory sequences that can be applied broadly to understanding regulatory pathways in various mammalian systems.