Our inability to associate distant regulatory elements with the genes that they regulate has largely precluded their examination for sequence alterations contributing to human disease. One major obstacle is the large genomic space surrounding targeted genes in which such elements could potentially reside¹. In order to delineate gene regulatory boundaries we used whole-genome human-mouse-chicken (HMC) and human-mouse-frog (HMF) multiple alignments to compile conserved blocks of synteny (CBS), under the hypothesis that these blocks have been kept intact throughout evolution at least in part by the requirement of regulatory elements to stay linked to the genes that they regulate. A total of 2,116 and 1,942 CBS >200 kb were assembled for HMC and HMF respectively, encompassing 1.53 and 0.86 Gb of human sequence. To support the existence of complex long-range regulatory domains within these CBS we analyzed the prevalence and distribution of chromosomal aberrations leading to position effects (disruption of a gene's regulatory environment)¹,², observing a clear bias not only for mapping onto CBS but also for longer CBS size. Our results provide a genome-wide dataset characterizing the regulatory domains of genes and the conserved regulatory elements within them.
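As a rough illustration of the bookkeeping behind the figures above (blocks longer than 200 kb, total sequence covered, and whether an aberration breakpoint maps onto a block), the following minimal sketch may help. It is not the authors' pipeline: the `Block` type, the function names and the coordinates are hypothetical, and half-open coordinates are an assumption made here for simplicity.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    """A hypothetical conserved block of synteny on the human genome."""
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive

    @property
    def length(self) -> int:
        return self.end - self.start


def filter_blocks(blocks: List[Block], min_length: int = 200_000) -> List[Block]:
    """Keep only blocks strictly longer than min_length (200 kb by default)."""
    return [b for b in blocks if b.length > min_length]


def total_span(blocks: List[Block]) -> int:
    """Total number of bases covered, assuming the blocks do not overlap."""
    return sum(b.length for b in blocks)


def blocks_containing(blocks: List[Block], chrom: str, pos: int) -> List[Block]:
    """Return the blocks that contain a breakpoint at chrom:pos."""
    return [b for b in blocks if b.chrom == chrom and b.start <= pos < b.end]


# Made-up example coordinates, for illustration only.
example = [
    Block("chr1", 0, 300_000),
    Block("chr1", 400_000, 500_000),
    Block("chr2", 0, 250_000),
]
kept = filter_blocks(example)
```

With real data, comparing the fraction of breakpoints that fall inside blocks against the fraction of the genome the blocks cover would give a first impression of the mapping bias described above.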
Cross-species DNA sequence comparison is the primary method used to identify functional noncoding elements in human and other large genomes. However, little is known about the relative merits of evolutionarily close and distant sequence comparisons, due to the lack of a universal metric for sequence conservation, and also the paucity of empirically defined benchmark sets of cis-regulatory elements. To address this problem, we developed a general-purpose algorithm (Gumby) that detects slowly-evolving regions in primate, mammalian and more distant comparisons without requiring adjustment of parameters, and ranks conserved elements by P-value using Karlin-Altschul statistics. We benchmarked Gumby predictions against previously identified cis-regulatory elements at diverse genomic loci, and also tested numerous extremely conserved human-rodent sequences for transcriptional enhancer activity using reporter-gene assays in transgenic mice. Human regulatory elements were identified with acceptable sensitivity and specificity by comparison with 1-5 other eutherian mammals or 6 other simian primates. More distant comparisons (marsupial, avian, amphibian and fish) failed to identify many of the empirically defined functional noncoding elements. We derived an intuitive relationship between ancient and recent noncoding sequence conservation from whole genome comparative analysis, which explains some of these findings. Lastly, we determined that, in addition to strength of conservation, genomic location and/or density of surrounding conserved elements must also be considered in selecting candidate enhancers for testing at embryonic time points.
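The ranking of conserved elements by P-value can be sketched with the standard Karlin-Altschul formula, in which the expected number of chance segments scoring at least S is E = K·N·exp(−λS) and the P-value is 1 − exp(−E). This is only an illustration of those statistics, not Gumby's implementation: the function names, the values of `lam` and `K`, and the choice of search-space size `N` are assumptions introduced here.

```python
import math
from typing import List, Tuple


def karlin_altschul_pvalue(score: float, search_space: float,
                           lam: float, K: float) -> float:
    """P-value of seeing at least one segment with this score by chance.

    E = K * N * exp(-lambda * S);  P = 1 - exp(-E).
    lam and K depend on the scoring scheme and are assumed inputs here.
    """
    expected = K * search_space * math.exp(-lam * score)
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-expected)


def rank_by_pvalue(elements: List[Tuple[str, float]], search_space: float,
                   lam: float, K: float) -> List[Tuple[str, float]]:
    """Sort (name, score) pairs from most to least significant."""
    scored = [(name, karlin_altschul_pvalue(s, search_space, lam, K))
              for name, s in elements]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical element scores and parameters, for illustration only.
ranked = rank_by_pvalue([("elemA", 20.0), ("elemB", 50.0), ("elemC", 35.0)],
                        search_space=1e6, lam=0.3, K=0.1)
```

Because the P-value depends only on the score and the fixed statistical parameters, elements from close and distant comparisons can be placed on one scale, which is the property the abstract emphasizes.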
The availability of the assembled mouse genome makes possible, for the first time, an alignment and comparison of two large vertebrate genomes. We have investigated alignment strategies, for the subsequent analysis of genome conservation, that are effective for assemblies of differing quality. These strategies were applied to the comparison of the working draft of the human genome with the Mouse Genome Sequencing Consortium assembly, as well as other intermediate mouse assemblies. Our methods are fast and the resulting alignments exhibit a high degree of sensitivity, covering more than 90 percent of known coding exons in the human genome. We have obtained such coverage while preserving specificity. With a view towards the end user, we have developed a suite of tools and websites for automatically aligning, and subsequently browsing and working with, whole-genome comparisons. We describe the use of these tools to identify conserved non-coding regions between the human and mouse genomes, some of which have not been identified by other methods.
Motivation. The power of multi-sequence comparison for biological discovery is well established, and sequence data from a growing list of organisms are becoming available. Thus, a need exists for computational strategies to visually compare multiple aligned sequences to support conservation analysis across various species. To be efficient, these visualization algorithms require the ability to universally handle a wide range of evolutionary distances while taking into account phylogeny. Results. We have developed Phylo-VISTA, an interactive tool for analyzing multiple alignments by visualizing the similarity of DNA sequences among multiple species while considering their phylogenetic relationships. Features include a broad spectrum of resolution parameters for examining the alignment and the ability to easily compare any subtree of sequences within a complete alignment dataset. Phylo-VISTA uses VISTA concepts that have been successfully applied previously to a wide range of comparative genomics data analysis problems. Availability. Phylo-VISTA is an interactive Java applet available for downloading at http://graphics.cs.ucdavis.edu/~nyshah/Phylo-VISTA. It is also available on-line at http://www-gsd.lbl.gov/phylovista and is integrated with the global alignment program LAGAN at http://lagan.stanford.edu. Contact. phylovista@lbl.gov
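The idea of visualizing similarity at adjustable resolution can be illustrated with a sliding-window percent-identity curve over two rows of an alignment. This is a minimal sketch only: the abstract does not specify the similarity measure Phylo-VISTA uses, so percent identity, the function name `window_identity` and its `window` and `step` parameters are assumptions introduced here.

```python
from typing import List


def window_identity(seq_a: str, seq_b: str,
                    window: int = 100, step: int = 1) -> List[float]:
    """Percent identity of two aligned sequences in sliding windows.

    Both inputs are rows of the same alignment (equal length, '-' for gaps).
    A column counts as a match only if the bases agree and are not gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")

    scores = []
    for start in range(0, len(seq_a) - window + 1, step):
        cols = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in cols if a == b and a != "-")
        scores.append(100.0 * matches / window)
    return scores


# Toy alignment rows; a larger window gives a coarser view of the same data.
curve = window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4)
```

Changing `window` and `step` mimics viewing the data at different resolutions. Applying the same function to every pair of sequences in a chosen subtree would be one simple way to respect the phylogenetic grouping.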
We have developed Phylo-VISTA (Shah et al., 2003), an interactive software tool for analyzing multiple alignments by visualizing a similarity measure for DNA sequences of multiple species. The complexity of visual presentation is effectively organized using a framework based upon inter-species phylogenetic relationships. The phylogenetic organization supports rapid, user-guided inter-species comparison. To aid in navigation through large sequence datasets, Phylo-VISTA provides a user with the ability to select and view data at varying resolutions. The combination of multi-resolution data visualization and analysis with the phylogenetic framework for inter-species comparison produces a highly flexible and powerful tool for visual data analysis of multiple sequence alignments.
The identification of enhancers with predicted specificities in vertebrate genomes remains a significant challenge that is hampered by a lack of experimentally validated training sets. In this study, we leveraged extreme evolutionary sequence conservation as a filter to identify putative gene regulatory elements and characterized the in vivo enhancer activity of human-fish conserved and ultraconserved¹ noncoding elements on human chromosome 16 as well as such elements from elsewhere in the genome. We initially tested 165 of these extremely conserved sequences in a transgenic mouse enhancer assay and observed that 48 percent (79/165) functioned reproducibly as tissue-specific enhancers of gene expression at embryonic day 11.5. While driving expression in a broad range of anatomical structures in the embryo, the majority of the 79 enhancers drove expression in various regions of the developing nervous system. Studying a set of DNA elements that specifically drove forebrain expression, we identified DNA signatures specifically enriched in these elements and used these parameters to rank all ~3,400 human-fugu conserved noncoding elements in the human genome. The testing of the top predictions in transgenic mice resulted in a three-fold enrichment for sequences with forebrain enhancer activity. These data dramatically expand the catalogue of in vivo-characterized human gene enhancers and illustrate the future utility of such training sets for a variety of biological applications including decoding the regulatory vocabulary of the human genome.
The genome sequence of a second fruit fly, D. pseudoobscura, presents an opportunity for comparative analysis of a primary model organism, D. melanogaster. The vast majority of Drosophila genes have remained on the same arm, but within each arm gene order has been extensively reshuffled, leading to the identification of approximately 1,300 syntenic blocks. A repetitive sequence is found in the D. pseudoobscura genome at many junctions between adjacent syntenic blocks. Analysis of this novel repetitive element family suggests that recombination between offset elements may have given rise to many paracentric inversions, thereby contributing to the shuffling of gene order in the D. pseudoobscura lineage. Based on sequence similarity and synteny, 10,516 putative orthologs have been identified as a core gene set conserved over 35 My since divergence. Genes expressed in the testes had higher amino acid sequence divergence than the genome-wide average, consistent with the rapid evolution of sex-specific proteins. Cis-regulatory sequences are more conserved than control sequences between the species but the difference is slight, suggesting that the evolution of cis-regulatory elements is flexible. Overall, repeat-mediated chromosomal rearrangement and high co-adaptation of both male genes and cis-regulatory sequences emerge as important themes of genome divergence between these species of Drosophila.
Chromosome 5 is one of the largest human chromosomes yet has one of the lowest gene densities. This is partially explained by numerous gene-poor regions that display a remarkable degree of noncoding and syntenic conservation with non-mammalian vertebrates, suggesting they are functionally constrained. In total, we compiled 177.7 million base pairs of highly accurate finished sequence containing 923 manually curated protein-encoding genes, including the protocadherin and interleukin gene families, and the first complete versions of each of the large chromosome 5-specific internal duplications. These duplications are very recent evolutionary events and play a likely mechanistic role, since deletions of these regions are the cause of debilitating disorders including spinal muscular atrophy (SMA).
We report here the 78,884,754 base pairs of finished human chromosome 16 sequence, representing over 99.9 percent of its euchromatin. Manual annotation revealed 880 protein-coding genes confirmed by 1,637 aligned transcripts, 19 tRNA genes, 341 pseudogenes and 3 RNA pseudogenes. These genes include metallothionein, cadherin and iroquois gene families, as well as the disease genes for polycystic kidney disease and acute myelomonocytic leukemia. Several large-scale structural polymorphisms spanning hundreds of kilobase pairs were identified and result in gene content differences across humans. One of the unique features of chromosome 16 is its high level of segmental duplication, ranked among the highest of the human autosomes. While the segmental duplications are enriched in the relatively gene-poor pericentromere of the p-arm, some are involved in recent gene duplication and conversion events which are likely to have had an impact on the evolution of primates and human disease susceptibility.