Computational Methods for Comparative Genomic and Epigenomic Annotations across Multiple Species
In recent years, genome-wide association studies (GWAS) and large-scale whole-genome sequencing case-control studies have identified a wealth of phenotype-associated and rare genetic variants. Interpreting the biological significance of these variants has been a major challenge, especially since the large majority of them fall within non-protein-coding regions of the genome. Here we present a computational method, ConsHMM, for annotating the genome at single-nucleotide resolution into a set of conservation states learned from the combinatorial and spatial patterns of which species align to and match a reference genome in a multiple-sequence alignment. Conservation states show specific enrichments for orthogonal biological annotations and can be used to interpret genetic variants. We provide a comprehensive resource of conservation state annotations, the ConsHMM atlas, comprising models and annotations for eight different organisms based on several multiple-sequence alignments. At the epigenomic level, modifications such as DNA methylation have emerged as useful biomarkers for several phenotypes, but most of these phenotypes have been studied predominantly in human samples. Leveraging sequence conservation among genomes, we have designed a methylation array that can query DNA methylation in many different mammals and thereby facilitate cross-species epigenetic studies. The array has been produced and used to profile 8730 samples from 145 different mammals. In summary, this work takes a comparative-genomics-based approach to expanding the available genomic and epigenomic annotations of multiple species.