UC Santa Cruz
Enabling comparative genomics at the scale of hundreds of species
- Author(s): Armstrong, Joel
- Advisor(s): Haussler, David
- et al.
Comparing related (homologous) subsequences between genomes from different species gives insight into their function. This information is captured in “genome alignments”, which are essential for almost all comparative genomics analyses. However, most existing methods for creating a genome alignment either suffer from reference bias (where only one genome is fully aligned to all others) or ignore duplication events. Although the Cactus genome aligner avoided these restrictions, it could not align more than a few genomes without becoming cost-prohibitive and losing accuracy. I developed and refined a “progressive alignment” extension to Cactus that allows it to produce a full alignment in time linear in the number of input genomes while maintaining similar, or often improved, quality. This new method allows Cactus to align hundreds of large vertebrate genomes, enabling comparative genomics at an unprecedented scale. During its development, I used Cactus as an essential component of several successful comparative genomics projects. Working closely with the 200 Mammals and Bird 10K projects, I used Cactus to create an alignment of over 600 bird and mammal genomes, which is by far the largest genome alignment ever created. Finally, I used this alignment to annotate mammalian and avian evolutionary constraint at the highest possible resolution, using the uniquely large number of taxa to examine weak effects of purifying selection.