Identifying Population Histories, Adaptive Genes, and Genetic Duplication from Population-Scale Next Generation Sequencing
The arrival of next-generation sequencing (NGS) technologies in the mid-2000s opened the floodgates to a massive amount of genetic data. Not only does NGS permit relatively easy access to the genome of nearly any species, it also enables sequencing of the highly degraded DNA characteristic of ancient samples and museum specimens. The representation of genomic data across the tree of life has been spreading rapidly over the past decade owing to the emergence of numerous NGS-based methods for inexpensively sequencing entire genomes and reduced representations of genomes. However, without any high-quality preexisting genomic resources, species with large, highly paralogous genomes pose a major obstacle for NGS because accurately assembling short-read data becomes extremely challenging. Furthermore, reads derived from paralogs will likely map to the same locus, which can inflate apparent levels of diversity, obscuring accurate population genetic inference and scans for adaptive loci. These problems can also affect population genetic studies using historic DNA from museum specimens, which often face the additional challenges of high sampling variability across space and time, and DNA degradation. The research presented in this thesis aims to overcome these challenges using a combination of pioneering experimental and computational approaches. First, I present a method for identifying paralogy from NGS data, ngsParalog, that jointly leverages information from read proportions within and across individuals and sequencing coverage in a probabilistic framework. Combining information in this manner achieves greater power for identifying paralogy at lower false-positive rates than using paralogy signatures separately, as other current methods do. It is also widely applicable to both single and paired-end data ranging from low to high coverage. I use ngsParalog to detect paralogy in humans, chipmunks, and stick insects, representing a broad range of sequencing approaches.
In the next chapter of the thesis, I, along with colleagues, demonstrate how transcriptome-enabled exon capture applied to populations of century-old and modern Tamias chipmunks comprising multiple species, in conjunction with a new Approximate Bayesian Computation approach for fitting joint site frequency spectra between time periods, can be used to infer recent population histories. Knowing these population histories allowed us to disentangle the genetic signature of demographic changes from that of selection, which led to identifying a gene that may be helping chipmunk populations rapidly adapt to climate-induced environmental change. In the fourth chapter, I, along with other colleagues, employed the same exon capture technique and ngsParalog to overcome the challenge of mapping color and pattern genes in the ~12 gigabase, highly paralogous genome of the mimic poison frog, Ranitomeya imitator. I applied statistical divergence and admixture mapping methods to different R. imitator color morphs and identified seven out of 13,086 examined genes that showed compelling evidence of influencing color and/or pattern in R. imitator. These candidate genes will likely be valuable for gaining insight into the R. imitator mimetic radiation. The combination of methods presented in this thesis extends the utility of NGS to taxa with genomes that previously precluded gene mapping and provides an analytical framework for identifying demographies and adaptive genes from museum collections.