eScholarship
Open Access Publications from the University of California

UC Davis

UC Davis Electronic Theses and Dissertations

Characterization of primate structural variation using diverse sequencing technologies

Abstract

Elucidating the genetic changes underlying the evolution of human traits remains an unfinished puzzle. Structural variants (SVs) account for more genetic differences than single-nucleotide polymorphisms between humans and our closest living relatives, chimpanzees, and are a hallmark of great ape evolution. The genomes of great apes are enriched in large interspersed segmental duplications (SDs), defined as duplications larger than 1 kbp with over 90% sequence identity, that sensitize the genome to further genomic rearrangements, including copy-number variation, via non-allelic homologous recombination. Despite their relevance, the identification and characterization of these SVs have been hindered by short read lengths: short reads lack enough sequence context to reveal breakpoints and cannot be mapped unambiguously to highly identical duplicates. Long-read sequencing technologies overcome these limitations by providing reads thousands of bases long, but the availability of population cohorts remains limited.

This thesis studies primate SVs and SDs characterized using diverse sequencing technologies and assesses their representation in reference genomes, their variation across modern populations, their putative molecular impacts, and their roles in evolution and adaptation. We found novel SVs, including 88 deletions and 36 inversions, in two chimpanzee individuals characterized with nanopore sequencing and optical mapping. Deletion and inversion breakpoints were depleted within topologically associated domains but enriched in genes differentially expressed between humans and chimpanzees. Focusing on human SDs, we identified eight Mbp of erroneously collapsed duplications in the human reference genome, affecting 48 protein-coding and ten medically relevant genes, that are corrected in the first complete sequence of a human genome, T2T-CHM13.
Leveraging this new reference, we identified 417 genes embedded in SDs with over 98% sequence identity (SD-98) that are nearly fixed in copy number (CN) in modern humans (1000 Genomes Project; 1KGP), 205 genes showing stratification between diverse modern populations (VST > 95th percentile), and 22 protein-encoding genes showing consistent Tajima’s D outlier values across all humans examined. Our approach highlighted potentially relevant human gene duplications, which are priority candidates for functional studies. Finally, we provide a compendium of tools and practices that we recommend computational biologists adopt to increase reproducibility in the field.
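
As an illustration of the population-stratification statistic named above, the following is a minimal sketch of a per-gene VST calculation. It assumes the commonly used definition VST = (VT - VS) / VT, where VT is the variance in copy number across all individuals and VS is the within-population variance averaged with weights proportional to population size. The function name, the input layout, and the choice of population variance are assumptions made for this example and are not taken from the thesis.

```python
from statistics import pvariance


def vst(copy_numbers_by_pop):
    """Return VST = (VT - VS) / VT for one gene.

    copy_numbers_by_pop maps a population label to a list of
    per-individual copy-number estimates (hypothetical input layout).
    """
    all_values = [cn for values in copy_numbers_by_pop.values() for cn in values]
    v_total = pvariance(all_values)  # VT: variance across all individuals
    if v_total == 0:
        # Copy number is identical in everyone: no stratification to measure.
        return 0.0
    n_total = len(all_values)
    # VS: within-population variance, weighted by population size.
    v_within = sum(
        len(values) * pvariance(values) for values in copy_numbers_by_pop.values()
    ) / n_total
    return (v_total - v_within) / v_total
```

Under this definition, a gene whose copy number differs completely between populations and not at all within them scores 1, and a gene with the same copy-number distribution in every population scores 0. A percentile cutoff such as the 95th would then be applied to the distribution of per-gene scores.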
