Unravelling the structural complexity of the human genome using linked-read sequencing and optical mapping technologies
- Author(s): Wong, Hang Yin Karen
- Advisor(s): Risch, Neil
- et al.
Elucidating the full spectrum of genetic variation across the human population is a fundamental pursuit in scientific research, as it underlies the complex interplay between genotype and phenotype. However, most resequencing studies performed to date rely on short-read technologies, and the lack of long-range sequence information precludes comprehensive analysis of structural variations (SVs). While numerous SV algorithms can pinpoint deletion breakpoints with high sensitivity, it is much more challenging to detect other types of SVs, as they cannot be directly inferred through an alignment-based approach. In this dissertation, we leveraged state-of-the-art technologies that allow for long-range genome sequencing and mapping to comprehensively evaluate SVs in the context of population genetics and genomic medicine.
To improve the current representation of the human reference genome, we utilized 10x Genomics (10xG) whole genome linked-read sequencing to generate de novo assemblies of 328 genomes from around the world. We breakpoint-resolved 18 Mb of genomic sequences missing from the reference genome, known as Non-reference Unique Insertions (NUIs), and linearly integrated them into the GRCh38 primary chromosomal assemblies so that these NUIs can be annotated based on the local genomic context. We demonstrated that many of these NUIs can be found in the human transcriptome and hence are likely to have functional significance. Our proof-of-concept reference representation will allow researchers to identify biologically relevant polymorphisms beyond what is currently detected, thus enhancing the interpretability of all existing and future short-read whole genome sequencing datasets.
Furthermore, we applied linked-read sequencing, either alone or in conjunction with Bionano optical mapping, in two precision medicine projects to identify disease-causal variants that had previously evaded detection. In the first project, whole genome linked-read sequencing was performed on two families with homozygous familial hypercholesterolemia to determine the underlying genetic etiology of the disease. In the second project, we applied both technologies to 50 undiagnosed children with suspected genetic diseases, along with their parents. Our automated informatics pipeline identified 16 clinical diagnoses, of which 25% were attributed to cryptic SVs. Our results substantiated the use of long-range sequencing and mapping in patients with genetic diseases, and their applications in genomic medicine provide a path forward for bringing tremendous precision to clinical diagnosis, thereby fulfilling the promise of individualized medicine.