eScholarship
Open Access Publications from the University of California

UC Santa Cruz

UC Santa Cruz Electronic Theses and Dissertations

Infrastructure for Scalable Analysis of Genomic Variation

Abstract

The scale of the problems which human genomics is asked to solve necessitates that the field develop an ability to integrate and synthesize information across the entire human population. The abstraction of a single-copy human reference genome assembly, and the linear coordinate space that it induces, are more of a hindrance than a help at these scales. They can only ever represent one sample at any given place, and they make combining information about human variation across multiple studies and modalities difficult. To rectify these problems, I propose the construction and adoption of a graph-based alternative to the human reference genome assembly: a Human Genome Variation Map. I present here four research projects. The first is a theory of mapping to references that is extensible to graphs. The second describes a novel data structure for embedding individual haplotype sequences into a graph reference. The third surveys graph construction techniques to discover methods that produce graphs yielding read mapping and variant calling results superior to those obtained with linear, variation-free references. The fourth extends these improvement results to chromosome-scale graphs constructed from multiple sources and modalities of variation data. These four projects describe a research program aimed towards the eventual release of an official Human Genome Variation Map build, providing a piece of vital infrastructure for the analysis of human genomic variation at population scale.
