Addressing the Omics Data Explosion: a Comprehensive Reference Genome Representation and the Democratization of Comparative Genomics and Immunogenomics
- Author(s): Nguyen, Ngan Kim
- Advisor(s): Haussler, David
- et al.
Advances in technology have resulted in an explosion of data, the volume of which continues to increase at an exponential rate. The accumulating wealth of data is enabling numerous new research possibilities and is transforming the world profoundly. In genomics, new genomes are regularly being sequenced, and a growing number of individual genomes are becoming available for many species. As complete genomic information becomes the norm, the need for a reference genome that better represents a particular species population intensifies: it becomes important to use the newly available sequences to improve current references and to ensure better quality for future assemblies and experiments. Additionally, the proliferation of data has necessitated the decentralization of computational resources, together with the empowerment of users through a do-it-yourself system in which they create their own assemblies, alignments, visualizations and analyses. As the amount of data accelerates, it is neither possible nor desirable to maintain an infrastructure model in which only a few specialized institutions handle most, if not all, of the data and analyses.
Joining many other ongoing efforts, the work in this dissertation attempts to address some of these rising demands. First, I describe the problem of constructing a pan-genome reference for a population and demonstrate, using both simulated and real data, that the resulting pan-genome reference is more representative of the population than any individual genome. Second, I describe a comparative genomics framework that allows for easy generation of collections of web-accessible UCSC genome browsers interrelated by an alignment. The pipeline, named the comparative assembly hub (CAH) pipeline, is intended to democratize UCSC comparative genomics resources and facilitate public sharing via the internet. As a demonstration, I create comparative assembly hubs for 66 Escherichia coli/Shigella genomes and highlight comparative analyses of their pan-genomic, core genomic and phylogenetic relationships. Last, I report on comprehensive assessments of the T cell receptor (TCR) repertoires in the autoimmune disease ankylosing spondylitis and show example comparative analyses for finding evidence of antigen selection and identifying potential disease-associated clones. In addition, I describe an open-source software package for profiling and comparing TCR sequencing data, called the “Adaptive IMmunoSequencing ToolKit”, or the aimseqtk package. The aimseqtk package comprises four main components that address common analyses of this type of data: clone tracking, repertoire profiling, public clone identification and publication mining.