Compressing and Querying the Human Genome
- Author(s): Kozanitis, Christos A.;
- et al.
With high-throughput DNA sequencing costs dropping below $1000 per human genome, data storage, retrieval, and analysis have become the major bottlenecks in biological studies. To address the large-data challenges in genomics, this thesis advocates: 1) highly efficient read-level compression of the data, achieved through reference-based compression by a tool called SLIMGENE, and 2) a clean separation between evidence collection and inference in variant calling, achieved through our Genome Query Language (GQL), which allows for the rapid collection of the evidence needed for calling variants. The first contribution, SLIMGENE, introduces a set of domain-specific lossless compression schemes that achieve over 40x compression of the ASCII representation of short reads, outperforming bzip2 by over 6x. Including quality values, we show 5x compression using less running time than bzip2. In addition, given the discrepancy between the compression factors obtained with and without quality values, we initiate the study of lossy transformations of the quality values. Specifically, we show that a lossy quality value quantization results in 14x compression but has minimal impact on downstream applications that use quality values, such as SNP calling. The second contribution, GQL, introduces a novel framework for querying large genomic datasets. We provide a number of cases to showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of code and search large datasets (100GB) in only a few minutes on a cheap desktop computer. We show that GQL is faster and more concise than writing equivalent queries in existing frameworks such as GATK. We show that existing callers can be sped up by an order of magnitude by using GQL to retrieve evidence. We also show how GQL output can be visualized using the UCSC browser.
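To make the idea of reference-based read compression concrete, the following is a minimal sketch under simplifying assumptions, not SLIMGENE's actual encoding. It assumes each read is already aligned to the reference at a known position and differs from it only by substitutions (no insertions or deletions), and it omits the entropy-coding stage a real tool would apply. The function names `encode_read` and `decode_read` are hypothetical.

```python
def encode_read(reference, pos, read):
    """Encode a read aligned at `pos` as (position, length, mismatches).

    Only the bases that differ from the reference are stored, as
    (offset within read, base) pairs. Substitutions only: indels are
    not handled in this sketch.
    """
    window = reference[pos:pos + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    mismatches = [
        (offset, read_base)
        for offset, (ref_base, read_base) in enumerate(zip(window, read))
        if ref_base != read_base
    ]
    return pos, len(read), mismatches


def decode_read(reference, record):
    """Rebuild the original read from the reference and an encoded record."""
    pos, length, mismatches = record
    bases = list(reference[pos:pos + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


# Toy data, invented for illustration only.
reference = "ACGTACGTAC"
read = "GTACCT"  # aligns at position 2 with one substitution
record = encode_read(reference, 2, read)
restored = decode_read(reference, record)
```

Because most short reads match the reference almost exactly, a record of this kind is typically much smaller than the read's ASCII text, and decoding restores the read exactly, so the scheme is lossless for the bases.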