Applications in High-throughput Sequencing Technologies: From Compression Algorithms and Data Warehousing to Understanding Gene Regulation and Diseases at Scale
- Author(s): Rigor, Paul
- Advisor(s): Baldi, Pierre F
- et al.
High-throughput sequencing (HTS) technologies play important roles in the life sciences by allowing the rapid parallel sequencing of very large numbers of relatively short nucleotide sequences, in applications ranging from genome sequencing and resequencing to digital microarrays and ChIP-Seq experiments. As experiments scale up, HTS technologies create new bioinformatics challenges for the storage, distribution, and downstream analyses of HTS data. In this thesis, we address the challenges and opportunities in the storage and downstream analyses of data derived from HTS technologies.
To address the growing amount of HTS data, we develop data structures and compression algorithms for HTS data storage. A processing stage maps short sequences to a reference genome or a large table of sequences. The integers representing the absolute or relative addresses of the short sequences, their lengths, and the substitutions they may contain are then compressed and stored using various entropy coding algorithms, including both old and new fixed codes (e.g., Golomb, Elias Gamma, MOV) and variable codes (e.g., Huffman). The general methodology is illustrated and applied to several HTS data sets. Results show that the information contained in HTS files can be compressed by a factor of 10 or more, depending on the statistical properties of the data sets and various other choices and constraints. Our algorithms fare well against general-purpose compression programs such as gzip, bzip2 and 7zip; timing results show that our algorithms are consistently faster than the best general-purpose compression programs.
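The idea of coding mapped addresses as small integers can be sketched as follows. This is only an illustrative sketch, not the implementation from the thesis: it assumes the reads have already been mapped to a reference, sorts their start positions, and encodes the gaps between consecutive positions (relative addresses) with the Elias gamma code. The function names and the choice to represent the bit stream as a string of "0" and "1" characters are assumptions made for readability.

```python
from typing import List


def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of bits."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]  # binary representation without the '0b' prefix
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode_stream(bits: str) -> List[int]:
    """Decode a concatenation of Elias gamma codewords."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":  # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def compress_positions(positions: List[int]) -> str:
    """Sort mapped positions and gamma-code the gaps between them."""
    out = []
    prev = 0
    for p in sorted(positions):
        gap = p - prev  # relative address; 0 when two reads share a start
        out.append(elias_gamma_encode(gap + 1))  # shift by 1: gamma needs >= 1
        prev = p
    return "".join(out)


def decompress_positions(bits: str) -> List[int]:
    """Invert compress_positions, returning the sorted positions."""
    positions = []
    total = 0
    for value in elias_gamma_decode_stream(bits):
        total += value - 1  # undo the shift, then accumulate the gaps
        positions.append(total)
    return positions


if __name__ == "__main__":
    reads = [1042, 1000, 1003, 1003, 1010]  # hypothetical mapped start positions
    encoded = compress_positions(reads)
    print(len(encoded), "bits:", encoded)
    print(decompress_positions(encoded))
```

Because gaps between sorted positions tend to be small when coverage is dense, most codewords are short; a Golomb code or a Huffman code fitted to the observed gap distribution could be substituted for the gamma code in the same framework.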
Achieving a comprehensive map of all the regulatory elements encoded in the human genome is a fundamental challenge of biomedical research. So far, only a small fraction of the regulatory elements have been characterized, and there is great interest in applying computational techniques to systematically discover these elements. Such efforts, however, have been significantly hindered by the overwhelming size of non-coding DNA regions and the statistical variability and complex spatial organizations of mammalian regulatory elements. The MotifMap system uses databases of transcription factor binding motifs, refined genome alignments, and a comparative genomic statistical approach to provide comprehensive maps of candidate regulatory elements encoded in the genomes of model species.
MotifMap and its integration with other data provide a foundation for analyzing gene regulation on a genome-wide scale, and for automatically generating regulatory pathways and hypotheses. The power of this approach is demonstrated and discussed using the P53 apoptotic pathway and the Gli hedgehog pathways as examples.
A further application of MotifMap is its integration into CircadiOmics (developed in the Baldi lab), which aims to decode the transcriptional machinery that is under circadian control. Here we use MotifMap to understand and delineate the roles of a subset of the sirtuin family of deacetylases in regulating circadian rhythms. Circadian rhythms are intimately linked to cellular metabolism. Specifically, the NAD+-dependent deacetylase SIRT1, the founding member of the sirtuin family, contributes to clock function. Whereas SIRT1 exhibits diversity in deacetylation targets and subcellular localization, SIRT6 is the only constitutively chromatin-associated sirtuin and is prominently present at transcriptionally active genomic loci. Comparison of the hepatic circadian transcriptomes reveals that SIRT6 and SIRT1 separately control transcriptional specificity and therefore define distinctly partitioned classes of circadian genes. SIRT6 interacts with CLOCK:BMAL1 and, unlike SIRT1, governs their chromatin recruitment to circadian gene promoters. Moreover, SIRT6 controls circadian chromatin recruitment of SREBP-1, resulting in the cyclic regulation of genes implicated in fatty acid and cholesterol metabolism. This mechanism parallels a phenotypic disruption in fatty acid metabolism in SIRT6 null mice as revealed by circadian metabolome analyses. Thus, genomic partitioning by two independent sirtuins contributes to differential control of circadian metabolism.