Algorithms for Determining Differentially Expressed Genes and Chromosome Structures From High-Throughput Sequencing Data
- Author(s): Yang, Yi-Wen
- Advisor(s): Jiang, Tao
- et al.
Next-generation sequencing (NGS) technologies can sequence DNA or RNA molecules at unprecedented speed and with high accuracy. Recently, NGS technologies have been applied in a variety of contexts, e.g., whole genome sequencing, transcript expression profiling, chromatin immunoprecipitation sequencing, and small RNA sequencing, to accelerate genomic research. NGS datasets are usually so large that data analysis in these applications relies heavily on efficient computational methods. Because of this critical demand for high-performance computational algorithms, my research over the past few years has focused on designing novel algorithms to address challenges in NGS data analysis. This dissertation presents algorithmic solutions to three crucial problems in NGS data analysis: two arising from differential expression analysis using high-throughput mRNA sequencing (RNA-Seq) and one from chromosome conformation capture using high-throughput DNA sequencing (Hi-C).
(1) In differential expression analysis of RNA-Seq data, most existing computational methods are more likely to detect long or highly expressed genes. This bias against short or lowly expressed genes may distort downstream data analysis at the systems biology level. To improve sensitivity to short or lowly expressed genes, we designed a new computational tool, called MRFSeq, that combines gene coexpression and RNA-Seq data. We carefully assessed the performance of MRFSeq on simulated and real benchmark datasets, and the experimental results showed that MRFSeq called differentially expressed genes more accurately than the other existing methods, significantly alleviating the distortion caused by the bias against short and lowly expressed genes.
(2) Most existing differential expression analysis tools are developed for comparing RNA-Seq samples between known biological conditions. However, differential expression analysis is also important in other biological research where predefined conditions of samples are not available a priori. For example, differentially expressed transcripts can be used as biomarkers to classify a cohort of cancer samples into subtypes so that better diagnosis and therapy methods can be developed for each subtype. We therefore proposed the first computational method, called SDEAP, to identify differentially expressed genes and their alternative splicing events without requiring predefined conditions. SDEAP provided accurate predictions in our experiments on simulated and real datasets. The utility of SDEAP was further demonstrated by classifying subtypes of breast cancer, cell types, and cell cycle phases of mouse cells.
(3) Chromosome structures in the nucleus play important roles in the biological processes of cells. The Hi-C technology allows researchers to reconstruct the three-dimensional structures of chromosomes in the nucleus on a genome-wide scale and thus serves as a vital component in studies of chromosome structures. Systematic biases may be introduced into Hi-C data during the experimental steps of Hi-C, so eliminating these biases is essential to all applications that use Hi-C data. We developed an improved bias reduction algorithm, called GDNorm, which uses a Poisson regression model that explicitly formulates the causal relationship among Hi-C data, systematic biases, and spatial distances in chromosome structures. Our experimental results showed that GDNorm was able to remove the biases from Hi-C data such that the corrected Hi-C data could lead to accurate reconstruction of chromosome structures.
In the near future, with the rapid accumulation of NGS data, we expect these efficient computational methods to become valuable tools for discovering novel biological knowledge and to benefit numerous genomic studies.