Post-transcriptional regulation plays a central role in the flow of information from genotype to phenotype in the cell. Disruptions of post-transcriptional regulatory mechanisms underlie many human diseases. As high-throughput sequencing becomes the standard approach for studying post-transcriptional regulation, large-scale data in the public domain provide an unprecedented resource for understanding the complex networks of gene regulation, while also presenting challenges for the development of computational methods that turn empirical data into biological knowledge. In this dissertation, novel statistical models and computational frameworks were developed to elucidate post-transcriptional gene regulation using high-throughput sequencing data. Using these new tools, we demonstrated that we can robustly characterize molecular signals and their variation across diverse biological states and, more importantly, identify bona fide regulatory events that are inaccessible to conventional analyses.
The first part of the dissertation describes CLIP-seq Analysis of Multi-mapped reads (CLAM), a comprehensive computational pipeline for analyzing data from crosslinking or RNA immunoprecipitation followed by sequencing (CLIP-seq/RIP-seq). Because CLIP-seq and RIP-seq reads are short, they often map to multiple loci; existing computational tools focus on uniquely mapped reads and discard multi-mapped reads. CLAM uses an expectation-maximization algorithm to assign multi-mapped reads and calls peaks by combining uniquely mapped and multi-mapped reads. In datasets with different regulatory features, CLAM recovered a large number of novel RNA regulatory sites that are inaccessible to analyses of uniquely mapped reads alone, providing a useful tool for discovering novel protein-RNA interactions and RNA modification sites from CLIP-seq and RIP-seq data.
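The idea of assigning multi-mapped reads by expectation-maximization can be illustrated with a minimal Python sketch. It is a simplified illustration and not the CLAM implementation: the function name `em_assign_multireads`, the dictionary-based inputs, and the use of per-locus read counts as the abundance estimate are all assumptions made for this example, and each read is assumed to list distinct candidate loci.

```python
from collections import defaultdict


def em_assign_multireads(unique_counts, multi_reads, n_iter=50, tol=1e-6):
    """Fractionally assign multi-mapped reads to loci by EM (illustrative only).

    unique_counts: dict mapping locus -> number of uniquely mapped reads.
    multi_reads:   list of lists; each inner list holds the distinct
                   candidate loci of one multi-mapped read.
    Returns one dict per read, mapping locus -> assignment probability.
    """
    # Initialize by splitting each multi-mapped read evenly over its candidates.
    weights = [{loc: 1.0 / len(cands) for loc in cands} for cands in multi_reads]

    for _ in range(n_iter):
        # M-step: abundance of a locus = unique reads + fractional multi-reads.
        abundance = defaultdict(float)
        for loc, count in unique_counts.items():
            abundance[loc] += count
        for w in weights:
            for loc, p in w.items():
                abundance[loc] += p

        # E-step: reassign each read in proportion to candidate abundance.
        new_weights = []
        max_change = 0.0
        for cands, old in zip(multi_reads, weights):
            total = sum(abundance[loc] for loc in cands)
            new = {loc: abundance[loc] / total for loc in cands}
            max_change = max(max_change,
                             max(abs(new[loc] - old[loc]) for loc in cands))
            new_weights.append(new)

        weights = new_weights
        if max_change < tol:  # stop once the assignments have stabilized
            break

    return weights


# Hypothetical toy input: locus "A" has far more unique support than "B",
# so the shared read is assigned mostly to "A".
assignments = em_assign_multireads({"A": 9, "B": 1}, [["A", "B"]])
```

In this toy input the shared read is pulled toward the locus with stronger unique-read support, which is the behavior that lets multi-mapped reads contribute to peak calling instead of being discarded.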
The second part of the dissertation presents Deep-learning Augmented RNA-seq analysis of Transcript Splicing (DARTS), a novel computational framework that integrates deep learning-based predictions with empirical RNA-seq datasets to infer differential alternative splicing between biological conditions. A major limitation of RNA sequencing (RNA-seq) analysis of alternative splicing is its reliance on high sequencing coverage. DARTS employs a deep neural network (DNN) that predicts differential alternative splicing from cis RNA sequence features and trans RNA-binding protein levels. The DARTS DNN, trained on public RNA-seq data, displays high prediction accuracy and generalizability. Incorporating the DARTS DNN prediction as an informative prior significantly improves the inference of differential alternative splicing. DARTS leverages large-scale public RNA-seq data to provide a knowledge base of splicing regulation via deep learning, thereby helping researchers better characterize alternative splicing using RNA-seq datasets, even those with modest coverage.
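How a predicted prior can be combined with RNA-seq read counts may be easier to see in a small Python sketch. This is a deliberately simplified stand-in, not the DARTS statistical model: it assumes binomial inclusion and skipping read counts in two conditions, uniform Beta(1, 1) priors on the exon inclusion level, and a point null hypothesis of a shared inclusion level. The function names and the `prior_prob` argument (standing in for a DNN-predicted probability of differential splicing) are hypothetical.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_bayes_factor(inc1, skp1, inc2, skp2):
    """Log Bayes factor for H1 (different inclusion levels) over H0 (shared level).

    Counts are inclusion/skipping reads in conditions 1 and 2. Binomial
    coefficients cancel in the ratio, so only Beta functions remain.
    """
    log_m1 = log_beta(inc1 + 1, skp1 + 1) + log_beta(inc2 + 1, skp2 + 1)
    log_m0 = log_beta(inc1 + inc2 + 1, skp1 + skp2 + 1)
    return log_m1 - log_m0


def posterior_differential(prior_prob, inc1, skp1, inc2, skp2):
    """Posterior probability of differential splicing given a prior in (0, 1)."""
    log_prior_odds = math.log(prior_prob) - math.log1p(-prior_prob)
    log_odds = log_prior_odds + log_bayes_factor(inc1, skp1, inc2, skp2)
    # Numerically stable logistic transform of the posterior log odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)


# Hypothetical low-coverage event: the same counts give different posteriors
# depending on how strongly the prior favors differential splicing.
low_prior = posterior_differential(0.2, 8, 2, 3, 7)
high_prior = posterior_differential(0.8, 8, 2, 3, 7)
```

With no reads at all the posterior equals the prior, and as coverage grows the read counts dominate. Under this simplified model, that is why an informative prior matters most for events with modest coverage.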