Skip to main content
eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations bannerUC Berkeley

Statistical and Computational Methods for Analyzing High-Throughput Genomic Data

Abstract

In the burgeoning field of genomics, high-throughput technologies (e.g. microarrays, next-generation sequencing and label-free mass spectrometry) have enabled biologists to perform global analysis on thousands of genes, mRNAs and proteins simultaneously. Extracting useful information from enormous amounts of high-throughput genomic data is an increasingly pressing challenge to statistical and computational science. In this thesis, I will address three problems in which statistical and computational methods were used to analyze high-throughput genomic data to answer important biological questions.

The first part of this thesis focuses on addressing an important question in genomics: how to identify and quantify mRNA products of gene transcription (i.e., isoforms) from next-generation mRNA sequencing (RNA-Seq) data? We developed a statistical method called Sparse Linear modeling of RNA-Seq data for Isoform Discovery and abundance Estimation (SLIDE) that employs probabilistic modeling and L1 sparse estimation to answer this ques- tion. SLIDE takes exon boundaries and RNA-Seq data as input to discern the set of mRNA isoforms that are most likely to present in an RNA-Seq sample. It is based on a linear model with a design matrix that models the sampling probability of RNA-Seq reads from different mRNA isoforms. To tackle the model unidentifiability issue, SLIDE uses a modified Lasso procedure for parameter estimation. Compared with existing deterministic isoform assembly algorithms, SLIDE considers the stochastic aspects of RNA-Seq reads in exons from different isoforms and thus has increased power in detecting more novel isoforms. Another advantage of SLIDE is its flexibility of incorporating other transcriptomic data into its model to further increase isoform discovery accuracy. SLIDE can also work downstream of other RNA-Seq assembly algorithms to integrate newly discovered genes and exons. Besides isoform discovery, SLIDE sequentially uses the same linear model to estimate the abundance of discovered isoforms. Simulation and real data studies show that SLIDE performs as well as or better than major competitors in both isoform discovery and abundance estimation.

The second part of this thesis demonstrates the power of simple statistical analysis in correcting biases of system-wide protein abundance estimates and in understanding the rela- tionship between gene transcription and protein abundances. We found that proteome-wide surveys have significantly underestimated protein abundances, which differ greatly from previously published individual measurements. We corrected proteome-wide protein abundance estimates by using individual measurements of 61 housekeeping proteins, and then found that our corrected protein abundance estimates show a higher correlation and a stronger linear relationship with mRNA abundances than do the uncorrected protein data. To estimate the degree to which mRNA expression levels determine protein levels, it is critical to measure the error in protein and mRNA abundance data and to consider all genes, not only those whose protein expression is readily detected. This is a fact that previous proteome-widely surveys ignored. We took two independent approaches to re-estimate the percentage that mRNA levels explain in the variance of protein abundances. While the percentages estimated from the two approaches vary on different sets of genes, all suggest that previous protein-wide surveys have significantly underestimated the importance of transcription.

In the third and final part, I will introduce a modENCODE (the Model Organism ENCyclopedia Of DNA Elements) project in which we compared developmental stages, tis- sues and cells (or cell lines) of Drosophila melanogaster and Caenorhabditis elegans, two well-studied model organisms in developmental biology. To understand the similarity of gene expression patterns throughout their development time courses is an interesting and important question in comparative genomics and evolutionary biology. The availability of modENCODE RNA-Seq data for different developmental stages, tissues and cells of the two organisms enables a transcriptome-wide comparison study to address this question. We undertook a comparison of their developmental time courses and tissues/cells, seeking com- monalities in orthologous gene expression. Our approach centers on using stage/tissue/cell- associated orthologous genes to link the two organisms. For every stage/tissue/cell in each organism, its associated genes are selected as the genes capturing specific transcriptional activities: genes highly expressed in that stage/tissue/cell but lowly expressed in a few other stages/tissues/cells. We aligned a pair of D. melanogaster and C. elegans stages/tissues/cells by a hypergeometric test, where the test statistic is the number of orthologous gene pairs associated with both stages/tissues/cells. The test is against the null hypothesis that the two stages/tissues/cells have independent sets of associated genes. We first carried out the alignment approach on pairs of stages/tissues/cells within D. melanogaster and C. elegans respectively, and the alignment results are consistent with previous findings, supporting the validity of this approach. When comparing fly with worm, we unexpectedly observed two parallel collinear alignment patterns between their developmental timecourses and several interesting alignments between their tissues and cells. Our results are the first findings regarding a comprehensive comparison between D. melanogaster and C. elegans time courses, tissues and cells.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View