eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Overcoming the Common Challenges in Differential Gene Expression Analysis Studies

Abstract

The ability to analyze gene expression data has had a fundamental impact on the biological sciences and on our understanding of the causes and mechanisms of disease. However, the combination of a small number of replicates and a large number of genes poses a significant statistical challenge: it leads to an undesirable level of misclassified genes when identifying genes with differential expression levels. When multiple gene expression data sets are generated under the same set of experimental conditions, the question arises as to how to combine this information efficiently. Several methods have been suggested in the literature to aggregate ranked data from multiple sources. We introduce a new classifier, underpinned by Bayesian principles, called Peer Reinforced Ranker (PR-Ranker), which uses density estimation to approximate the probability that a gene is differentially expressed given a collection of ranked lists.

When the number of genes and lists is large, our classifier is amenable to theoretical analysis using the theory of large deviations. Under modest technical assumptions, we show that PR-Ranker asymptotically has the smallest loss of any rank aggregation procedure. Moreover, we prove that other, more ad hoc methods, such as Borda, have a strictly higher asymptotic rate of loss.
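For readers unfamiliar with the Borda baseline named above, it is commonly implemented by ordering items by their mean rank across lists. The following is a minimal sketch of that standard procedure, not of PR-Ranker and not code from the dissertation. It assumes every list ranks the same set of genes, best first; the function name borda_aggregate and the gene labels are hypothetical.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists):
    """Aggregate ranked lists by mean rank position (Borda-style).

    Assumption: each list contains the same genes, ordered best first.
    """
    rank_totals = defaultdict(float)
    for ranking in ranked_lists:
        # Position 1 is the top-ranked gene in this list.
        for position, gene in enumerate(ranking, start=1):
            rank_totals[gene] += position
    n_lists = len(ranked_lists)
    # Lower mean rank is better; ties are broken by gene name
    # so that the output is deterministic.
    return sorted(rank_totals, key=lambda g: (rank_totals[g] / n_lists, g))


# Hypothetical example: three lists ranking the same three genes.
lists = [
    ["geneA", "geneB", "geneC"],
    ["geneB", "geneA", "geneC"],
    ["geneA", "geneC", "geneB"],
]
consensus = borda_aggregate(lists)
```

Because every list contributes equally to the mean rank, a noisy or corrupted list carries the same weight as a reliable one in this sketch.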

While the theoretical results are asymptotic, we perform a series of simulation studies demonstrating that our classifier outperforms existing methods on data sets of realistic size for biological data. Furthermore, we show that the outperformance is even greater when the lists exhibit varying levels of noise or when some sources are corrupted. PR-Ranker automatically adapts to varying data quality and efficiently combines the data from different sources. Finally, we apply PR-Ranker to a gene expression data set from a preeclampsia study. The top-ranked genes identified were known to be biologically relevant to preeclampsia, and our method achieved a substantially higher Consistency Index than other rank aggregation procedures.
