eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Statistical Problems in DNA Microarray Data Analysis

Abstract

DNA microarrays are powerful tools for functional genomics studies. Each array contains thousands of microscopic spots of DNA oligonucleotides with specific sequences, which can hybridize with their complementary DNA sequences. Thus each microarray experiment consists of parallel assays of thousands of genomic fragments. This thesis concerns some statistical issues in the analysis of DNA microarray data.

One common use of DNA microarrays is to monitor the dynamic levels of gene expression in response to a stimulus. This is often achieved through a time course experiment, in which RNA samples are extracted at various time points after exposing the organism to the stimulus. A particularly interesting type of time course experiment involves replicated series of longitudinal samples. In 2006, Tai and Speed proposed a multivariate empirical Bayes model for analyzing this type of data. The MB-statistic derived from this model was shown to be useful for ranking the genes according to changes in their temporal expression profiles. In the first part of this thesis, we propose an empirical Bayes false discovery rate (FDR)-controlling procedure for multiple hypothesis testing using the MB-statistic. A null distribution is obtained using the parametric bootstrap. Critical values are determined according to the empirical Bayes FDR procedure. This method was compared, through simulations, to the frequentist FDR procedure, which requires a theoretical null distribution for calculating the nominal p-values. Although our method is slightly anti-conservative, it is more robust to the variability in the estimates of the hyperparameters when the degree of moderation is small.
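As a rough illustration of how a simulated null distribution can be turned into a critical value, the sketch below uses a generic tail-area empirical Bayes FDR estimate, pi0 * P0(T >= c) / P(T >= c). It is not the thesis's exact procedure: the function names, the `pi0` parameter, and the use of plain lists of statistics (standing in for MB-statistics and their parametric bootstrap null draws) are assumptions made for illustration.

```python
from bisect import bisect_left


def tail_fraction(sorted_values, c):
    """Fraction of values >= c in an ascending-sorted list."""
    n = len(sorted_values)
    return (n - bisect_left(sorted_values, c)) / n


def eb_fdr_threshold(observed, null_draws, alpha=0.05, pi0=1.0):
    """Smallest cutoff c among the observed statistics whose estimated FDR,
    pi0 * P0(T >= c) / P(T >= c), is at most alpha.

    observed   : test statistics, one per gene (larger = more evidence)
    null_draws : statistics simulated under the null (e.g. by a bootstrap)
    pi0        : assumed proportion of true nulls (1.0 is the conservative choice)
    Returns None when no cutoff reaches the target level.
    """
    obs = sorted(observed)
    null = sorted(null_draws)
    for c in obs:  # ascending order, so the first hit is the smallest cutoff
        p_obs = tail_fraction(obs, c)    # never zero: c itself is counted
        p_null = tail_fraction(null, c)  # estimated null tail probability
        if pi0 * p_null / p_obs <= alpha:
            return c
    return None
```

Genes whose statistic is at or above the returned cutoff would be declared significant. Because the null tail is estimated from simulated draws rather than a theoretical distribution, no nominal p-values are needed in this sketch.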

Another common use of DNA microarrays is to detect genomic locations that are associated with DNA-binding proteins. This is often achieved through ChIP-chip experiments that combine chromatin immunoprecipitation with the microarray technology. Traditional DNA microarrays designed for gene expression studies contain only a few probes for each gene. A special type of DNA microarray, called a tiling array, is often used in ChIP-chip experiments. Tiling arrays typically contain probes that are placed densely along the chromosomes to cover either the entire genome or contigs of the genome. Two challenges in the analysis of ChIP-chip tiling array data have not been addressed satisfactorily in the literature. When large scale genomic studies are carried out over a long period of time, tiling arrays with different probe designs are often used for practical reasons. The first challenge is the integration of replicate experiments performed using different tiling array designs. When the biological process of interest involves a large protein complex, the investigators often perform ChIP-chip experiments on each component DNA-binding protein individually. DNA targets that are shared by the individual proteins are thought to be the localization sites of the protein complex. The second challenge is the joint analysis of multiple DNA-binding proteins, aimed at identifying their shared targets. In the second part of this thesis, we propose a nonhomogeneous hidden Markov model (HMM) for addressing these two challenges. The nonhomogeneous time axis represents the genomic positions of the probes. The hidden states represent the binding statuses of the proteins. The state-conditional emission distributions of the tiling array data are protein-specific and design-specific. We derived a modified Baum-Welch algorithm for fitting the model parameters.
We also developed a procedure that converts the probe level summaries into peaks, which represent the putative binding sites, based on both signal strength and peak shape. To compare our method with existing methods, we curated a set of positive and negative genomic regions from a C. elegans dataset, and performed receiver operating characteristic (ROC) analyses. When applied to each experiment separately, our method performs similarly to the three best existing methods. When applied to the combined data set, which consists of tiling arrays with different probe designs, our method shows a drastic improvement in performance. A generalization of the nonhomogeneous HMM enables the joint analysis of the ChIP-chip data of multiple proteins. We present an application of this method to identify the shared localization sites of two DNA-binding proteins, under two different conditions.
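To make the idea of a nonhomogeneous HMM over genomic positions concrete, here is a minimal sketch for a single protein and a single array design. The abstract does not give the model's parametrization, so everything specific here is an assumption: the two states (unbound, bound), the transition form that relaxes toward a stationary distribution as the gap between adjacent probes grows, the Gaussian emissions, and all parameter values, which are fixed rather than fitted by a Baum-Welch procedure.

```python
import math


def transition(d, pi_bound, length_scale):
    """2x2 transition matrix for adjacent probes d bases apart (assumed form):
    T[i][j] = pi[j] + exp(-d / length_scale) * (1{i == j} - pi[j]).
    Close probes tend to share a state; distant probes are nearly independent."""
    pi = (1.0 - pi_bound, pi_bound)
    w = math.exp(-d / length_scale)
    return [[pi[j] + w * ((1.0 if i == j else 0.0) - pi[j]) for j in range(2)]
            for i in range(2)]


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_bound(positions, signals, pi_bound=0.1, length_scale=500.0,
                    means=(0.0, 2.0), sds=(1.0, 1.0)):
    """Posterior probability that each probe is in the bound state,
    computed with a scaled forward-backward pass."""
    n = len(signals)
    pi = (1.0 - pi_bound, pi_bound)
    emit = [[normal_pdf(x, means[s], sds[s]) for s in range(2)] for x in signals]

    # One transition matrix per gap between adjacent probes.
    trans = [transition(positions[t] - positions[t - 1], pi_bound, length_scale)
             for t in range(1, n)]

    # Forward pass, normalised at every step to avoid underflow.
    a = [pi[s] * emit[0][s] for s in range(2)]
    total = sum(a)
    alpha = [[v / total for v in a]]
    for t in range(1, n):
        T = trans[t - 1]
        a = [emit[t][j] * sum(alpha[t - 1][i] * T[i][j] for i in range(2))
             for j in range(2)]
        total = sum(a)
        alpha.append([v / total for v in a])

    # Backward pass, also normalised (constants cancel in the posterior).
    beta = [[1.0, 1.0] for _ in range(n)]
    for t in range(n - 2, -1, -1):
        T = trans[t]
        b = [sum(T[i][j] * emit[t + 1][j] * beta[t + 1][j] for j in range(2))
             for i in range(2)]
        total = sum(b)
        beta[t] = [v / total for v in b]

    post = []
    for t in range(n):
        g = [alpha[t][s] * beta[t][s] for s in range(2)]
        post.append(g[1] / (g[0] + g[1]))
    return post
```

In this sketch, the probe-level posteriors play the role of the probe level summaries mentioned above. Handling several array designs or several proteins would require design-specific and protein-specific emission parameters and a larger state space, which are not attempted here.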
