Gene expression analysis provides the link between genome information and phenotype, and is widely used in biomedical research. With the rapid advance of high-throughput technology, it is now feasible to measure global mRNA expression in multiple samples at low cost. Over the past decade, many computational and statistical methods have been developed to interpret large-scale gene expression data. However, two questions have still not been thoroughly investigated: 1) how to study the preservation of gene expression across different tissues, such as brain and blood; and 2) how to analyze gene expression data generated from heterogeneous tissues composed of many cell types.
Blood samples are an important surrogate for studying neurological diseases because access to brain samples is limited. My dissertation first investigated the preservation of gene expression between brain and blood by cross-referencing three brain expression data sets (from cortex, cerebellum and caudate nucleus) with two large blood data sets. While previous studies have focused on the preservation of individual gene expression levels across the two tissues, I used a systems biology approach to study the preservation of gene co-expression modules. Since oligodendrocytes, astrocytes, and neurons are not present in blood, it is not surprising that only a handful of human brain modules showed evidence of preservation in human blood, while global transcriptome organization is poorly preserved. The shared relationships characterized here may aid future efforts to identify blood biomarkers for neurological and neuropsychiatric diseases when brain tissue samples are unavailable.
For the second question, several previous publications have proposed gene expression deconvolution methods for admixed samples composed of distinct cell types, including methods that estimate cell abundances or cell type-specific gene expression (CTSE) values. These methods have not yet been widely adopted because comprehensive empirical evaluations are needed to assess their reliability. Here I evaluated different types of expression deconvolution methods in four empirical data sets, including a neuroscience application. Since cell type-specific estimation of the mean value for individual genes is sometimes problematic, we propose to consider sets of genes (as opposed to individual genes) and show that this can increase the accuracy of CTSE estimation. Furthermore, comprehensive simulation studies are used to evaluate the effect of mis-specifying cell types. Our simulations indicated that erroneously omitting cell types from the analysis has an adverse effect on CTSE estimation only if the omitted cell type has a high abundance. We also present two R functions, proportionsInAdmixture and populationMeansInAdmixture, which implement cell abundance estimation and CTSE estimation, respectively.
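
To make the idea of CTSE estimation concrete, the following is a minimal sketch of one generic approach, assuming a linear mixing model in which the expression of a gene in an admixed sample is the abundance-weighted sum of its cell type-specific means. It is not the implementation of populationMeansInAdmixture; the function name estimate_ctse_two_types, the restriction to two cell types, and the toy numbers are assumptions made purely for illustration.

```python
def estimate_ctse_two_types(expression, proportions):
    """Least-squares estimate of two cell type-specific means for one gene.

    expression:  admixed expression values, one per sample.
    proportions: (p_a, p_b) cell-type proportions, one pair per sample.

    Assumed model (illustrative only): y_i = p_a_i * mu_a + p_b_i * mu_b + noise.
    """
    # Entries of the 2x2 normal equations (P^T P) mu = P^T y.
    s_aa = sum(pa * pa for pa, _ in proportions)
    s_ab = sum(pa * pb for pa, pb in proportions)
    s_bb = sum(pb * pb for _, pb in proportions)
    t_a = sum(pa * y for (pa, _), y in zip(proportions, expression))
    t_b = sum(pb * y for (_, pb), y in zip(proportions, expression))

    # The means are only identifiable if proportions vary across samples.
    det = s_aa * s_bb - s_ab * s_ab
    if abs(det) < 1e-12:
        raise ValueError("proportions do not vary enough across samples")

    # Solve the 2x2 system with Cramer's rule.
    mu_a = (t_a * s_bb - t_b * s_ab) / det
    mu_b = (s_aa * t_b - s_ab * t_a) / det
    return mu_a, mu_b


# Toy, noise-free data generated with mu_a = 10 and mu_b = 2.
proportions = [(0.2, 0.8), (0.5, 0.5), (0.9, 0.1)]
expression = [pa * 10.0 + pb * 2.0 for pa, pb in proportions]
mu_a, mu_b = estimate_ctse_two_types(expression, proportions)
print(mu_a, mu_b)
```

With real data the estimates for a single gene can be noisy, which is one motivation for pooling evidence across sets of genes as described above.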