The healthy function of complex tissues depends on an intricate combination of cell types working together to maintain homeostasis. Disease and stress frequently alter the normal mix of cell types found in a healthy tissue, either directly or by eliciting an immune response. The cell type composition of these tissues is therefore of natural interest to both researchers and clinicians.
However, quantifying cell type populations has proven to be a challenging and often expensive task. Traditional methods suffer from several limitations and can introduce bias. FACS has been a common approach for many years, but it remains slow and expensive, making it difficult to apply to large studies. Single-cell methods are emerging and may become more cost effective in the future, but they still present a prohibitive financial barrier for many labs. Moreover, both of these technologies fail to capture cells with unusual morphologies. Neurons, myocytes, and adipocytes are too large, unusually shaped, or fragile to be reliably quantified by these methods.
As gene expression data have become more ubiquitous, computational methods for cell type quantification have gained interest and popularity. These approaches, termed cell type deconvolution, use knowledge of cell type specific gene expression to estimate cell type abundances in samples of unknown composition. However, gene expression deconvolution is a challenging problem, and the accuracy of predictions is sensitive to a number of factors. Many approaches have emerged, but they struggle to maintain accurate predictions when faced with novel data from varying platforms, tissue types, or species.
I have developed the Gene Expression Deconvolution Tool (GEDIT), a robust deconvolution tool that aims to overcome limitations still present in the field. GEDIT is designed to be flexible and applicable to a wide range of cell types, platforms, and species. It uses novel techniques for selecting signature genes; these identify genes with cell type specific expression patterns and improve the speed and accuracy of results. A transformation is also applied to control for the effect of highly expressed genes and further improve the quality of results. Lastly, GEDIT applies linear regression to model the observed expression data. I have applied GEDIT to a number of datasets, including the entire GTEx database.
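To make the three steps described above concrete (signature gene selection, a transformation that limits the influence of highly expressed genes, and a linear regression), the following is a minimal Python sketch. It is not GEDIT's actual implementation: the specificity score, the per-gene scaling, the projected-gradient solver, all function names, and all expression values are assumptions chosen purely for illustration.

```python
# Illustrative sketch only; NOT GEDIT's actual code.
# Function names, scoring rule, scaling rule, and data are hypothetical.

def specificity(row):
    """Share of a gene's total expression contributed by its top cell type."""
    total = sum(row)
    return max(row) / total if total > 0 else 0.0

def select_signature_genes(reference, n_genes):
    """Return the n_genes genes with the most cell-type-specific expression."""
    ranked = sorted(reference, key=lambda g: specificity(reference[g]), reverse=True)
    return ranked[:n_genes]

def scale_rows(reference, mixture, genes):
    """Divide each gene by its largest value so no single gene dominates the fit."""
    ref_rows, mix_vals = [], []
    for g in genes:
        top = max(max(reference[g]), mixture[g]) or 1.0
        ref_rows.append([v / top for v in reference[g]])
        mix_vals.append(mixture[g] / top)
    return ref_rows, mix_vals

def nonneg_least_squares(a, b, steps=5000):
    """Minimize ||a x - b||^2 subject to x >= 0 by projected gradient descent."""
    n = len(a[0])
    # Step size from a safe upper bound on the curvature (2 * trace of a^T a).
    lr = 1.0 / (2.0 * sum(v * v for row in a for v in row))
    x = [1.0 / n] * n
    for _ in range(steps):
        resid = [sum(r * xj for r, xj in zip(row, x)) - bi for row, bi in zip(a, b)]
        grad = [2.0 * sum(row[j] * r for row, r in zip(a, resid)) for j in range(n)]
        x = [max(0.0, xj - lr * gj) for xj, gj in zip(x, grad)]
    return x

def deconvolve(reference, mixture, n_genes):
    """Estimate cell type fractions (summing to 1) for one mixture sample."""
    genes = select_signature_genes(reference, n_genes)
    a, b = scale_rows(reference, mixture, genes)
    x = nonneg_least_squares(a, b)
    total = sum(x) or 1.0
    return [v / total for v in x]

# Made-up reference profiles; columns are T cell, B cell, monocyte.
reference = {
    "CD3E": [100.0, 2.0, 1.0],
    "CD19": [1.0, 80.0, 2.0],
    "CD14": [3.0, 1.0, 120.0],
    "ACTB": [500.0, 480.0, 510.0],   # broadly expressed, so not selected
    "GAPDH": [300.0, 310.0, 290.0],  # broadly expressed, so not selected
}
true_fractions = [0.5, 0.3, 0.2]
# Simulate a noise-free mixture as a weighted sum of the reference profiles.
mixture = {g: sum(v * f for v, f in zip(row, true_fractions))
           for g, row in reference.items()}

estimated = deconvolve(reference, mixture, n_genes=3)
print([round(v, 3) for v in estimated])
```

Because the simulated mixture is an exact, noise-free combination of the reference profiles, the estimate should recover the simulated fractions closely; real data would add noise, platform effects, and cell types missing from the reference.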
In addition, I am performing a large-scale benchmarking project in which I compare 8 current deconvolution tools (with more to be added) on several datasets of known proportions. These include a large clinical dataset of over 5,000 blood samples taken directly from healthy individuals. Cell type quantification for these samples was carried out by physical means, specifically electrical impedance cell counting. The project comprehensively evaluates the performance of these tools across several mixture and reference datasets.
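As a rough illustration of how predictions can be scored against known proportions, the sketch below computes two common agreement measures, Pearson correlation and mean absolute error, for two hypothetical tools. The choice of metrics, the tool names, and every number are assumptions for illustration; they are not the metrics or results of the benchmarking project itself.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between predicted and known fractions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_abs_error(xs, ys):
    """Average absolute difference between predicted and known fractions."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical known fractions for one sample and two tools' predictions.
known = [0.60, 0.30, 0.10]
predictions = {
    "tool_a": [0.55, 0.33, 0.12],  # hypothetical tool, close to the truth
    "tool_b": [0.30, 0.30, 0.40],  # hypothetical tool, far from the truth
}

scores = {
    name: {"pearson": pearson(pred, known), "mae": mean_abs_error(pred, known)}
    for name, pred in predictions.items()
}
for name, s in scores.items():
    print(name, round(s["pearson"], 3), round(s["mae"], 3))
```

In practice such scores would be aggregated over many samples and cell types, and over each combination of mixture and reference dataset, before tools are ranked.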