The Gene Expression Deconvolution Interactive Tool (GEDIT): accurate cell type quantification from gene expression data

Abstract

Background: The cell type composition of heterogeneous tissue samples can be a critical variable in both clinical and laboratory settings. However, current experimental methods of cell type quantification (e.g., cell flow cytometry) are costly, time-consuming and have the potential to introduce bias. Computational approaches that use expression data to infer cell type abundance offer an alternative solution. While these methods have gained popularity, most fail to produce accurate predictions for the full range of platforms currently used by researchers or for the wide variety of tissue types often studied.

Results: We present the Gene Expression Deconvolution Interactive Tool (GEDIT), a flexible tool that utilizes gene expression data to accurately predict cell type abundances. Using both simulated and experimental data, we extensively evaluate the performance of GEDIT and demonstrate that it returns robust results under a wide variety of conditions. These conditions include multiple platforms (microarray and RNA-seq), tissue types (blood and stromal), and species (human and mouse). Finally, we provide reference data from 8 sources spanning a broad range of stromal and hematopoietic types in both human and mouse. GEDIT also accepts user-submitted reference data, thus allowing the estimation of any cell type or subtype, provided that reference data are available.

Conclusions: GEDIT is a powerful method for evaluating the cell type composition of tissue samples and provides excellent accuracy and versatility compared to similar tools. The reference database provided here also allows users to obtain estimates for a wide variety of tissue samples without having to provide their own data.


Introduction
Cell type composition is an important variable in biological and medical research. In laboratory experiments, cell sample heterogeneity can act as a confounding variable. Observed changes in gene expression may result from changes in the abundance of underlying cell populations, rather than changes in expression of any particular cell type [1]. In clinical applications, the cell type composition [...]

[Table: reference matrices. Columns: Matrix; Species; Reference; Platform; # of Cell Types; Cell Types. First row (truncated): Human Skin Signatures; Human; Swindell et al. 2013; Multi-Microarray; 21; Immun...]

A total of 10,000 simulated mixtures were generated, each using one of four reference matrices, with four, five, six, or ten cell types being simulated. Deconvolution was performed using a different expression matrix from the one used to generate the mixtures. When not otherwise noted, we use the following

Preprocessing and Quantile Normalization

The first step in the GEDIT pipeline is to render the two matrices comparable. This is done by

To identify the best signature genes in a given reference matrix, GEDIT calculates a signature score for each gene. By default, this score is computed using the concept of information

We also evaluated the effect of accepting more signature genes for some cell types than [...]

[Table: comparison of deconvolution tools; column headers not recoverable. Surviving entries, in source order: ...econvolution; Scores; 3; Immune; Human, Mouse. CIBERSORT (absolute mode); Newman et al., 2015; Yes; Deconvolution; Scores; 1; Immune; Human. SaVant; Lopez et al., 2017; Yes, if marker genes specified; Marker Genes; Scores; 12; Immune and Stromal; Hum...]

The optimal choice of reference matrix varies greatly depending on the exact combination of tool,

We also perform two additional comparisons between GEDIT and other deconvolution tools. First, we create 100 simulated mixtures of pancreatic cells (alpha, beta, gamma, delta) using data from a recent single-cell experiment (details in supplementary materials). We evaluate the accuracy of each tool when used to predict the cell type content of these synthetic mixtures, and GEDIT provides the lowest overall error (Supplementary Figure 3).
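The construction of synthetic mixtures with known composition can be illustrated with a minimal sketch. It assumes each cell type is summarized by a single expression profile and that a mixture is the fraction-weighted sum of those profiles. The toy profiles, the uniform sampling of fractions, and the use of mean absolute error as the error metric are all illustrative assumptions, not details taken from the study.

```python
import random


def simulate_mixture(profiles, rng):
    """Return (fractions, mixture) for one synthetic sample.

    profiles maps cell type -> list of per-gene expression values.
    """
    cell_types = sorted(profiles)
    # Draw random positive weights and normalize them so they sum to 1.
    weights = [rng.random() for _ in cell_types]
    total = sum(weights)
    fractions = {ct: w / total for ct, w in zip(cell_types, weights)}
    n_genes = len(profiles[cell_types[0]])
    # Mixture expression is the fraction-weighted sum of the pure profiles.
    mixture = [
        sum(fractions[ct] * profiles[ct][g] for ct in cell_types)
        for g in range(n_genes)
    ]
    return fractions, mixture


def mean_absolute_error(true_fracs, predicted_fracs):
    """Average absolute difference between true and predicted fractions."""
    diffs = [abs(true_fracs[ct] - predicted_fracs[ct]) for ct in true_fracs]
    return sum(diffs) / len(diffs)


# Hypothetical toy profiles (3 genes) for the four pancreatic cell types.
profiles = {
    "alpha": [10.0, 1.0, 1.0],
    "beta": [1.0, 10.0, 1.0],
    "gamma": [1.0, 1.0, 10.0],
    "delta": [5.0, 5.0, 5.0],
}

rng = random.Random(0)  # fixed seed so the simulation is repeatable
mixtures = [simulate_mixture(profiles, rng) for _ in range(100)]
```

Because the true fractions are stored alongside each mixture, the predictions of any tool can be scored against them with `mean_absolute_error` or another metric of choice.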

Lastly, we evaluate the runtime required for each tool. We randomly select batches

GEDIT can be used to decompose data from any organism for which reference data is available. Here, we demonstrate the efficacy of GEDIT when applied to the Mouse Body Atlas, a collection of tissue and cell type samples collected from mice [23]. As reference data, we assembled a

Warnings indicated that three samples suffered from low replicate concordance and one sample from low read depth, and these samples were excluded. All samples were processed by the Gingeras Lab at Cold Spring Harbor and mapped to GRCh38.

The samples were quantile normalized and clustered. In cases where multiple transcripts were measured for a single gene, the expression of that gene was calculated as the sum of all transcripts.
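The two operations described here, summing transcripts per gene and quantile normalizing samples, can be sketched as follows. This is an illustration under stated assumptions and not the pipeline used in the study: the transcript-to-gene mapping and the data are hypothetical, ties are broken by position instead of being averaged, and summation is applied before normalization purely for convenience.

```python
def sum_transcripts_by_gene(transcript_rows, transcript_to_gene):
    """Collapse transcript-level rows to gene-level rows by summation."""
    gene_rows = {}
    for transcript, values in transcript_rows.items():
        gene = transcript_to_gene[transcript]
        if gene in gene_rows:
            gene_rows[gene] = [a + b for a, b in zip(gene_rows[gene], values)]
        else:
            gene_rows[gene] = list(values)
    return gene_rows


def quantile_normalize(columns):
    """Quantile normalize equal-length sample columns."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Target distribution: mean of the k-th smallest value across samples.
    target = [sum(sc[k] for sc in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        # Indices of this sample's values, from smallest to largest.
        order = sorted(range(n), key=lambda i: col[i])
        new_col = [0.0] * n
        for rank, i in enumerate(order):
            new_col[i] = target[rank]
        normalized.append(new_col)
    return normalized


# Hypothetical transcript-level data: each row holds values for two samples.
transcript_rows = {
    "tx1a": [1.0, 4.0],
    "tx1b": [2.0, 1.0],
    "tx2": [5.0, 2.0],
    "tx3": [0.0, 3.0],
}
transcript_to_gene = {"tx1a": "G1", "tx1b": "G1", "tx2": "G2", "tx3": "G3"}

gene_rows = sum_transcripts_by_gene(transcript_rows, transcript_to_gene)
genes = sorted(gene_rows)
n_samples = 2
sample_columns = [[gene_rows[g][s] for g in genes] for s in range(n_samples)]
normalized = quantile_normalize(sample_columns)
```

After normalization every sample shares the same set of values, so differences between samples reflect only the rank of each gene within its sample.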

At this time, 18 additional samples were excluded as they did not cluster with their replicates. Based

We also combined the ENCODE and BLUEPRINT reference matrices into a single reference matrix, which we call BlueCode. We combined, then quantile normalized, the columns of both

However, since dendritic cells were never present at more than 3.5% abundance, we did not evaluate performance for this cell type.

xCell produces 67 output scores, seven of which were used in this study. These were the entries labelled "B-Cells", "Macrophages", "Monocytes", "NK cells", "Neutrophils", "CD4+ T cells" and "CD8+ T Cells". As suggested by the xCell authors, the outputs for CD4 and CD8 T cell subtypes were summed to produce a final output for total T cells.
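The summation of subtype outputs into a single T cell score is simple bookkeeping, sketched below. The two T cell labels are those quoted in the text; the function name, the output label "T cells", and the score values are hypothetical.

```python
# Labels of the two xCell entries that are summed, as quoted in the text.
T_CELL_LABELS = ("CD4+ T cells", "CD8+ T Cells")


def collapse_t_cells(sample_scores):
    """Replace the CD4 and CD8 entries with one summed 'T cells' entry."""
    collapsed = {
        label: score
        for label, score in sample_scores.items()
        if label not in T_CELL_LABELS
    }
    collapsed["T cells"] = sum(sample_scores[label] for label in T_CELL_LABELS)
    return collapsed


# Hypothetical scores for one sample (values chosen only for illustration).
sample_scores = {
    "B-Cells": 0.125,
    "Monocytes": 0.25,
    "CD4+ T cells": 0.25,
    "CD8+ T Cells": 0.5,
}
collapsed = collapse_t_cells(sample_scores)
```

The same pattern applies to any tool that reports subtypes separately: map each subtype label to its parent cell type and sum within each parent.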

Reference Data

We evaluated the performance of the four reference-based tools (GEDIT, CIBERSORT, [...] BLUEPRINT, and the Human Primary Cell Atlas). The BLUEPRINT and Human Primary Cell Atlas reference matrices differ from ImmunoStates and LM22 in that they contain tens of thousands of genes, many of which should not be considered signature genes. In contrast, ImmunoStates and LM22 each contain fewer than 600 genes, which have been specifically identified as signature genes by previous work [14,21]. We include both forms of reference matrices in order to evaluate the input requirements of the tools studied.

[...] memory), Monocytes (CD14 and CD16), NK cells (resting and active) and T cells (many subtypes including varieties of CD4 and CD8). In each case, the outputs for each subtype were summed to produce a total score for each broader cell type.

As part of this project, we compare the performance of several deconvolution tools using multiple metrics. Unlike previous evaluation studies, we explore the effect of reference choice by running tools multiple times with reference data from different sources. The choice of reference has a large impact on the accuracy of many tools, but GEDIT provides robust performance and accurate estimates for many possible reference choices. While all efforts were taken to perform this comparison in an unbiased manner, the authors note that development of the tool was still underway when the first comparisons were made. All code and inputs used to reproduce this study [...] code, which is limited by copyright.

The high performance of GEDIT is due to two key innovations. First, signature gene selection by information entropy serves to select the genes that are most informative for deconvolution. Second, the row scaling step equally weights all signature genes in the final estimate, even those with comparatively low expression. In addition, the flexibility of GEDIT and the diverse set of reference matrices we provide allow GEDIT to be easily applied in a wide range of circumstances.
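The two ideas named above can be made concrete with a minimal sketch. It assumes that a gene's signature score is the Shannon entropy of its expression distribution across reference cell types (lower entropy meaning more cell-type-specific), and that row scaling is a plain min-max rescaling of each gene to the range 0 to 1. Both are simplifying assumptions for illustration; the exact scoring and scaling formulas used by GEDIT may differ, and all function names and data here are hypothetical.

```python
import math


def entropy_score(row):
    """Shannon entropy (bits) of one gene's expression across cell types."""
    total = sum(row)
    if total <= 0:
        # An unexpressed gene carries no information; rank it last.
        return float("inf")
    probs = [x / total for x in row if x > 0]
    return -sum(p * math.log2(p) for p in probs)


def select_signature_genes(reference, n_genes):
    """Return the n_genes genes with the lowest entropy score."""
    ranked = sorted(reference, key=lambda gene: entropy_score(reference[gene]))
    return ranked[:n_genes]


def row_scale(row):
    """Min-max scale one gene's values so every gene spans the same range."""
    lo, hi = min(row), max(row)
    if hi == lo:
        return [0.0] * len(row)
    return [(x - lo) / (hi - lo) for x in row]


# Hypothetical reference matrix: gene -> expression in three cell types.
reference = {
    "GENE_A": [100.0, 0.0, 0.0],  # highly specific, highly expressed
    "GENE_B": [10.0, 10.0, 10.0],  # uniform, uninformative
    "GENE_C": [2.0, 1.0, 0.0],  # fairly specific, lowly expressed
}

signature = select_signature_genes(reference, 2)
scaled = {gene: row_scale(reference[gene]) for gene in signature}
```

In this toy case the uniform gene is discarded, and after scaling the lowly expressed `GENE_C` spans the same range as `GENE_A`, so neither dominates a downstream fit simply because of its magnitude.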

The output of GEDIT represents the fraction of mRNA originating from each cell type. This is [...] matrices for varied cell types for both mouse and human datasets.

GEDIT provides unique advantages to researchers, especially in terms of cell type, species and platform flexibility, and constitutes a useful addition to the existing set of tools for tissue decomposition. Our efficient decomposition methodology has been extensively optimized and we find that it performs robustly across a broad range of tissues in both mouse and human datasets. Our