scDesign2: an interpretable simulator that generates high-fidelity single-cell gene expression count data with gene correlations captured

In the burgeoning field of single-cell transcriptomics, a pressing challenge is to benchmark various experimental protocols and numerous computational methods in an unbiased manner. Although dozens of simulators have been developed for single-cell RNA-seq (scRNA-seq) data, they lack the capacity to simultaneously achieve all the three goals: preserving genes, capturing gene correlations, and generating any number of cells with varying sequencing depths. To fill in this gap, here we propose scDesign2, an interpretable simulator that achieves all the three goals and generates high-fidelity synthetic data for multiple scRNA-seq protocols and other single-cell gene expression count-based technologies. Compared with existing simulators, scDesign2 is advantageous in its transparent use of probabilistic models and is unique in its ability to capture gene correlations via copula. We verify that scDesign2 generates more realistic synthetic data for four scRNA-seq protocols (10x Genomics, CEL-Seq2, Fluidigm C1, and Smart-Seq2) and two single-cell spatial transcriptomics protocols (MERFISH and pciSeq) than existing simulators do. Under two typical computational tasks, cell clustering and rare cell type detection, we demonstrate that scDesign2 provides informative guidance on deciding the optimal sequencing depth and cell number in single-cell RNA-seq experimental design, and that scDesign2 can effectively benchmark computational methods under varying sequencing depths and cell numbers. With these advantages, scDesign2 is a powerful tool for single-cell researchers to design experiments, develop computational methods, and choose appropriate methods for specific data analysis needs.


Introduction
The recent development of the single-cell RNA-seq (scRNA-seq) technologies has revolutionized transcriptomic studies by revealing the genome-wide gene expression levels within individual 1 ZINB-WaVE was not proposed as a simulator in its original publication [38] but was later implemented as a simulator in the splatter package [65]. 2 A quote from the SERGIO paper [72]: "It is worth noting here that several existing single-cell expression simulators employ a probabilistic model whose parameters are directly estimated from a real dataset and then sample synthetic data from the model. This approach is not feasible in SERGIO since the true GRN underlying the real dataset is unknown and notoriously hard to reconstruct, and the explicit use of a GRN is a crucial distinguishing feature of SERGIO. As such, SERGIO uses a randomly generated GRN to first synthesize clean expression data and uses the real dataset only in the second phase, to determine the extent of technical noise to add to the clean data." to capture gene correlations, it cannot vary sequencing depth (not satisfying property 4), and its black-box nature, requirement for GPU, and long computational time make it fail properties 5 and 6. Hence, a simulator that achieves all the six properties is in demand.
Here we propose scDesign2 as the first simulator that achieves all the six properties and generates realistic synthetic data for multiple single-cell gene expression count-based technologies.
Inheriting its name from our previous simulator scDesign, scDesign2 has achieved a significant methodological advance and become the first interpretable simulator that reliably captures gene correlations. This advance is enabled by probabilistic modeling of not only marginal distributions of individual genes but also the joint distribution of thousands of genes. Thanks to its achievement of the six properties, scDesign2 will serve as a powerful tool for guiding experimental design and benchmarking computational methods in the single-cell transcriptomics field.

An overview of scDesign2
The statistical framework of scDesign2 consists of two steps: (1) model-fitting and (2) synthetic data generation (Fig. 1). In the model-fitting step, scDesign2 fits a multivariate generative model to a real scRNA-seq dataset. If the dataset contains more than one cell type (defined by marker genes or cell clustering; see Methods), then scDesign2 divides the dataset into subsets, one per cell type, and fits a cell-type-specific model to each subset. In the data-generation step, scDesign2 generates synthetic scRNA-seq data from the fitted model for each cell type.
The model-fitting step is composed of the following two sub-steps. First, scDesign2 fits a univariate count distribution to each gene's counts in cells of the same type. Four count distributions are considered: Poisson, zero-inflated Poisson (ZIP), negative binomial (NB), and zeroinflated negative binomial (ZINB), with the former three as special cases of the ZINB. All the four distributions have been widely used to model a gene's read or UMI counts in a homogeneous group of cells [38,[73][74][75]. From these four distributions, scDesign2 chooses one distribution for every gene in every cell type in a data-driven way. Second, scDesign2 captures the correlations of thousands of genes (all the moderately to highly expressed genes) by fitting a Gaussian copula in each cell type. We choose the Gaussian copula for its easiness to fit and good interpretability, and we find it capturing gene correlations well ( Fig. 3 and Supplementary Figs. S12-S16).
As the first simulator that explicitly captures gene correlations, scDesign2 leverages a unique advantage of the copula framework: the separate modeling of each gene's marginal distribution and the correlation structure of thousands of genes together. This separation and its resulting flexibility are critical for scDesign2 to model single-cell gene expression count data generated by various experimental protocols. Thanks to this flexibility, scDesign2 can choose a count distribution from Poisson, ZIP, NB, and ZINB to fit each gene's expression counts and reveal biological insights of that gene's expression pattern.
For each dataset, we randomly split its cells into two halves, with one half ("training data") to be used to train every simulator and the other half ("test data") to serve as the benchmark standard to compare with the synthetic data generated by each simulator.
We use three sets of benchmark analyses to compare synthetic data with the corresponding test data. First, we select three cell types from each dataset (measured by each experimental protocol), obtaining a total of 12 cell-type-protocol combinations. For each combination, we evaluate eight key statistics: four gene-wise (expression mean, variance, coefficient of variation (cv), and zero proportion); two cell-wise (zero proportion and library size); two gene-pair-wise (Pearson correlation and Kendall's tau). (Note that we include Kendall's tau instead of Spearman rank correlation as a rank-based correlation statistic because Kendall's tau can account for ties.) For each statistic, we compare its empirical distribution-across genes (for gene-wise statistics), cells (for cell-wise statistics), or gene-pairs (for gene-pair-wise statistics)-in the test data with that in the synthetic data generated by each simulator. For the four gene-wise and two gene-pair-wise statistics, we also directly compare their values in the test data with those in the synthetic data generated by scDesign2, ZINB-WaVE, SPARSim, and scGAN-the four simulators that preserve genes. The results are summarized in Fig. 2 and Supplementary Figs. S1-S11. Second, for each of the 12 cell-type-protocol combinations, we compare the gene correlation matrix estimated from the test data with that from the synthetic data generated by each simulator that preserves genes.
We exclude the simulators that do not preserve genes because the gene expression matrices estimated from their synthetic data do not align with those from real data (i.e., matrices have different dimensions). The results are summarized in Fig. 3 and Supplementary Figs. S12-S16.
Third, for each of the four protocols, we use 2D visualization-t-SNE and PCA-to compare cells of multiple types in the test data and the synthetic data generated by each simulator that preserves genes. Again, we exclude the simulators that do not preserve genes because their synthetic cells cannot be combined with real cells for joint visualization (dimesionality reduction requires all cells to have the same original dimensions, i.e., genes). The results are summarized in Fig. 4 and Supplementary Figs. S17-S19.
Overall, we find that the synthetic data generated by scDesign2 most resemble the test data of all the four experimental protocols. In our first set of analyses, we categorize the eight existing 5 simulators into two types: simulators that preserve genes (ZINB-WaVE, SPARSim, and scGAN) and others. By comparing the distributions of eight key statistics between test data and synthetic data, we find that the simulators capable of preserving genes have overall better performance than other simulators, across cell types and protocols ( Fig. 2a and Supplementary Figs. S1a-S11a).
Then we further compare these simulators that preserve genes by directly comparing the genewise and gene-pair-wise statistics' values between their synthetic data and test data. Note that we cannot compare these statistics' values for simulators that do not preserve genes because the "genes" in those simulators' synthetic data cannot be matched to any genes in real data. In detail, we calculate the mean-squared errors (MSEs) of the four gene-wise statistics and the two genepair-wise statistics between test data and synthetic data generated by scDesign2, ZINB-WaVE, SPARSim, and scGAN. Fig. 2b shows that scGAN, a deep-learning-based method, consistently has the worst MSEs for all the six statistics. Due to its long computational time 3 , difficult implementation, and unsatisfactory performance, we exclude it from the following comparisons. Out of 48 comparisons for gene-wise statistics (4 statistics times 12 cell-type-protocol combinations), scDesign2 achieves the best MSEs in 37 comparisons and demonstrates a clear advantage over ZINB-WaVE and SPARSim ( Fig. 2b and Supplementary Figs. S1b-S11b). However, out of 24 comparisons for gene-pair-wise statistics (2 statistics times 12 cell-type-protocol combinations), scDesign2 only achieves the best MSEs in 5 comparisons and is outperformed by SPARSim.
Inspecting the results, we find that the MSE is not an informative measure of how a simulator preserves correlations, because its value is dominated by most gene pairs that have correlations close to 0. Hence, we examine correlations of individual gene pairs and observe that scDesign2 can preserve strong negative gene correlations missed by ZINB-WaVE and SPARSim, which wrongly capture these correlations as weak or even positive (Fig. 2c-d and Supplementary Figs. S1c-d-S11c-d). This observation is further confirmed by our second set of analyses below. Furthermore, we compare the relationships of three pairs of gene-wise statistics (zero proportion vs. mean, variance vs. mean, and cv vs. mean) between test data and synthetic data generated by each simulator, and we find that scDesign2 better captures the relationships than existing simulators do across cell types and experimental protocols ( Fig. 2e and Supplementary Figs. S1e-S11e).
In our second set of analyses, we compare gene correlation matrices in terms of both Pearson correlation and Kendall's tau between test data and synthetic data generated by scDesign2, ZINB-WaVE, and SPARSim. Heatmap visualization shows that scDesign2 captures gene correlations most accurately and consistently across cell types and experimental protocols ( Fig. 3 and Supplementary Figs. S12-S16. Notably, for highly expressed genes in Smart-Seq2 data, ZINB-WaVE In our third set of analyses, we use 2D visualization to compare cells in test data and those in synthetic data generated by scDesign2, ZINB-WaVE, and SPARSim. Both t-SNE and PCA 2D plots show that cells in synthetic data generated by scDesign2 most resemble cells in test data ( Fig. 4 and Supplementary Figs. S17-S19). In particular, by overlaying real and synthetic cells in the same 2D plot, we find synthetic cells generated by scDesign2 least distinguishable from real cells. On the contrary, synthetic cells generated by ZINB-WaVE and SPARSim exhibit spurious patterns unseen in real cells. These results again confirm the superb performance and the realistic nature of scDesign2.
These three sets of analyses also verify the advantage of using copula in scDesign2. Compared with scDesign2, its w/o copula variant, as expected, cannot capture gene correlations at all (Fig. 2a, Fig. 3, Supplementary Figs. S1a-S11a, and Supplementary Figs. S12-S16). As a result, the synthetic data generated by the w/o copula variant do not resemble the corresponding real data in 2D visualization ( Fig. 4 and Supplementary Figs. S17-S19).
In addition to its realistic nature, scDesign2 also has two more unique advantages over ZINB-WaVE and SPARSim. Unlike the other two simulators, scDesign2 only considers genes as features and models their joint distribution, and it regards cells as observations instead of features. This formulation is aligned with the statistical thinking that genes are fixed quantities but cells are randomly sampled from a population of cells. Thanks to this principled formulation, scDesign2 can generate synthetic cells of any number, in contrast to ZINB-WaVE and SPARSim that can only generate the same number of synthetic cells as real cells. It is also worth noting that, although scDesign2 does not explicitly model the distribution of cell library sizes, it recovers that distribution rather faithfully (see the cell library size distributions in Fig. 2a and Supplementary Figs. S1a-S11a). This is achieved by modeling joint gene distributions and accounting for gene correlations through the use of copula. Compared to scGAN, the training of scDesign2 is fast and does not rely on a large number of input real cells for good training quality.

Application 1: scDesign2 generates realistic synthetic data for other single-cell expression count-based technologies
Beyond scRNA-seq data, we demonstrate that scDesign2 can also generate realistic synthetic data for other single-cell count-based technologies that do not necessarily use next-generation sequencing, as long as individual genes' count distributions can be well approximated by Poisson, ZIP, NB or ZINB. For instance, single-cell spatial transcriptomics technologies, usually based on fluorescence in situ hybridization (FISH), are known to yield Poisson or NB distributed counts [80,81]. The versatility of scDesign2 is endowed by its data-driven way of selecting marginal distributions for individual genes, regardless of each distribution being Poisson or NB, zero-inflated or not.
We demonstrate the accuracy of scDesign2 based on two single-cell spatial transcriptome datasets: one dataset of cells in the mouse hypothalamic preoptic region measured by multiplexed error robust fluorescence in situ hybridization (MERFISH) [82] and another dataset of cells in the mouse hippocampal area CA1 measured by probabilistic cell typing by in situ sequencing (pciSeq) 7 [83], a newly developed spatial transcriptome profiling technology. Both datasets contain labelled cell types. Due to the lack of simulators specifically designed for single-cell spatial transcriptome data, we still benchmark scDesign2 against ZINB-WaVE and SPARSim, the two simulaors that preserve genes. Fig. 5 shows the 2D visualization of each real dataset and the corresponding synthetic data generated by each simulator. For both datasets and in both t-SNE and PCA visualization, scDesign2 outperforms SPARSim and ZINB-WaVE by generating synthetic data that most resemble the real data. In particular, in t-SNE visualization (Fig. 5a), only the synthetic data of scDesign2 yield a clear separation of inhibitory cells (orange) and excitatory cells (purple) as in the real MERFISH data, and only scDesign2 generates synthetic data that preserve a small cluster of Sst cells (yellow) in the real pciSeq data. In PCA visualization (Fig. 5b), thanks to its deterministic property (PC coordinates are fixed for the same data), the similarity between the synthetic data of scDesign2 and the real data is more obvious than in the t-SNE plots (t-SNE coordinates are not fixed for the same data due to random initialization, random seed, etc.): the synthetic and real data give similar PC coordinates to every cell type, which is, however, less the case for the synthetic data of MERFISH and SPARSim. These results confirm the versalitity and robustness of scDesign2.

Application 2: scDesign2 guides experimental design and computational method benchmarking in cell clustering
Cell clustering is a ubiquitous computational task in single-cell research. Here we demonstrate how scDesign2 can guide experimental design (i.e., deciding the optimal cell number and sequencing depth) and benchmark computational methods for the cell clustering task.
After training scDesign2 on each of the four scRNA-seq datasets generated by different experimental protocols (10x Genomics [76], CEL-Seq2 [77], and Fluidigm C1 [78], Smart-Seq2 [79]), we apply the trained scDesign2 to generate synthetic data under two experimental design scenarios: (1) varying sequencing depths, where the total number of reads (or UMIs) varies but the cell number is fixed; (2) varying cell numbers, where the number of sequenced cells varies but the sequencing depth is fixed. For each protocol, scDesign2 generates a synthetic dataset per sequencing depth and cell number.
To guide the choices of sequencing depth and cell number based on clustering accuracy, we apply two popular scRNA-seq cell clustering methods-Seurat (the kNN-Jaccard-Louvain algorithm) [41,84] and SC3 [40]-to the synthetic datasets and use the adjusted mutual information (AMI) [85] and the adjusted Rand index (ARI) [86] as two clustering accuracy measures. Note that SC3 can be specified to output the same number of cell clusters as the annotated cell types, while Seurat cannot due to the nature of the Louvain algorithm it uses [84]. The results are summarized in Figs. 6-7 and Supplementary Figs. S20-S25.
For the first, varying-sequencing-depth scenario, we expect the clustering accuracy to improve as the sequencing depth increases, because a larger sequencing depth would better reveal every 8 cell's transcriptome profile and thus lead to better clustering. Moreover, we also expect there to be a saturation effect: the clustering accuracy no longer improves much after the sequencing depth increases to a point, which we regard as the optimal sequencing depth that balances clustering accuracy and budget. The results confirm our expectation. For the two UMI-based protocols 10x Genomics and CEL-Seq2, we observe the improvement and the saturation effect in clustering accuracy, based on both Seurat and SC3, as the sequencing depth increases. In detail, the saturation for 10x Genomics data occurs at 113.05 million UMIs for 3793 cells, while the real dataset has only 28.57 million UMIs (Fig. 6); the saturation for CEL-Seq2 data occurs at 42.72 million UMIs for 2279 cells, while the real dataset contains 172.14 million UMIs (Fig. S20). For the two non-UMI-based protocols Fluidigm C1 and Smart-Seq2, we observe the saturation effect even at the lowest sequencing depth we consider, likely due to the fact that these two protocols provide a deeper profiling of every cell than the UMI-based protocols do. In detail, the saturation for Fluidigm however, as the cell number reaches a point where every cell type has more than enough cells, further increasing the cell number would obscure every cell's profile and deteriorate clustering accuracy. For the two UMI-based protocols 10x Genomics and CEL-Seq2, our expectation is confirmed: we observe an overall trend of clustering accuracy that first increases and then decreases ( Fig. 7b and Supplementary Fig. S21b). In detail, for 10x Genomics data, both Seurat and SC3 have their accuracy maximized at 948 cells. This optimality is supported by the t-SNE visualization, which shows that the Seurat and SC3 cell clusters best agree with the annotated cell types at this optimal cell number (Fig. 7a). Hence, the real data cell number 3,793 is not optimal for distinguishing the annotated cell types by Seurat or SC3. For CEL-Seq2 data, Seurat and SC3 have optimal accuracy at 2,279 and 570 cells, respectively, also supported by the t-SNE visualization (Supplementary Fig. S21a). This suggests that the real data cell number 2,279 can lead to optimal cell clustering by Seurat. In contrast, for the two non-UMI-based protocols Fluidigm C1 and Smart-Seq2, we only observe a first-increasing-and-then-saturated trend of clustering accuracy as the cell number increases, without seeing the trend decreasing (except for SC3 on Smart-Seq2 data) (Supplementary Figs. S23b and S25b). A likely reason is that these two protocols can provide a clear profile of every cell up to a large cell number around 10,000 given their deep sequencing depths in real data (869.24 million reads in the Fluidigm C1 data and 1074.95 million reads in the Smart-Seq2 data). For both Seurat and SC3, the cell numbers at which their performance saturates are close to the cell numbers in real data: 317 cells in the Fluidigm C1 data and 1078 cells in the Smart-Seq2 data. In conclusion, we use scDesign2 to find that the cell numbers are close to being optimal in the CEL-Seq2, Fluidigm C1, and Smart-Seq2 datasets. For 10x Genomics, we would recommend decreasing the cell number to 948 cells (while keeping the sequencing depth at 28.58 million UMIs) to optimize the clustering accuracy by either Seurat or SC3.
Beyond experimental design, scDesign2 also provides a comprehensive comparison of Seurat and SC3 across sequencing depths and cell numbers. Overall, both methods demonstrate superb accuracy in a wide range of sequencing depths and cell numbers for every protocol. At closeto-optimal sequencing depths and cell numbers for each method, SC3 has better accuracy than Seurat. However, Seurat and SC3 has different robustness: Seurat is a more robust method for 10x genomics data when the sequencing depth is too low or the cell number is too large (Figs. 6b-7b), while SC3 is more robust when the cell number is small for CEL-Seq2 (Supplementary This finding is consistent with the fact that SC3 is an ensemble method that is more robust against a small number of cells but cannot be easily scaled up when the cell number is too large.

Application 3: scDesign2 guides experimental design and computational method benchmarking in rare cell type detection
Rare cell type detection is another important application of scRNA-seq, whose high-throughput profiling of cells opens an unprecedented opportunity to identify unknown cell types that are often rare but critical. Here we demonstrate how scDesign2 can guide experimental design (i.e., deciding the optimal cell number and sequencing depth) and benchmark computational methods for the rare cell type detection task.
From the 10x Genomics dataset of mouse intestine epithelial tissue [76], we select six cell types-stem cells (Stem), goblet cells (Goblet), tuft cells (Tuft), early transit amplifying cells (TA.Early), enterocyte progenitors (EP), and early enterocyte progenitors (EP.Early), among which Tuft is the rare cell type [87] and has a proportion less than 5% among the six cell types. After training scDesign2 on this dataset, we use scDesign2 to generate synthetic data under two experimental design scenarios: (1) varying sequencing depths, where the total number of UMIs varies but the cell number is fixed; (2) varying cell numbers, where the number of sequenced cells varies but the sequencing depth is fixed. For every sequencing depth and cell number, scDesign2 generates a synthetic dataset.
To guide the choices of sequencing depth and cell number based on rare-cell-type detection accuracy, we apply two popular methods-FiRE [46] and GiniClust2 [45]-to the synthetic datasets and evaluate four accuracy measures: precision (the percentage of truly rare cells among the detected rare cells), recall (the percentage of detected rare cells among the truly rare cells), F1score (the harmonic mean of the precision and recall), and AUPRC (the area under the precisionrecall curve). Since GiniClust2 does not allow adjustment of its detection threshold, we cannot calculate its AUPRC. However, as most users of FiRE would stick with its default threshold, the AUPRC measure is not as informative as the other three measures from a user's perspective.
For the first, varying-sequencing-depth scenario, we expect that the detection accuracy would improve as the sequencing depth increases and there would be a saturation effect, similar to our expectation for cell clustering. The detection accuracy of FiRE and GiniClust2 roughly confirm our expectation. Across twelve sequencing depths ranging from 1.76 to 3612.4 million UMIs (with the cell number fixed as 3793, the number of cells in real data), we observe an overall trend of increasing detection accuracy with few exceptions (Fig. 8). For FiRE, its accuracy exhibits saturation after the sequencing depth reaches 457.23 million UMIs (Fig. 8a & c), while for GiniClust2 the saturation occurs earlier at a sequencing depth of 113.05 million UMIs (Fig. 8b & d). million UMIs (Fig. 8d) is explained by the t-SNE visualization (Fig. 8b), which shows that GiniClust2 misidentifies the second largest cell cluster as the rare cell type, leads to many FP and FN cells, and results in close to zero precision and recall. Combining the FiRE and GiniClust2 results, we conclude that the real data sequencing depth at 28.57 million UMIs for 3793 cells is not optimal for detecting the rare cell type Tuft (Fig. 8c-d). If budget allows, we would recommend increasing the sequencing depth to 113.06 million UMIs and use GiniClust2 to detect tuft cells.
For the second, varying-cell-number scenario, we expect the detection accuracy to first increase and then decrease as the cell number increases, similar to our expectation for cell clustering. Again, the detection accuracy of FiRE and GiniClust2 confirm our expectation. Across thirteen cell numbers ranging from 29 to 121,376 (with the sequencing depth fixed as 28.57 million UMIs, the same as in real data), we observe an overall trend of detection accuracy that first increases 11 and then decreases (Fig. 9). For both FiRE and GiniClust2, their F1-scores are optimal at 1,896 cells ( Fig. 9c-d). This optimality is supported by the t-SNE visualization, which shows a plot of synthetic cells with TP, FP, FN, and TN labels for every cell number and each detection method ( Fig. 9a-b). Hence, the real data cell number 3793 is not optimal for detecting tuft cells given the total sequencing depth of 28.57 million UMIs. If the detection of tuft cells is a primary goal and the sequencing depth cannot be increased due to budget constraints, we would recommend decreasing the cell number of 1896 cells and use GiniClust2 to detect tuft cells.
In addition to assisting experimental design, scDesign2 also provides an objective comparison of FiRE and GiniClust2 across sequencing depths and cell numbers. Figs. 8-9 show that GiniClust2 has much better accuracy than FiRE at close-to-optimal sequencing depths and cell numbers. However, FiRE is a more robust method that it can successfully run at all sequencing depths and cell numbers, while GiniClust2 fails when the cell number is too small or too large (GiniClust3 may have addressed this large-cell-number issue [88]). This finding is consistent with the methodological difference between the two methods: FiRE detects rare cells via an outlier detection approach, while GiniClust2 first performs cell clustering and then identifies the cells in small clusters as rare cells. The requirement of cell clustering explains why GiniClust2 fails when the cells exhibit no clear clusters and why it works well when rare cells form small clear clusters.
In contrast, outlier detection has no requirement on cluster patterns, and this explains why FiRE is robust.

Discussion
In this article, we propose scDesign2, an interpretable simulator for single-cell gene expression count data. Our development of scDesign2 is motivated by the pressing challenge to generate realistic synthetic data for various scRNA-seq protocols and other single-cell gene expression count-based technologies. Unlike existing simulators including our previous simulator scDesign, scDesign2 achieves six properties: protocol adaptiveness, gene preservation, gene correlation capture, flexible cell number and sequencing depth choices, interpretability, and computational and sample efficiency. This achievement of scDesign2 is enabled by its unique use of the copula statistical framework, which combines marginal distributions of individual genes and the global correlation structure among genes. As a result, scDesign2 has the following methodological advantages that contribute to its high degree of transparency and interpretability. First, it selects a marginal distribution from four options (Poisson, ZIP, NB, and ZINB) for each gene in a data-driven manner to best capture and summarize the expression characteristics of that gene. Second, it uses a Gaussian copula to estimate gene correlations, which will be used to generate synthetic single-cell gene expression counts that preserve the correlation structures. Third, it can generate gene expression counts according to user-specified sequencing depth and cell number.
We have performed a comprehensive set of benchmarking and real data studies to evaluate scDesign2 in terms of its accuracy in generating synthetic data and its efficacy in guiding exper-imental design and benchmarking computational methods. Based on four scRNA-seq protocols and 12 cell types, our benchmarking results demonstrate that scDesign2 better captures gene expression characteristics in real data than eight existing scRNA-seq simulators do. In particular, among the four simulators that aim to preserve gene correlations, scDesign2 achieves the best accuracy. Moreover, we demonstrate the capacity of scDesign2 in generating synthetic data of other single-cell count-based technologies including MERFISH or pciSeq, two single-cell spatial transcriptomics technologies. After validating the realistic nature of synthetic data generated by scDesign2, we use real data applications to demonstrate how scDesign2 can guide the selection of cell number and sequencing depth in experimental design, as well as how scDesign2 can benchmark computational methods for cell clustering and rare cell type identification.
The current implementation of scDesign2 is restricted to single-cell datasets composed of discrete cell types, because the generative model of scDesign2 assumes that cells of the same type follow the same distribution of gene expression. However, many single-cell datasets exhibit continuous cell trajectories instead of discrete cell types. A nice property of the probabilistic model used in scDesign2 is that it is generalizable to account for continuous cell trajectories. First, we can use the Generalized Additive Model (GAM) [89][90][91] to model each gene's marginal distribution of expression as a function of cell pseudotime, which can be computationally inferred from real data [50,51,53]. Second, the copula framework can be used to incorporate gene correlation structures along the cell pseudotime. Combining these two steps into a generative model, this extension of scDesign2 has the potential to overcome the current challenge in preserving gene correlations encountered by existing simulators for single-cell trajectory data, such as Splatter Path [65], dyngen [92], and PROSSTT [64]. Another note is that scDesign2 does not generate synthetic cells based on outlier cells that do not cluster well with any cells in well-formed clusters. This is not necessarily a disadvantage, neither is it a unique feature to scDesign2. In fact, all model-based simulators that learn a generative model from real data must ignore certain outlier cells that do not fit well to their model. Some outlier cells could either represent an extremely rare cell type or are just "doublets" [93][94][95][96], artifacts resulted from single-cell sequencing experiments.
Hence, our stance is that ignorance of outlier cells is a sacrifice that every simulator has to make; the open question is the degree to which outlier cells should be ignored, and proper answers to this question must resort to statistical model selection principles.
Although we only use cell clustering and rare cell type detection to demonstrate scDesign2's use in guiding experimental design and benchmarking computational methods, we want to emphasize that scDesign2 has broad applications beyond these two tasks. Inheriting the flexible and interpretable modeling nature of our previous simulator scDesign, scDesign2 can also benchmark other computational analyses we have demonstrated in our scDesign paper [34], including differential gene expression analysis and cell dimensionality reduction. Moreover, beyond its role as a simulator, scDesign2 may benefit single-cell gene expression data analysis by providing its estimated parameters about gene expression and gene correlations. Here we discuss three potential directions. First, scDesign2 can assist differential gene expression analysis. Its esti-13 mated marginal distributions of individual genes in different cell types can be used to investigate more general patterns of differential expression (such as different variances and different zero proportions), in addition to comparing gene expression means between two groups of cells [97].
Second, its estimated gene correlation structures can be used to construct cell-type-specific gene networks [98] and incorporated into gene set enrichment analysis to enhance statistical power [99,100]. Third, scDesign2 has the potential to improve the alignment of cells from multiple singlecell datasets [101]. Its estimated gene expression parameters can guide the calculation of cell type or cluster similarities between batches, and its estimated gene correlation structures can be used to align cell types or clusters across batches based on the similarity in gene correlation structures. [102].

The statistical framework of scDesign2 Fitting a generative model of single-cell gene expression count data with gene correlations
Given an scRNA-seq count matrix X 2 N p⇥n with p genes and n cells, we assume that the n cells belong to K cell types and that the cell memberships have been assigned by clustering, labeled by marker genes, or known in advance. (For input data without pre-defined cell types, our recommendation for cell clustering is in two subsections.) Our goal is to fit a parametric count model to characterize the joint distribution of genes' counts in each cell type. For cell type k, We denote its number of cells by n (k) , its count sub-matrix by X (k) , and its set of model parameters by ⇥ (k) , k = 1, . . . , K. For simplicity of notation, we drop the superscript (k) in the following discussion about the generative model for one single cell type.
We denote X ·j = (X 1j , . . . , X pj ) T 2 R p as a random p-dimensional gene count vector in cell j, j = 1, . . . , n. We denote its realization, i.e., the observed gene count vector as the j-th column in X, by x ·j = (x 1j , . . . , x pj ) T . Jointly for the p genes, we assume that X ·j independently follows a p-dimensional distribution F , which we will specify by a copula in the next paragraph. Marginally for each gene i, we assume that X ij independently follows a univariate count distribution F i . For Note that the Z ij 's are unobserved and introduced only to describe the zero-inflation component.
The Poisson, the zero-inflated Poisson (ZIP), and the negative binomial (NB) distributions are three special cases of the ZINB distribution, where p i = 0 for Poisson and NB, and i = 1 for Poisson and ZIP. From these four distributions, scDesign2 automatically chooses the one that best fits to gene i's observed counts. Specifically, for the i-th row of X, x i· = (x i1 , . . . , x in ) T , if its sample meanx i· = n 1 P n j=1 x ij its sample varianceˆ 2 i = (n 1) 1 P n j=1 (x ij x i· ) 2 , i.e., there is no over-dispersion, scDesign2 fits the Poisson and the ZIP distributions separately to x i· by maximum likelihood estimation (MLE), and then performs a likelihood ratio test with 2 1 as the null distribution to determine if zero-inflation is significant, i.e., the ZIP distribution should be chosen over the Poisson distribution. Otherwise if there is over-dispersion, i.e.,x i· <ˆ 2 i , scDesign2 fits the NB and the ZINB distributions separately to x i· by MLE and then performs a likelihood ratio test with 2 1 as the null distribution to determine if zero-inflation is significant, i.e., the ZINB distribution should be chosen over the NB distribution. The default significance level (i.e., p-value cutoff) for both tests is For cell j's gene count vector X ·j 2 R p , although its i-th component X ij may not follow the Uniform[0, 1] distribution, we can transform X ij by applying the marginal CDF F i so that 1]. This allows us to write the joint CDF F as which is decomposable into the copula C and the marginal distributions F 1 , . . . , F p . Sklar's theorem states that such a decomposition exists uniquely for any continuous distribution F [103]. If F is discrete in any dimension, the copula C still exists but may not be unique, i.e., not identifiable [104,105]. To resolve this unidentifiability issue, scDesign2 uses the technique of distributional transform [106]: first draw V ij ⇠ Uniform[0, 1] independently for i = 1, . . . , p and j = 1, . . . , n; second define U ij as Then the CDF F of X ·j is defined as the copula C of U ·j = (U 1j , . . . , U pj ) T : where (u 1j , . . . , u pj ) T is a realization of (U 1j , . . . , U pj ) T . In scDesign2, we choose C as the Gaussian copula. Denoting by the CDF of a standard Gaussian distribution, we define F as F (x 1j , . . . , x pj ) = p ( 1 (u 1j ), . . . , 1 (u pj ); R) where p (·; R) is the CDF of a p-dimensional Gaussian distribution with a zero mean vector and 15 a covariance matrix R. To estimate R, we use the following procedure. Denote by (p i ,ˆ i ,μ i ) the estimated parameters of F i , which specify a fitted marginal distribution b F i . We sample v ⇤ ij from Uniform[0, 1] independently for i = 1, . . . , p and j = 1, . . . , n, and we calculate u ⇤ ij as Then we estimate R by the sample covariance matrix of To summarize, scDesign2 first estimate the marginal distributions F 1 , . . . , F p as b F 1 , . . . , b F p , each of which may be a fitted Poisson, ZIP, NB, or ZINB distribution. Then scDesign2 calculates u ⇤ ij 's as described above and estimates a p ⇥ p covariance matrix as b R. Finally, scDesign2 estimates the p-dimensional joint distribution F as whose estimated model parameters are b ⇥ = {p 1 ,ˆ 1 ,μ 1 , . . . ,p p ,ˆ p ,μ p , b R}. As a practical note, since X contains many zeros, before fitting the above generative model for each cell type, scDesign2 partitions the genes into three groups: the first group containing genes with zero proportions less than a cutoff (default 0.8), the second group containing genes with zero proportions between the cutoff and (n 2)/n, where n is the number of cells, and the last group including the remaining genes, i.e., genes expressed in fewer than three cells. For the first group, scDesign2 fits the above generative model jointly for its genes. For the second group, scDesign2 fits a marginal distribution for each individual gene. For the last group, scDesign2 only generates zero counts for all its genes.

Generation of synthetic single-cell gene expression count data
To generate synthetic scRNA-seq data for K cell types, scDesign2 first estimates the proportions of K cell types from the real scRNA-seq count matrix X, for which we denote the number of reads mapped to the n (k) cells of type k as N (k) , and the total number of reads mapped to all the n cells as N = P K k=1 N (k) . Denoting the cell type proportions as ⇡ = (⇡ (1) , . . . , ⇡ (K) ) T such that P K k=1 ⇡ (k) = 1, scDesign2 estimates them by⇡ = (⇡ (1) , . . . ,⇡ (K) ) T , wherê We denote the synthetic scRNA-seq data to be generated as X 0 , which contains n 0 cells and N 0 expected number of reads, with n 0 and N 0 as user-specified input parameters of scDesign2. Denoting the number of synthetic cells of type k as n (k)0 , scDesign2 draws the numbers of synthetic cells of all K cell types from a multinomial distribution, i.e., (n (1) 0 , . . . , n (K) 0 ) T ⇠ Multinomial(n 0 ,⇡).
Then given n (k) 0 , the expected number of reads assigned to cell type k in X 0 should be However, given the constraint that the expected total number of reads in X 0 is N 0 , we need to As a result, the scaling factor is , which does not depend on the cell type, and scDesign2 uses this scaling factor to rescale the mean parameter of every gene.
Given the fitted generative model b and the scaling factor r, scDesign2 generates n (k)0 synthetic cells from a rescaled model b Concretely, how the data generation works is that scDesign2 first draws n (k)0 vectors, denoted as z (k) ·j 0 2 IR p ; j = 1, . . . , n (k)0 , independently from p (·; b R (k) ). Then scDesign2 converts z i ) (including the Poisson, ZIP, and NB distributions as special cases), i = 1, . . . , p. Finally, scDesign2 outputs the Note that the synthetic count matrix X 0 does not contain exactly N 0 reads; rather, N 0 is the expected total number of reads. We think this setting mimics a real sequencing experiment, where the total number of sequenced reads would not be exactly the same as the preset sequencing depth N 0 due to experimental randomness.

Recommendation for cell clustering when input data do not have labelled cell types
If users would like to train scDesign2 on a gene-by-cell count matrix without cell type labels, a necessary preceding step is cell clustering. We recommend users to choose a state-of-the-art cell clustering method such as Seurat and SC3. For the resulting clusters, we recommend users to visualize them by t-SNE or UMAP and use a goodness-of-fit measure (e.g., Pearson's chi-square statistic and ROGUE score [107]) to check whether each gene approximately follows a NB or ZINB distribution in a cell cluster. This check will guide users to decide on an appropriate number of cell clusters in a data-driven way.

The scDesign2 variant without copula
The only difference between this variant "w/o copula" and scDesign2 is that this variant assumes the p genes to have independent marginal distributions F 1 , . . . , F p . The fitting of the p marginal distributions and the generation of synthetic data is the same as those in scDesign2.

Existing simulators
• scDesign: The R package scDesign version 1.0.0 is used for the analysis.
• splat, splat simple, kersplat: These methods are executed from the R packge splatter version 1.10.1.
• SPARSim: The R package SPARSim version 0.9.5 is used for the analysis.
• ZINB-WaVE: The ZINB-WaVE method is used from the wrapper functions in the R package splatter version 1.10.1.
• scDesign: The R package scDesign version 1.0.0 is used for the analysis.

Dimensionality reduction methods
• t-SNE: The R package Rtsne version 0.15 is used for generating t-SNE plots. The function Rtsne is used, with all parameters set to default, except check duplicate = FALSE and perplexity is changed from the default value of 30 to one third of the sample size when the sample size (total number of cells) is less than 90.
• PCA: The R function prcomp() is used for generating PCA plots, with parameters set as default.

Cell clustering methods
• Seurat: The Seurat clustering method is executed by the following instruction in this tutorial https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html. R package Seurat version 3.1.5 is used for the analysis.
html. R package SC3 version 1.14.0 is used for the analysis.

Rare cell type detection methods
• FiRE: The FiRE method is executed by the following instruction in this tutorial https:// github.com/princethewinner/FiRE. R package FiRE version 1.0 is used for the analysis. • CEL-Seq2: The CEL-Seq2 dataset measures the human pancreas [77]. The raw count dataset is downloaded from GEO with accession number GSE85241. Data for cell types alpha, beta, acinar, delta, duct, endothelial, mesenchymal and pancreatic polypeptide cell (pp) were selected for analysis. Spike-in RNA counts were filtered. The resulting count matrix contains 19049 genes and 2279 cells.
• Fluidigm C1: The Fluidigm C1 dataset measures human brain cells [78]. The raw count dataset is downloaded from GEO with accession number GSE67835. Data for cell types astrocytes, endothelial, fetal quiescent, hybrid neurons, oligodendrocytes and oligodendrocyte precursor cell (OPC) were selected for analysis. The resulting count matrix contains 22088 genes and 317 cells.
• Smart-Seq2: The Smart-Seq2 dataset measures human blood dendritic cells [79]. The raw count dataset is downloaded from GEO with accession number GSE94820. Data for dendrocyte subtypes 1-6 and monocyte subtypes 1-4 were selected for analysis. Spike-in RNA counts were filtered. The resulting count matrix contains 26586 genes and 1078 cells.
The cell type "Zero" is removed as it contains cells with almost no genes expressed, so seven cell types are retained. The processed data contain 84 genes and 2253 cells.   Smoothed relationships between three pairs of gene-wise statistics (zero proportion vs. mean, variance vs. mean, and cv vs. mean) across all genes (curves plotted by the R function geom smooth()) in the real data and the synthetic data generated by scDesign2 and the eight existing simulators (others). Note that ZINB-WaVE and SymSim filter out certain genes when simulating new data; Pearson correlation and Kendall's tau are only calculated between the genes whose zero proportions are less than 50%; gene-wise mean and variance and cell-wise library size are transformed to the log 10 (1 + x) scale (where x represents a statistic's value).

Pearson correlation matrices
Kendall's tau matrices Kendall's tau matrices. In (a) and (c), training and test data contain goblet cells measured by 10x Genomics [76]; In (b) and (d), training and test data contain cells of dedrocytes subtype 1 (DC1) measured by Smart-Seq2 [79]. For each cell type, the Pearson correlation matrices and Kendall's tau matrices are shown for the 100 genes with the highest mean expression values in the test data; the rows and columns (i.e., genes) of all the matrices are ordered by the complete-linkage hierarchical clustering of genes (using Kendall's tau as the similarity) in the test data. We find that the correlation matrices estimated from the sythetic data generated by scDesign2 most resemble those of training and test data. training data, test data, synthetic data generated by each simulator, and combinations of test data and each synthetic dataset. Gene expression counts are transformed as log(1 + count) before dimensionality reduction. We find that the synthetic data generated by scDesign2 most resemble the test data. Gene expression counts are transformed as log(1 + count) before dimensionality reduction. We find that the synthetic data generated by scDesign2 most resemble the real data.