Benchmarking Statistical and Machine-Learning Methods for Single-cell RNA Sequencing Data
- Author(s): Xi, Nan
- Advisor(s): Li, Jingyi
- et al.
The large-scale, high-dimensional, and sparse single-cell RNA sequencing (scRNA-seq) data have raised great challenges in the pipeline of data analysis. A large number of statistical and machine learning methods have been developed to analyze scRNA-seq data and answer related scientific questions. Although different methods claim advantages in certain circumstances, it is difficult for users to select appropriate methods for their analysis tasks. Benchmark studies aim to provide recommendations for method selection based on an objective, accurate, and comprehensive comparison among cutting-edge methods. They can also offer suggestions for further methodological development through massive evaluations conducted on real data.
In Chapter 2, we conduct the first, systematic benchmark study of nine cutting-edge computational doublet-detection methods. In scRNA-seq, doublets form when two cells are encapsulated into one reaction volume by chance. The existence of doublets, which appear as but are not real cells, is a key confounder in scRNA-seq data analysis. Computational methods have been developed to detect doublets in scRNA-seq data; however, the scRNA-seq field lacks a comprehensive benchmarking of these methods, making it difficult for researchers to choose an appropriate method for their specific analysis needs. Our benchmark study compares doublet-detection methods in terms of their detection accuracy under various experimental settings, impacts on downstream analyses, and computational efficiency. Our results show that existing methods exhibited diverse performance and distinct advantages in different aspects.
In Chapter 3, we develop an R package DoubletCollection to integrate the installation and execution of different doublet-detection methods. Traditional benchmark studies can be quickly out-of-date due to their static design and the rapid growth of available methods. DoubletCollection addresses this issue in benchmarking doublet-detection methods for scRNA-seq data. DoubletCollection provides a unified interface to perform and visualize downstream analysis after doublet-detection. Additionally, we created a protocol using DoubletCollection to execute and benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.
In Chapter 4, we conduct the first comprehensive empirical study to explore the best modeling strategy for autoencoder-based imputation methods specific to scRNA-seq data. The autoencoder-based imputation method is a family of promising methods to denoise sparse scRNA-seq data; however, the design of autoencoders has not been formally discussed in the literature. Current autoencoder-based imputation methods either borrow the practice from other fields or design the model on an ad hoc basis. We find that the method performance is sensitive to the key hyperparameter of autoencoders, including architecture, activation function, and regularization. Their optimal settings on scRNA-seq are largely different from those on other data types. Our results emphasize the importance of exploring hyperparameter space in such complex and flexible methods. Our work also points out the future direction of improving current methods.