Next generation sequencing (NGS) has revolutionized biomedical research and has a broad impact and applications. Since its advent around 15 years ago, this high scalable DNA sequencing technology has generated numerous biological data with new features and brought new challenges to data analysis. For example, researchers utilize RNA sequencing (RNA-seq) technology to more accurately quantify the gene expression levels. However, the NGS technology involves many processing steps and technical variations when measuring the expression values in the biological samples. In other words, the NGS data researchers observed could be biased due to the randomness and constraints in the NGS technology. This dissertation will mainly focus on microbiome sequencing data and single-cell RNA-seq (scRNA-seq) data. Both of them are highly sparse matrix-form count data. The zeros could either be biological or non-biological, and the high sparsity in the data have brought challenges to data analysis.
Missing data imputation problem has been studied in statistics and social science as the survey data often experience non-response to some of the survey questions and those unresponded questions will be marked as "NA" or missing values in the data. Imputation methods are used to provide a sophisticated guess for the missing values, and the purpose is to avoid discarding the collected samples and for the ease of using the state-of-the-art statistical methods. In machine learning, the famous Netflix data challenge regarding film recommendation system also falls into the missing data imputation problem category. Netflix wants to find a way to predict users' fondness of the movies they have not watched. The potential scores these users would give to the unwatched films are regarded as missing values in the data. NGS data imputation problem is different from the previous two cases in that the missing values in the NGS data are not so well-defined. The zeros in the NGS data could either come from the biological origin (should not be regarded as missing values) or non-biological origin (due to the limitation of the sequencing technology and should be regarded as missing values). The size (number of samples and features) of the NGS matrix data is usually larger than the size of survey data but smaller than the size of the recommendation system data. In addition, in most cases, the percentage of missing values in the survey data is less than the percentage of zeros in the NGS data, and the missing values in the film recommendation system data have the highest percentage (> 99.9%). As a result, the commonly used missing data imputation methods in statistics and machine learning are not directly applicable to NGS data. In recent years, numerous imputation methods have been proposed to deal with the highly sparse scRNA-seq data. In light of this, this dissertation aims to address two questions. First, the microbiome sequencing data, having additional information comparing to the scRNA-seq data, lacks an imputation method. Secondly, whether to use imputation or not in scRNA-seq data analysis is still a controversial problem.
The first part of this dissertation focuses on the first imputation method developed for the microbiome sequencing data: mbImpute. Microbiome studies have gained increased attention since many discoveries revealed connections between human microbiome compositions and diseases. A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data---mbImpute---to identify and recover likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. Comprehensive simulations verify that mbImpute achieves better imputation accuracy under multiple metrics, compared with five state-of-the-art imputation methods designed for non-microbiome data. In real data applications, we demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and mbImpute preserves non-zero distributions of taxa abundances.
The second part of this dissertation focuses on how to deal with high sparsity in the scRNA-seq data. ScRNA-seq technologies have revolutionized biomedical sciences by enabling genome-wide profiling of gene expression levels at an unprecedented single-cell resolution. A distinct characteristic of scRNA-seq data is the vast proportion of zeros unseen in bulk RNA-seq data. Researchers view these zeros differently: some regard zeros as biological signals representing no or low gene expression, while others regard zeros as false signals or missing data to be corrected. As a result, the scRNA-seq field faces much controversy regarding how to handle zeros in data analysis. We first discuss the sources of biological and non-biological zeros in scRNA-seq data. Second, we evaluate the impacts of non-biological zeros on cell clustering and differential gene expression analysis. Third, we summarize the advantages, disadvantages, and suitable users of three input data types: observed counts, imputed counts, and binarized counts and evaluate the performance of downstream analysis on these three input data types. Finally, we discuss the open questions regarding non-biological zeros, the need for benchmarking, and the importance of transparent analysis.