Beginning with microarray data in the 1990s, omics technologies have expanded rapidly over the last three decades. Proteomic, metabolomic, genomic, and epigenomic data are used to understand disease etiology, to detect diseases early, and to identify novel disease therapies. Almost every omics dataset is the result of a complicated experiment and data collection process. The unwanted variation introduced during the experimental process, along with biological complexity and heterogeneity, requires extensive exploratory data analysis and pre-processing to understand the variability within the data.
The goal of this dissertation is to demonstrate the need for appropriate exploratory data analysis and pre-processing in various omics data types, and to provide examples of both. Exploratory data analysis refers to extensive visualization and summarization of omics data in order to understand distributional properties of samples and features, to identify unwanted variation, and to determine biological patterns, among other aims. Work done during exploratory data analysis informs subsequent data pre-processing, that is, the series of steps taken to filter samples and features, to impute missing values, and to normalize or transform the data prior to performing formal statistical analyses.
Here, we first demonstrate exploratory data analysis and pre-processing within the context of single-cell RNA-sequencing (scRNA-seq) data. One goal of scRNA-seq is to expose possible heterogeneity within cell populations due to meaningful, biological variation. Examining cell-to-cell heterogeneity, and further, identifying subpopulations of cells based on scRNA-seq data, has been of broad interest in life science research. A key component of successfully identifying cell subpopulations (or clustering cells) is the (dis)similarity measure used to group the cells. We introduce a novel measure, named SIDEseq, to assess cell-to-cell similarity using scRNA-seq data. SIDEseq first identifies a list of putative differentially expressed (DE) genes for each pair of cells. SIDEseq then integrates the information from all the DE gene lists (corresponding to all pairs of cells) to build a similarity measure between two cells. SIDEseq can be used with any clustering algorithm that requires a (dis)similarity matrix. This new measure incorporates information from all cells when evaluating the similarity between any two cells, a characteristic not commonly found in existing (dis)similarity measures. This property is advantageous for two reasons: (a) borrowing information from cells of different subpopulations allows for the investigation of pair-wise cell relationships from a global perspective, and (b) information from other cells of the same subpopulation could help to ensure a robust relationship assessment. We applied SIDEseq to a newly generated human ovarian cancer scRNA-seq dataset, a public human embryo scRNA-seq dataset, and several simulated datasets. The clustering results suggest that the SIDEseq measure is capable of uncovering important relationships between cells, and that it performs at least as well as several popular (dis)similarity measures on these datasets.
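The idea of building a similarity between two cells from the DE gene lists of all cell pairs can be illustrated with a small sketch. This is not the SIDEseq statistic itself, whose exact definition is not given here. It is a simplified stand-in under stated assumptions: the putative DE list for a pair of cells is taken to be the `n_top` genes with the largest absolute difference in log-expression, and the similarity of cells i and j is the average Jaccard overlap between the lists (i, k) and (j, k) over every other cell k. The function names `top_de_genes` and `shared_de_similarity` and the toy matrix are hypothetical.

```python
import numpy as np


def top_de_genes(expr, i, k, n_top):
    """Putative DE list for cells i and k (assumption: the n_top genes
    with the largest absolute difference in log-expression)."""
    diff = np.abs(expr[:, i] - expr[:, k])
    return set(np.argsort(diff)[-n_top:].tolist())


def shared_de_similarity(expr, n_top=3):
    """Similarity matrix from shared DE genes (illustrative only).

    expr: genes x cells array of log-expression; needs at least 3 cells.
    Cells i and j are scored by how similar their DE lists are when each
    is compared against every other cell k, so all cells contribute.
    """
    n_cells = expr.shape[1]
    # One DE list per unordered pair of cells, stored under both orderings.
    lists = {}
    for i in range(n_cells):
        for k in range(i + 1, n_cells):
            lists[(i, k)] = lists[(k, i)] = top_de_genes(expr, i, k, n_top)

    sim = np.eye(n_cells)
    for i in range(n_cells):
        for j in range(i + 1, n_cells):
            overlaps = []
            for k in range(n_cells):
                if k in (i, j):
                    continue
                a, b = lists[(i, k)], lists[(j, k)]
                overlaps.append(len(a & b) / len(a | b))  # Jaccard index
            sim[i, j] = sim[j, i] = np.mean(overlaps)
    return sim


# Toy data: genes 0-2 separate cells {0, 1} from cells {2, 3};
# genes 3-5 only vary within each group.
expr = np.array(
    [[0.0, 0.0, 5.0, 5.0]] * 3 + [[0.0, 1.0, 0.0, 1.0]] * 3
)
sim = shared_de_similarity(expr, n_top=3)
dissimilarity = 1.0 - sim  # usable by any distance-based clustering method
```

In this toy example, cells 0 and 1 differ from every cell in the other group on the same marker genes, so their similarity is high, while cells from different groups share no DE genes relative to a common third cell.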
We then focus on exploratory data analysis and pre-processing in the context of adductomics data. Metabolism of chemicals from the diet, exposures to xenobiotics, the microbiome, and lifestyle factors (e.g., smoking, alcohol intake) produce reactive electrophiles that react with nucleophilic sites in DNA and proteins. Since many of these reactive intermediates are unknown, we previously reported an untargeted adductomics method to detect Cys34 modifications of human serum albumin (HSA) in human serum and plasma. Here, we extended that assay to investigate HSA-Cys34 adducts in archived newborn dried blood spots (DBS). As proof-of-principle, we applied the method to 49 archived DBS collected from newborns whose mothers either actively smoked during pregnancy or were nonsmokers. Twenty-six HSA-Cys34 adducts were detected, including Cys34 oxidation products, mixed disulfides with low-molecular-weight thiols (e.g., cysteine, homocysteine, glutathione, cysteinylglycine), and other modifications. We used careful exploratory data analysis and data pre-processing methods to uncover biological signal in this relatively new omics data type. With an ensemble of statistical approaches, the Cys34 adduct of cyanide was found to consistently discriminate between newborns with smoking versus nonsmoking mothers, with a mean fold change (smoking/nonsmoking) of 1.31. Our DBS-based adductomics method is currently being applied to discover in utero exposures to reactive chemicals and metabolites that may influence disease risks later in life.
Finally, we show how exploratory data analysis and pre-processing are essential for the successful analysis of untargeted metabolomics data. Untargeted metabolomics datasets contain large proportions of uninformative features that can impede subsequent statistical analyses such as biomarker discovery and metabolic pathway analysis. Thus, there is a need for versatile and data-adaptive methods for filtering data prior to investigating the underlying biological phenomena. Here, we propose a data-adaptive pipeline for filtering metabolomics data that are generated by liquid chromatography-mass spectrometry (LC-MS) platforms. Our data-adaptive pipeline includes novel methods for filtering features based on blank samples, proportions of missing values, and estimated intra-class correlation coefficients. Using metabolomics datasets that were generated in our laboratory from samples of human blood serum, as well as two public LC-MS datasets, we compared our data-adaptive filtering method with traditional methods that rely on non-method specific thresholds. The data-adaptive approach outperformed traditional approaches in terms of removing noisy features and retaining high-quality, biologically informative ones. Our proposed data-adaptive filtering pipeline is intuitive and effectively removes uninformative features from untargeted metabolomics datasets. It is particularly relevant for interrogation of biological phenomena in data derived from complex matrices associated with biospecimens.
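The three filtering criteria named above (blank samples, proportion of missing values, and intra-class correlation) can be sketched as a feature mask. This is a minimal illustration, not the proposed pipeline: the cutoffs are passed in as fixed placeholder arguments, whereas the point of the pipeline is to choose them from the data, and that step is not shown. The sketch also assumes log2 abundances with NaN for missing values, pooled quality-control (QC) replicates for estimating technical variance, and a simple ICC estimate of one minus the ratio of QC variance to biological-sample variance. The function name `filter_features` and all parameter names are hypothetical.

```python
import warnings

import numpy as np


def column_mean(x):
    """Per-feature mean ignoring NaN; NaN where a feature has no values."""
    counts = np.sum(~np.isnan(x), axis=0)
    return np.where(counts > 0, np.nansum(x, axis=0) / np.maximum(counts, 1), np.nan)


def filter_features(bio, blank, qc, min_log2_diff=1.0, max_missing=0.5, min_icc=0.4):
    """Boolean mask of features to keep.

    bio, blank, qc: samples x features arrays of log2 abundances (NaN = missing)
    for biological samples, blank samples, and pooled QC replicates.
    All three cutoffs are placeholders, not data-adaptive choices.
    """
    # 1. Blank filter: mean abundance in biological samples must exceed the
    #    mean in blanks by a margin; features never seen in blanks pass.
    blank_mean = column_mean(blank)
    keep_blank = np.isnan(blank_mean) | (column_mean(bio) - blank_mean >= min_log2_diff)

    # 2. Missing-value filter: proportion missing among biological samples.
    keep_missing = np.mean(np.isnan(bio), axis=0) <= max_missing

    # 3. ICC filter (assumed estimator): technical variance from QC replicates,
    #    total variance from biological samples.
    with warnings.catch_warnings(), np.errstate(divide="ignore", invalid="ignore"):
        warnings.simplefilter("ignore", RuntimeWarning)
        var_bio = np.nanvar(bio, axis=0, ddof=1)
        var_qc = np.nanvar(qc, axis=0, ddof=1)
        icc = np.clip(1.0 - var_qc / var_bio, 0.0, 1.0)
    keep_icc = icc >= min_icc  # NaN estimates compare as False and are dropped

    return keep_blank & keep_missing & keep_icc


nan = np.nan
# Four toy features: clean, blank contaminant, mostly missing, technically noisy.
bio = np.array([[10, 10, nan, 10], [12, 12, nan, 12], [14, 14, nan, 14], [16, 16, 12, 16.0]])
blank = np.array([[2, 13, 2, 2], [2, 13, 2, 2.0]])
qc = np.array([[13.0, 13.0, 13.0, 9], [13.1, 13.1, 13.1, 13], [12.9, 12.9, 12.9, 17]])
mask = filter_features(bio, blank, qc)
```

Only the first toy feature passes all three criteria; each of the others fails exactly the check it was constructed to fail.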