Improving Metabolomics Coverage and Standardization
- Bremer, Parker Ladd
- Advisor(s): Fiehn, Oliver
Abstract
Progress in metabolomics has brought the field from investigations of pre-selected compound lists and limited sample sizes toward comprehensive compound exploration across large sample sizes. This shift in focus demanded corresponding advances in the informatics areas that we explore in this dissertation, such as in-silico compound identification tools, metabolomics meta-analysis, and metabolomics repository design.
In Chapter 1, we focus on compound identification. Compound identification is traditionally treated as an information retrieval problem, where unknown compounds are identified by comparing their observed signals to the signals of chemical standards. Unfortunately, the metabolome contains significantly more compounds than there are standards, so there is a desire to computationally expand the space of indexed signals. Here, we benchmark a tool, CFM-ID, that predicts the signal of a compound based on its structure. We show that much progress is still needed in this area by determining that CFM-ID’s predictions could be readily replicated via heuristic rules that focus on structure. Extrapolating from these findings emphasizes the need for larger and better-standardized training sets for machine learning models, given the complexity of the physics and statistical mechanics that mass spectrometry signals reflect.
In Chapter 2, we focus on meta-analysis of metabolomics studies. We believe that the synchronization of many independent datasets will allow for biological insights of high confidence and/or high generality. To this end, we developed a tool named BinDiscover, which allows for rapid hypothesis generation by enabling user-directed exploration of over 150,000 samples processed at the West Coast Metabolomics Center. We believe that this tool improves on existing repository meta-analysis for several reasons. First, it is programmatic in nature, which allows for meta-analysis on a timescale of minutes rather than months. Second, the meta-analysis that it enables is focused on sample metadata rather than study hypotheses, which dramatically expands the number of investigations that can be conducted. Third, it is dramatically easier to use than existing options. Finally, it showcases our novel procedure, ontologically-grouped differential analysis, which allows for the convenient comparison of categories of samples (e.g., mammalian digestive system organs vs. bacterial cells) in order to produce a tractable number of high-confidence results.
In Chapter 3, we focus on repository design. We strongly believe that extending the programmatic meta-analysis developed in Chapter 2 to larger-scale, community-contributed repositories of metabolomics data will enable massive clinical progress. To this end, we developed a tool that standardizes sample metadata. At present, user-submitted sample metadata matrices preclude programmatic meta-analysis because they suffer from the looseness and complexity of natural language. Our multistep standardization tool employs machine learning models embedded in an intuitive frontend to ensure that only high-quality sample descriptions are lodged in repositories.
Finally, in the appendix, we share several projects spanning the topics of the main chapters. In the first part, we share ClusterBase, a computational platform that uses network analysis to organize and annotate spectral data from metabolomics studies. In the second part, we share an automatic compound-ID workflow that harnesses the online CFM-ID tool. In the third part, we describe a machine learning approach to predicting spectral intensities that can augment quantum mechanically predicted spectra.