Progress in metabolomics has brought the field from investigations of pre-selected
compound lists and limited sample sizes toward comprehensive compound exploration of large
sample sizes. This shift in focus demanded corresponding advances in informatics areas that we
explore in this dissertation, such as in-silico compound identification tools, metabolomics meta-
analysis, and metabolomics repository design.
In Chapter 1, we focus on compound identification. Compound identification is
traditionally treated as an information retrieval problem, where unknown compounds are identified
by comparing their observed signals to the signals of chemical standards. Unfortunately, the
metabolome contains significantly more compounds than standards, so there is a desire to
computationally expand the space of indexed signals. Here, we benchmark a tool, CFM-ID, that
predicts the signal of a compound based on its structure. We show that much progress is still
needed in this area by demonstrating that CFM-ID’s predictions could be readily replicated via
heuristic rules that focus on structure. Extrapolating these ideas emphasizes the need for larger
and better-standardized machine learning training sets, given the complexity of the physics
and statistical mechanics that mass spectrometry signals reflect.
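The retrieval framing described above can be illustrated with a minimal sketch. It assumes spectra are stored as dictionaries mapping a binned m/z value to an intensity and that candidates are ranked by cosine similarity; the function names, the toy library, and the choice of cosine similarity are illustrative assumptions, not the scoring used by CFM-ID or by this dissertation.

```python
import math


def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two spectra given as {binned m/z: intensity}."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


def best_match(query, library):
    """Return (name, score) of the library spectrum most similar to the query."""
    scored = ((name, cosine_similarity(query, spec)) for name, spec in library.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy library of chemical standards (values are made up).
library = {
    "standard_A": {73: 100.0, 147: 50.0, 217: 25.0},
    "standard_B": {57: 100.0, 71: 80.0, 85: 40.0},
}
query = {73: 90.0, 147: 55.0, 217: 20.0}

name, score = best_match(query, library)
print(name, round(score, 3))
```

In this framing, a spectrum predicted in silico from a structure would simply be added to the library as another entry, which is how prediction tools expand the space of indexed signals.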
In Chapter 2, we focus on meta-analysis of metabolomics studies. We believe that the
synchronization of many independent datasets will allow for biological insights of high confidence
and/or high generality. To this end, we developed a tool named BinDiscover, which allows for
rapid hypothesis generation by enabling user-directed exploration of over 150,000 samples
processed at the West Coast Metabolomics Center. We believe that this tool improves existing
repository meta-analysis for several reasons. First, it is programmatic in nature, which allows for
meta-analysis on a timescale of minutes rather than months. Second, the meta-analysis that it
enables is focused on sample metadata rather than study hypotheses, which dramatically expands
the number of investigations that can be conducted. Third, it is dramatically easier to use than
existing options. Finally, it showcases our novel procedure, ontologically-grouped-differential
analysis, which allows for the convenient comparison of categories of samples (e.g., mammalian
digestive system organs vs. bacterial cells) in order to produce tractable amounts of high-confidence
results.
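The idea of comparing categories of samples rather than individual sample types can be sketched as follows. This is only an illustration under stated assumptions: the parent map `PARENT`, the sample records, and the use of a log2 ratio of group medians are all hypothetical, and the sketch is not BinDiscover's actual procedure or statistics.

```python
import math
from statistics import median

# Hypothetical ontology fragment: each term points to its parent category.
PARENT = {
    "liver": "digestive system organ",
    "stomach": "digestive system organ",
    "digestive system organ": "organ",
    "E. coli cells": "bacterial cells",
}


def ancestors(term):
    """Return the term followed by every ancestor reachable through PARENT."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def group_intensities(samples, category):
    """Collect intensities of samples whose term falls under the category."""
    return [s["intensity"] for s in samples if category in ancestors(s["term"])]


def log2_fold_change(samples, category_a, category_b):
    """Log2 ratio of median intensities; assumes strictly positive intensities."""
    a = group_intensities(samples, category_a)
    b = group_intensities(samples, category_b)
    if not a or not b:
        raise ValueError("both categories need at least one sample")
    return math.log2(median(a) / median(b))


# Made-up intensities of one compound across samples.
samples = [
    {"term": "liver", "intensity": 800.0},
    {"term": "stomach", "intensity": 1200.0},
    {"term": "liver", "intensity": 1000.0},
    {"term": "E. coli cells", "intensity": 250.0},
    {"term": "E. coli cells", "intensity": 250.0},
]

lfc = log2_fold_change(samples, "digestive system organ", "bacterial cells")
print(lfc)
```

Grouping through the ontology means one comparison covers every organ under the chosen category, which keeps the number of results manageable.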
In Chapter 3, we focus on repository design. We strongly believe that extending the
programmatic meta-analysis developed in Chapter 2 to larger-scale, community-contributed
repositories of metabolomics data will enable massive clinical progress. To this end,
we developed a tool that standardizes sample metadata. At present, user-submitted sample
metadata matrices preclude programmatic meta-analysis because they suffer from the looseness
and complexity of natural language. Our multistep standardization tool employs machine learning
models embedded into an intuitive frontend to ensure that only high-quality sample descriptions
are deposited into repositories.
Finally, in the appendix, we share several projects spanning the topics of the main chapters.
In the first part, we share ClusterBase, which is a computational platform that uses network
analysis to organize and annotate spectral data from metabolomics studies. In the second part, we
share an automatic compound-ID workflow that harnessed the online CFM-ID tool. Finally, in the
third part, we describe a machine learning approach to predicting spectral intensities that can
augment quantum mechanically predicted spectra.