Today, as the world is stricken by the proliferation of novel infectious pathogens, we face an urgent need for new anti-infective therapeutic agents. Natural products, also known as specialized metabolites, are chemical compounds produced by living organisms and have served as an excellent source for drug discovery. Many clinically used small molecules, including various antimicrobial, anticancer, antiviral, and immunosuppressant drugs, are either natural products or inspired by them. Traditionally, natural products were discovered mostly through slow and laborious experiments that often led to the rediscovery of previously known compounds.
Over the past decade, advances in short- and long-read (meta)genomics and tandem mass spectrometry (MS/MS) technologies have provided an unprecedented resource for large-scale natural product discovery. To keep pace with these advances, scalable bioinformatics algorithms are required to leverage these massive datasets and enable analyses of natural products across thousands of samples. In this dissertation, I present several scalable computational methods for discovering novel natural products using (MS/MS-based) metabolomics and/or (meta)genomics data.
In the first chapter, I present CycloNovo, the first algorithm for scalable de novo sequencing of MS/MS data to discover cyclic and branch-cyclic peptides (referred to as cyclopeptides). Cyclopeptides constitute a diverse and biomedically important class of natural products. CycloNovo employs de Bruijn graphs, the workhorse of DNA sequencing algorithms, for efficient cyclopeptide sequencing. It revealed a wealth of novel cyclopeptides, including a large hidden cyclopeptidome in the human gut.
In the following chapters, I discuss bioinformatics methods for discovering Non-Ribosomal Peptides (NRPs), a class that includes a multitude of antibiotics and other clinically used drugs. NRPs are produced by metabolic pathways partially encoded by Biosynthetic Gene Clusters (BGCs). In the second chapter, I present NRPminer, a modification-tolerant and scalable algorithm that discovers NRPs by integrating (meta)genomic and MS/MS data. NRPminer identified many novel NRPs from different origins, including novel NRPs produced by soil-associated microbes and human microbiota. Finally, I discuss the problem of identifying NRP-producing BGCs in the human gut microbiome, and I show that long-read metagenomic assemblies can be used to reveal many BGCs that synthesize previously unknown NRPs in the human gut microbiome.