Reproducible Analysis of Complex Data & Prediction of Bioactive Natural Compounds
- Zhang, Le
- Advisor(s): Girke, Thomas
Abstract
This dissertation presents three projects, each described in its own chapter. The first two projects develop software for reproducible research involving big data, and the third is a large-scale discovery of drug-like natural compounds. The following briefly summarizes each chapter.
Chapter 1 introduces the general research area of each project. These introductions focus on the status of the field, current challenges, and future needs. Each project introduction closes with an overview of the specific solutions proposed and implemented in this dissertation.
Chapter 2 describes the design and development of systemPipeR. This generic software environment provides flexible utilities for designing, building and running automated end-to-end analysis workflows for a wide range of research applications. Important functionalities include a uniform workflow interface across different data analysis applications, automated report generation, and support for running both R and command-line software, such as big data analysis tools, on personal computers, HPC clusters, and cloud systems.
Chapter 3 introduces systemPipeShiny. This web system extends the systemPipeR workflow environment with a versatile graphical user interface provided by a Shiny App. It allows non-R users to run many of systemPipeR's workflow design, control, and visualization functionalities interactively without requiring command-line knowledge. It also integrates highly interactive visualization functionalities and a graphics workbench.
Chapter 4 presents a large-scale discovery project of bioactive natural compounds. First, the known structures and annotations of about 0.5 million natural compounds and about 4,000 known drugs were organized in a relational database. Second, the assembled database was used for virtual screening, where a series of supervised and unsupervised machine learning methods was applied to predict drug-like natural product candidates. This screening also identified drugs that have natural compound alternatives with identical structures, and vice versa. Third, known drug-target annotations were used to systematically identify which molecular mechanisms and disease processes can be perturbed by bioactive natural compounds. This study identified a large panel of new bioactive natural compounds with interesting applications in human health, including healthy aging and cancer treatment.