With the recent advancements in sequencing technologies, molecular biologists are producing ever-increasing amounts of biomolecular data. Extracting useful information from these massive data sets requires efficient and effective data mining and machine learning methods. In this dissertation, we explore the use of supervised machine learning (ML) to solve some challenging classification problems in molecular biology.
First, we devise an ML model for classifying cancer types from very sparse somatic point mutation data. The accumulation of mutations and epigenetic modifications in somatic cells gives rise to various cancers. To distinguish among cancer types, we propose a method called mClass for efficient feature (gene) ranking that uses clustering, normalized mutual information, and logistic regression. We show that somatic mutation data has sufficient discriminative power for cancer type classification.
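As an illustration of the ranking idea only, the following minimal sketch scores each gene by the normalized mutual information between its binary mutation status and the cancer-type label, then sorts genes by that score. It is not the mClass implementation: the clustering and logistic regression steps are omitted, and the function names, the square-root normalization, the gene names, and the toy data are all assumptions made for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def nmi(xs, ys):
    """Mutual information normalized by sqrt(H(X) * H(Y)); 0.0 if either is constant."""
    hx, hy = entropy(xs), entropy(ys)
    if hx == 0 or hy == 0:
        return 0.0
    return mutual_information(xs, ys) / math.sqrt(hx * hy)


def rank_genes(mutations, labels):
    """Return gene names sorted by decreasing NMI with the class labels."""
    scores = {gene: nmi(column, labels) for gene, column in mutations.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: four samples, two cancer types, three made-up genes.
# Each list holds the mutation status (1 = mutated) of one gene across samples.
labels = ["typeA", "typeA", "typeB", "typeB"]
mutations = {
    "gene1": [1, 1, 0, 0],  # mutation status tracks the label exactly
    "gene2": [1, 0, 1, 0],  # mutation status is independent of the label
    "gene3": [0, 0, 0, 0],  # never mutated, so it carries no information
}
ranking = rank_genes(mutations, labels)
```

In this toy example the gene whose mutation pattern matches the labels is ranked first; in sparse real data most genes resemble the uninformative ones, which is why a ranking step of this kind helps before a classifier is fitted.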
Next, we address the problem of gene essentiality prediction in microbes. Identifying essential genes is important because their function is vital for the survival of the organism. Our proposed deep learning architecture, called DeeplyEssential, exclusively uses features extracted from the primary sequence of genes and their corresponding proteins, to maximize the utility and practicality of the tool. DeeplyEssential achieves state-of-the-art performance compared with previously proposed methods. We also expose and study a hidden performance bias that affected previous models.
Finally, we consider the problem of predicting enhancer regions in the human genome from chromatin data. Enhancers contribute to the transcription of target genes. We propose a convolutional neural network framework named Epi2En that takes advantage of epigenetic ChIP-seq data. Epi2En's classification performance is very strong not only in cross-validation experiments but also when testing across different cell lines.