In this dissertation, we investigate two problems in computational biology that can be solved using machine learning methods, specifically deep
learning architectures.
In the first, we study the problem of predicting histone post-translational modifications (PTMs) from transcription factor binding
data and the primary DNA sequence. Histone PTMs are involved in a variety
of essential regulatory processes in the cell, including transcription
control. Here we introduce a deep learning architecture called DeepPTM for
predicting histone PTMs. Extensive experimental results show that DeepPTM
achieves higher prediction accuracy than both the model proposed in Benveniste et
al. (PNAS, 2014) and DeepHistone (BMC Genomics, 2019). The competitive
advantage of our framework lies in the synergistic use of deep learning
combined with an effective pre-processing step. Our classification
framework has also enabled the discovery that a small
subset of transcription factors (specific to the histone PTM and the
cell type) provides almost the same prediction accuracy as
the full set of transcription factor data.
In the second, we investigate the problem of predicting single guide RNA (sgRNA) CRISPR-Cas9 and CRISPR-Cas12a activity from the primary sequence
of the sgRNA. A negative selection screen in the absence of non-homologous end-joining (the dominant DNA repair mechanism) is used to generate sgRNA activity profiles for both SpCas9 and LbCas12a in the non-conventional yeasts \emph{Yarrowia lipolytica} and \emph{Kluyveromyces marxianus}. This genome-wide data serves as input to a deep learning algorithm, DeepGuide, that accurately predicts guide activity. DeepGuide uses unsupervised learning to obtain a compressed representation of the genome, followed by supervised learning to map sgRNA sequence, genomic context, and epigenetic features to guide activity. Experimental validation, both genome-wide and on a subset of selected genes, confirms DeepGuide’s ability to accurately predict high-activity sgRNAs.
We also show that the prediction accuracy of DeepGuide can be further improved by incorporating sgRNA samples from screens of the genome-wide library under different conditions, which vary the carbon source (glucose, xylose, and lactose) and the temperature at which the non-conventional yeast is grown. To the best of our knowledge, our method is the first sgRNA predictive tool that employs guides from different screening conditions to improve prediction performance.