Efficient large-scale machine learning algorithms for genomic sequences
High-throughput sequencing (HTS) has led to many breakthroughs in basic and translational biology research. With this technology, researchers can interrogate whole genomes at single-nucleotide resolution. The large volume of data generated by HTS experiments necessitates the development of novel algorithms that can efficiently process these data. At the advent of HTS, several rudimentary methods were proposed. Often, these methods applied compromising strategies such as discarding a majority of the data or reducing the complexity of the models. This thesis focuses on the development of machine learning methods for efficiently capturing complex patterns from high volumes of HTS data.
First, we focus on on de novo motif discovery, a popular sequence analysis method that predates HTS. Given multiple input sequences, the goal of motif discovery is to identify one or more candidate motifs, which are biopolymer sequence patterns that are conjectured to have biological significance. In the context of transcription factor (TF) binding, motifs may represent the sequence binding preference of proteins. Traditional motif discovery algorithms do not scale well with the number of input sequences, which can make motif discovery intractable for the volume of data generated by HTS experiments. One common solution is to only perform motif discovery on a small fraction of the sequences. Scalable algorithms that simplify the motif models are popular alternatives. Our approach is a stochastic method that is scalable and retains the modeling power of past methods.
Second, we leverage deep learning methods to annotate the pathogenicity of genetic variants. Deep learning is a class of machine learning algorithms concerned with deep neural networks (DNNs). DNNs use a cascade of layers of nonlinear processing units for feature extraction and transformation. Each layer uses the output from the previous layer as its input. Similar to our novel motif discovery algorithm, artificial neural networks can be efficiently trained in a stochastic manner. Using a large labeled dataset comprised of tens of millions of pathogenic and benign genetic variants, we trained a deep neural network to discriminate between the two categories. Previous methods either focused only on variants lying in protein coding regions, which cover less than 2% of the human genome, or applied simpler models such as linear support vector machines, which can not usually capture non-linear patterns like deep neural networks can.
Finally, we discuss convolutional (CNN) and recurrent (RNN) neural networks, variations of DNNs that are especially well-suited for studying sequential data. Specifically, we stacked a bidirectional recurrent layer on top of a convolutional layer to form a hybrid model. The model accepts raw DNA sequences as inputs and predicts chromatin markers, including histone modifications, open chromatin, and transcription factor binding. In this specific application, the convolutional kernels are analogous to motifs, hence the model learning is essentially also performing motif discovery. Compared to a pure convolutional model, the hybrid model requires fewer free parameters to achieve superior performance. We conjecture that the recurrent layer allows our model spatial and orientation dependencies among motifs better than a pure convolutional model can. With some modifications to this framework, the model can accept cell type-specific features, such as gene expression and open chromatin DNase I cleavage, to accurately predict transcription factor binding across cell types. We submitted our model to the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge, where it was among the top performing models. We implemented several novel heuristics, which significantly reduced the training time and the computational overhead. These heuristics were instrumental to meet the Challenge deadlines and to make the method more accessible for the research community.
HTS has already transformed the landscape of basic and translational research, proving itself as a mainstay of modern biological research. As more data are generated and new assays are developed, there will be an increasing need for computational methods to integrate the data to yield new biological insights. We have only begun to scratch the surface of discovering what is possible from both an experimental and a computational perspective. Thus, further development of versatile and efficient statistical models is crucial to maintaining the momentum for new biological discoveries.