Next-generation sequencing technology has led to a deluge of genomic data. At first, this was limited to sequencing the genotype, or base pair sequence, of an organism, but it was later extended to detect which regions of the DNA are associated with molecular markers, have specific structure, or exhibit other characteristics, a field jointly called “epigenomics”. However, interpreting this data has proven to be quite difficult, due to its size and complexity and to our limited understanding of the underlying biology. To help unravel this data, we turn to machine learning models, which have been effective in fields with similar difficulties. We elaborate on background and motivation in chapter one. In the following chapters, we describe several novel machine learning methods we developed to address key problems in epigenomics.
In chapter two, we describe χ-SCNN, a method we developed to computationally increase the resolution of an experiment that measures the three-dimensional structure of the genome. χ-SCNN uses related epigenomic data (ChIP-seq, which measures molecular marks, dubbed ‘histone marks’, on DNA-associated proteins called histones, and DNase-seq, which measures general accessibility of DNA) to train a model to infer the source of DNA interactions. We show that it robustly fine-maps coarse interactions and predicts locations of functionally relevant regions.
In chapter three, we describe HMX, a clustering method we developed to annotate protein-coding genes based on their epigenetic landscapes. We show that it can be integrated with expression data to learn more about a specific cell type. In chapter four, we outline an extension of HMX, called EMX, which can be used to annotate the subunits of a gene, called exons and introns. These annotations can then be used to compare specific parts of different genes to each other. We present preliminary results from this project.
In chapter five, we describe ChIPs n DIP, a method we developed to deconvolve ChIP-seq signal into the sum of its direct and indirect components, and we present preliminary results. Finally, we summarize our conclusions in chapter six.