Nucleosomes are the basic units of chromatin structure. Not only do they control DNA packaging, but they also play a critical role in gene regulation by modulating the physical access of transcription factors to the DNA. In addition to determining the positions of nucleosomes and their occupancy levels, it is becoming increasingly important to resolve possible overlaps, extract additional information about nucleosomes (e.g., the probability of placement), and determine whether they are well-positioned or "fuzzy" in the sequenced cell sample.
In this dissertation, we address some of the computational issues associated with the analysis of sequencing data enriched for nucleosomes. We propose two novel algorithms for building nucleosome maps, one for single-end and one for paired-end sequencing data. We then study the problem of aligning these maps.
The first method, called NOrMAL, is based on a novel parametric probabilistic model of a nucleosome. Expectation maximization is used to learn the parameters of a Gaussian mixture model. Extensive experiments on real and synthetic data show that our method produces very accurate maps and detects a larger number of nucleosomes than published tools.
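To illustrate the core idea, the sketch below fits a one-dimensional Gaussian mixture to simulated read positions with expectation maximization. This is a deliberately simplified toy, not NOrMAL itself: the function name, the two-component setting, and the deterministic min/max initialization are all illustrative assumptions, and NOrMAL's actual model has additional parameters beyond a plain GMM.

```python
import math
import random

def em_gmm_1d(xs, k=2, iters=50):
    """Fit a k-component 1D Gaussian mixture to points xs via EM.

    Toy illustration of the EM step used to infer nucleosome centers
    from read positions; a hypothetical simplification, not NOrMAL.
    """
    # Deterministic initialization: spread means across the data range,
    # use the overall standard deviation as the starting spread.
    lo, hi = min(xs), max(xs)
    mus = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    mean = sum(xs) / len(xs)
    sigma0 = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs)) or 1.0
    sigmas = [sigma0] * k
    weights = [1.0 / k] * k

    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            ps = [w * math.exp(-(x - m) ** 2 / (2 * s ** 2)) / s
                  for w, m, s in zip(weights, mus, sigmas)]
            z = sum(ps) or 1e-300
            resp.append([p / z for p in ps])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) or 1e-300
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = math.sqrt(max(var, 1e-6))
    return mus, sigmas, weights

# Two simulated clusters of read centers around positions 100 and 400 bp.
rng = random.Random(1)
xs = ([rng.gauss(100, 15) for _ in range(200)] +
      [rng.gauss(400, 15) for _ in range(200)])
mus, sigmas, weights = em_gmm_1d(xs, k=2)
print(sorted(round(m) for m in mus))
```

With well-separated clusters, the recovered means land close to the true centers of 100 and 400.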
The second method, called PuFFIN, takes advantage of paired-end short reads to build genome-wide nucleosome maps. In contrast to other approaches that require users to optimize several parameters according to their data (e.g., the maximum allowed nucleosome overlap or the legal range of fragment sizes), our algorithm can accurately determine a genome-wide set of non-overlapping nucleosomes without any user-defined parameters. On real data, PuFFIN detects stronger associations between nucleosome occupancy and gene expression levels than other tools, which indicates that it extracts more biologically relevant features from the data.
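The final selection of a non-overlapping nucleosome set can be sketched as a score-ordered sweep over candidates. Everything here is an illustrative assumption: the function name, the `(center, score)` candidate representation, and the use of the canonical 147 bp footprint as the spacing constraint; PuFFIN's actual candidate generation and scoring are more involved.

```python
def pick_nonoverlapping(candidates, width=147):
    """Select a non-overlapping nucleosome set from scored candidates.

    Toy selection step: visit (center, score) pairs in decreasing
    score order, keeping a candidate only if its center is at least
    one nucleosome width (147 bp, the canonical footprint) away from
    every nucleosome already kept. Hypothetical simplification.
    """
    kept = []
    for center, score in sorted(candidates, key=lambda c: -c[1]):
        if all(abs(center - k) >= width for k, _ in kept):
            kept.append((center, score))
    return sorted(kept)

# Four candidates; the two around 320-330 overlap each other,
# and 180 overlaps the strong candidate at 100.
cands = [(100, 0.9), (180, 0.5), (320, 0.8), (330, 0.7)]
print(pick_nonoverlapping(cands))  # → [(100, 0.9), (320, 0.8)]
```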
Finally, we study the problem of aligning nucleosome maps, which is NP-complete when three or more maps are involved. We use effective bounding techniques to limit the size of the problem and solve it with linear programming. Our evaluation on synthetic data shows that our alignment tool consistently outperforms the naive (greedy) approach and is faster than dynamic programming.
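For context, the naive greedy baseline for two maps can be written in a few lines: repeatedly match the closest remaining pair of nucleosome centers within some distance cutoff. The function name and the 50 bp cutoff are illustrative choices, not values from the dissertation, and this pairwise greedy scheme is the baseline being compared against, not the LP-based method itself.

```python
def greedy_align(map_a, map_b, max_dist=50):
    """Greedy baseline for aligning two nucleosome maps.

    Maps are lists of nucleosome center coordinates. Candidate pairs
    within max_dist bp are processed in order of increasing distance,
    and each nucleosome is matched at most once. Illustrative sketch;
    the cutoff of 50 bp is an assumption.
    """
    pairs = sorted((abs(a - b), i, j)
                   for i, a in enumerate(map_a)
                   for j, b in enumerate(map_b)
                   if abs(a - b) <= max_dist)
    used_a, used_b, matches = set(), set(), []
    for _, i, j in pairs:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((map_a[i], map_b[j]))
    return matches

# 100 pairs with 110 and 400 with 420; 250 and 700 stay unmatched.
print(greedy_align([100, 250, 400], [110, 420, 700]))
# → [(100, 110), (400, 420)]
```

Greedy choices like these can be locally optimal but globally suboptimal, which is exactly why a global formulation (such as the linear program described above) can outperform this baseline.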