eScholarship
Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Statistical Aspects of ChIP-Seq Data Analysis

Abstract

ChIP-Seq experiments combine recently developed next-generation sequencing technology with established chromatin immunoprecipitation assays to study the interactions between various classes of proteins and DNA in the cell nucleus. The experiments consist of isolating the protein-DNA complexes from the nucleus, enriching the pool of DNA fragments for those bound to the protein of interest, and sequencing the resulting pool of fragments, producing millions of short reads that can be aligned to the genome. Although ChIP-Seq technology was developed very recently, many studies have already examined the DNA binding of a variety of transcription factors in different species and tissue types. ChIP-Seq approaches have also been used to study cellular epigenomic states such as histone modifications.

As with any nascent technology, a number of methodological issues need to be addressed before a proper data analysis pipeline for ChIP-Seq can be established. These include image processing and analysis, alignment of the reads to a genome or a subset of it, and identification of the signal sites along the genome. This work focuses on signal identification, a problem known in the literature as peak-finding.

We describe the data-generating process for ChIP-Seq experiments and review properties of the data and various sources of bias in Chapter 1. We then review various approaches to peak-finding in Chapter 2. We provide a detailed overview of some common strategies and their relative advantages and disadvantages, and describe the statistical models used by some popular peak-finding tools. We formalize the conceptual framework of peak-finding by introducing the notions of enrichment measures and enrichment statistics, and categorize various peak-finders in terms of this framework. We discuss in some detail the different kinds of control samples used in ChIP-Seq experiments and how they are incorporated into the peak-finding procedure. We also address the important issue of validation in the context of ChIP-Seq experiments and the shortcomings of the currently available validation approaches.

In Chapter 3 we propose a novel peak-finding strategy for experiments involving transcription factor binding that lack appropriate control samples (so-called one-sample experiments). Our approach accounts for genomic sequence biases in the data, namely the GC and mappability effects, and utilizes the knowledge of the shape of the read density profile in the vicinity of the true binding sites. We use deduced sets of true positive and true negative enriched regions to demonstrate that our approach is better at removing non-specifically enriched regions from the set of identified binding sites than other one-sample approaches and provides a superior spatial resolution to most examined peak-finders.

Finally, in Chapter 4 we discuss the important issue of combining data from replicate samples. We discuss different kinds of replicates common in the ChIP-Seq literature and the standard approaches used to integrate data across replicates. We develop several diagnostic plots for assessing whether the standard assumption of Poisson variance holds and observe that the assumption can break down even for technical replicates due to flow cell-specific sequence composition effects.
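
As a rough illustration of what checking the Poisson variance assumption can involve, the sketch below computes a pooled variance-to-mean ratio from per-bin read counts across replicates. It is not the diagnostic developed in the dissertation; the function name `pooled_dispersion`, the binning of reads into fixed genomic windows, and the assumption that replicates have comparable sequencing depth are all introduced here for illustration only. Under a Poisson model the ratio should be close to 1, while values well above 1 suggest extra-Poisson variation.

```python
from statistics import mean, variance


def pooled_dispersion(counts_per_replicate):
    """Pooled variance-to-mean ratio of read counts across replicates.

    counts_per_replicate: list of replicates (at least two), each a list
    with one read count per genomic bin; all replicates use the same bins.
    Assumes (hypothetically) that replicates have comparable depth.
    """
    n_bins = len(counts_per_replicate[0])
    total_variance = 0.0
    total_mean = 0.0
    for i in range(n_bins):
        # Counts observed in bin i, one value per replicate.
        values = [replicate[i] for replicate in counts_per_replicate]
        total_variance += variance(values)  # sample variance across replicates
        total_mean += mean(values)
    if total_mean == 0:
        raise ValueError("all bins are empty; the ratio is undefined")
    # Under Poisson sampling both sums estimate the same total rate.
    return total_variance / total_mean
```

Pooling over bins before taking the ratio keeps the estimate stable when only two or three replicates are available, which is the usual situation; a per-bin ratio would be far noisier.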
