Contrast Learning on ChIP-Seq Data of Transcription Factors
- Author(s): Lee, Yuju
- Advisor(s): Zhou, Qing
- et al.
In this study, we analyzed transcription factor (TF) ChIP-Seq data for 105 (i.e., 15 choose 2) TF pairs. For each pair, three binding-dependent (BD) sequence datasets were generated from the two TF ChIP-Seq datasets. These three scenario datasets contain transcription factor binding site (TFBS) sequences of TF type 1, type 2, or both (i.e., 1 and 2).
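The pair count can be checked with a short sketch. The TF names and dataset labels below are hypothetical placeholders, since the abstract does not list the 15 factors or name the BD datasets.

```python
from itertools import combinations

# Hypothetical placeholder names; the actual 15 TFs are not listed in the abstract.
tfs = [f"TF{i}" for i in range(1, 16)]

# Every unordered pair of distinct TFs: C(15, 2) = 105.
pairs = list(combinations(tfs, 2))

# Illustrative labels for the three BD datasets generated for each pair.
bd_labels = ("type_1", "type_2", "both")

# One entry per (pair, BD dataset) combination.
bd_datasets = [(pair, label) for pair in pairs for label in bd_labels]

print(len(pairs), len(bd_datasets))
```

Under these assumptions, the study would involve 105 pairs and three BD datasets per pair.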
The objective is to identify motif 1, motif 2, or both (i.e., interactive motifs) by contrasting two of the three BD datasets at a time using the contrast-motif-finder (CMF) algorithm. Each CMF output provides not only estimated consensus motifs based on position weight matrices (PWMs) but also likelihood ratios (LRs) that measure the enrichment of each identified motif. Using this idea, we construct a dataset in which the first column lists the genomic locations of the identified enriched motifs, columns 2 to n+1 contain the estimated consensus motifs, and the last column is a binary (0/1) indicator of which set each entry comes from, where n is the number of consensus motifs.
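The table layout described above can be sketched as follows. This is a minimal illustration, not the CMF output format: the function `build_rows`, the motif names, the locations, and the use of a numeric per-motif score in columns 2 to n+1 are all assumptions made for the example.

```python
from itertools import combinations

# Contrasting two of the three BD datasets at a time gives three contrasts.
# The labels are illustrative, not taken from the study.
contrasts = list(combinations(("type_1", "type_2", "both"), 2))

def build_rows(hits_a, hits_b, motif_names):
    """Build rows of the form [location, score_1, ..., score_n, set label].

    hits_a and hits_b map a genomic location to a dict of
    motif name -> score (hypothetical values); a missing motif scores 0.0.
    """
    rows = []
    for label, hits in ((0, hits_a), (1, hits_b)):
        for location, scores in hits.items():
            features = [scores.get(m, 0.0) for m in motif_names]
            rows.append([location, *features, label])
    return rows

# Toy input for one contrast with n = 2 motifs.
motifs = ["motif_1", "motif_2"]
set_a = {"chr1:100-120": {"motif_1": 2.5}}
set_b = {"chr2:500-520": {"motif_1": 0.3, "motif_2": 4.1}}
table = build_rows(set_a, set_b, motifs)
```

Each row has n+2 entries: the location, one value per consensus motif, and the 0/1 set indicator.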
Once these datasets are obtained, we fit statistical models such as logistic regression, support vector machines (SVMs), and classification trees, and evaluate their performance (i.e., error rates) and selection power. We show that the SVM with a radial kernel appears to perform best when all motifs in the dataset are used, whereas the classification tree selects the fewest motifs in almost every analyzed dataset while its error rate and selection power do not deteriorate much. As a result, we believe the classification tree is the better model: it provides predictive power that is competitive with simpler models and takes far less computational time than the other two models.
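To illustrate how a tree-based classifier can select a small number of motifs while reporting an error rate, here is a minimal sketch of a depth-one classification tree (a decision stump) in plain Python. It is a simplified stand-in, not the models fitted in the study; the function `fit_stump` and the toy data are hypothetical, and the error shown is a training error rather than a cross-validated one.

```python
def fit_stump(X, y):
    """Return (feature index, threshold, left label, error rate) of the best single split."""
    n = len(y)
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for left_label in (0, 1):
                # Predict left_label when feature j <= t, the other class otherwise.
                preds = [left_label if row[j] <= t else 1 - left_label for row in X]
                err = sum(p != yi for p, yi in zip(preds, y)) / n
                if best is None or err < best[3]:
                    best = (j, t, left_label, err)
    return best

# Toy data: column 0 is uninformative, column 1 separates the two sets.
X = [[0.2, 0.1], [0.9, 0.3], [0.4, 3.8], [0.7, 4.2]]
y = [0, 0, 1, 1]
feature, threshold, left_label, error = fit_stump(X, y)
print(feature, threshold, left_label, error)
```

On this toy input the stump splits on the second column only, which mirrors the idea of a tree that uses few motifs yet keeps the error rate low.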