Heterogeneity in DNA Multiple Alignments: Modeling, Inference, and Applications in Motif Finding
Transcription factors bind sequence-specific sites in DNA to regulate gene transcription. Identifying transcription factor binding sites (TFBSs) is an important step toward understanding gene regulation. Although computational methods for TFBS detection and motif finding model TFBSs and their combinatorial patterns in sophisticated ways, they often make oversimplified, homogeneous assumptions about background sequences. Because nucleotide base composition varies across genomic regions, incorporating this heterogeneity into background modeling is expected to help motif finding. When sequences from multiple species are used, variation in evolutionary conservation also violates the common assumption of an identical conservation level throughout a multiple alignment. To handle both types of heterogeneity, we propose a generative model in which a segmented Markov chain partitions a multiple alignment into regions of homogeneous nucleotide base composition and a hidden Markov model (HMM) accounts for different conservation levels. Bayesian inference on the model is developed via Gibbs sampling with dynamic programming recursions. Simulation studies and empirical evidence from biological data sets reveal the dramatic effect of background modeling on motif finding and demonstrate that the proposed approach achieves substantial improvements over commonly used background models.
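As a rough illustration of what a dynamic programming recursion inside a Gibbs sampler can look like, the sketch below draws a path of hidden conservation states for the columns of an alignment using forward filtering and backward sampling, a standard technique for HMMs. It is not the model proposed here: the two-state setup, the reduction of each column to a conserved/not-conserved indicator, and the names `column_conserved` and `forward_filter_backward_sample` are all assumptions made for the example.

```python
import numpy as np

def column_conserved(alignment):
    """Return a 0/1 array with 1 where every sequence has the same base.

    Assumption for illustration: each alignment column is summarized by a
    single binary indicator rather than by a full evolutionary model.
    """
    return np.array([int(len(set(col)) == 1) for col in zip(*alignment)])

def forward_filter_backward_sample(obs, trans, emit, init, rng):
    """Draw one hidden-state path from its posterior given the observations.

    obs   : integer array of observed symbols, length n
    trans : (k, k) matrix, trans[i, j] = P(state j at t+1 | state i at t)
    emit  : (k, m) matrix, emit[i, s] = P(symbol s | state i)
    init  : length-k initial state distribution
    """
    n, k = len(obs), len(init)
    fwd = np.zeros((n, k))

    # Forward pass: normalized filtering distributions P(state_t | obs[:t+1]).
    f = init * emit[:, obs[0]]
    fwd[0] = f / f.sum()
    for t in range(1, n):
        f = (fwd[t - 1] @ trans) * emit[:, obs[t]]
        fwd[t] = f / f.sum()

    # Backward pass: sample the last state, then each earlier state
    # conditional on the state already sampled to its right.
    states = np.zeros(n, dtype=int)
    states[-1] = rng.choice(k, p=fwd[-1])
    for t in range(n - 2, -1, -1):
        w = fwd[t] * trans[:, states[t + 1]]
        states[t] = rng.choice(k, p=w / w.sum())
    return states

# Hypothetical toy alignment of three sequences (made-up data).
alignment = ["ACGTACGT", "ACGTTTTT", "ACGTAAAA"]
obs = column_conserved(alignment)

# Assumed parameters: state 0 = highly conserved, state 1 = weakly conserved.
trans = np.array([[0.9, 0.1], [0.1, 0.9]])
emit = np.array([[0.2, 0.8],   # state 0 mostly emits "conserved" (symbol 1)
                 [0.7, 0.3]])  # state 1 mostly emits "not conserved" (symbol 0)
init = np.array([0.5, 0.5])

path = forward_filter_backward_sample(obs, trans, emit, init,
                                      np.random.default_rng(0))
```

In a full Gibbs sampler, a draw like `path` would alternate with updates of the transition and emission parameters given the sampled states; here those parameters are simply fixed by hand.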