This paper surveys and extends models and algorithms for identifying binding sites in non-coding regions of DNA. These sites control the transcription of genes into messenger RNA in preparation for translation into proteins. We summarize the underlying biology, review three different models for binding site identification, and present a unified model that borrows from the previous models and integrates their main features. We then describe maximum likelihood and maximum a posteriori algorithms for fitting the unified model to data. Finally, we conclude with a prospectus of future data analyses and theoretical research.

# Your search: "author:"Sabatti, Chiara""

## Scholarly Works (25 results)

The identification of binding sites for regulatory proteins in the upstream region of genes is an important step toward understanding transcription regulation. In recent years, novel experimental techniques, such as gene expression arrays, and the availability of entire genome sequences have opened the possibility of more detailed investigations in this domain. Traditionally, the reconstruction of the profile of a binding site and the localization of all its occurrences in a sequence are treated as separate problems. The first is tackled using a small group of sequences, known or suspected to contain the binding site, but with neither position nor pattern known. One successful approach to this reconstruction problem is based on a probabilistic model of the sequence, represented as a concatenation of background and motif stochastic words. Maximum likelihood or maximum a posteriori estimates are obtained with EM or Gibbs-sampler algorithms [13, 14]. The second problem is approached by considering one or multiple sequences of variable length; the pattern characterizing the motif is assumed known. Possible locations are identified on the basis of scoring functions that highlight the similarity of the motif to portions of the sequence. Cut-off values for such similarity scores are hard to determine: ad hoc solutions or estimates from a training set are often adopted [17, 18]. Typically these techniques are used to scan one sequence of interest against a database of known binding sites. While there are historical and practical reasons to consider these two problems as separate, the current post-genomic era, in which we are confronted with a large abundance of sequence, calls for a different approach. Consider the problem, tackled in [18], of identifying all the binding sites of the known regulatory proteins in the genome of E. coli.
While formally similar to blasting a small sequence of interest against a database of known regulatory proteins, this genome-wide search differs substantially. On the one hand, as one scans through the genome for binding sites of LexA (to take one example) and finds a substantial number of them, it seems appropriate to use the information in the identified locations to update the current pattern description. On the other hand, given that the output is not going to be a small number of sites that can be further investigated, but a large collection of them, the choice of significance cut-off should be based on proper probabilistic statements. To address these issues, one would need a probability model for the entire genome sequence that leads to the evaluation of specific a posteriori probabilities of appearance of a binding site at any given location, and whose parameters can be estimated on the basis of data. At the same time, given the scale of the problem, the model should be suitable for rapid computation. In an attempt to address this need, we introduce here the Vocabulon model. Section 2 gives a description of the probability model we employ, its differences from others in the literature, and its current implementation. We then present the results of multiple investigations on the E. coli sequence. Given that genome-wide information on the location of binding sites is not available, we used results of gene expression array experiments to corroborate our results, arguing in favor of a novel perspective in array analysis.
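The scoring-function approach that the abstract contrasts with a full probability model can be illustrated with a position weight matrix scan. The following is a minimal sketch, not the Vocabulon model: the function names, the pseudocount, and the uniform background are assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background, pseudocount=1.0):
    """Turn per-position base counts into log-odds scores.

    counts: list of dicts (one per motif position) mapping base -> count.
    background: dict mapping base -> background frequency (assumed known).
    """
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log(((column.get(b, 0) + pseudocount) / total) / background[b])
            for b in BASES
        })
    return matrix

def scan(sequence, matrix):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(matrix)
    return [
        (start, sum(matrix[j][sequence[start + j]] for j in range(width)))
        for start in range(len(sequence) - width + 1)
    ]

# Hypothetical motif built from ten aligned copies of "TATA".
counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
background = {b: 0.25 for b in BASES}
scores = scan("GGTATAGG", log_odds_matrix(counts, background))
best_start, best_score = max(scores, key=lambda pair: pair[1])
```

A scan like this still leaves the cut-off on `best_score` to be chosen by hand, which is the difficulty the abstract raises.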

We describe a framework in which DNA sequence information and expression array data are used in concert to analyze the effects of a collection of regulatory proteins on genomic expression levels. The search for potential binding sites in sequence data leads to the identification of potential target genes for each transcription factor. The analysis of array data with a Bayesian hidden component model allows us to identify which of the potential binding sites are actually used by the regulatory proteins in the studied cell conditions, the strength of their control, and their activation profile in a series of experiments. We apply our methodology to 35 expression studies in E. coli.

Affymetrix’s SNP (single nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of a SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. The present paper describes our attempt to improve genotype calling by constructing Gaussian penetrance models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, Gaussian penetrance models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the BIC criterion, we select the best combination of symmetry assumptions. Compared to the genotype calling results obtained from Affymetrix’s software, we are able to reduce the number of no-calls substantially and quantify the level of confidence in all calls. Once pedigree analysis software can accommodate soft penetrances, we can expect to see more reliable association and linkage studies with less wasted genotyping data.

Gene microarray technology is often used to compare the expression of thousands of genes in two different cell lines. Typically, one does not expect measurable changes in transcription amounts for a large number of genes; furthermore, the noise level of array experiments is rather high in relation to the available number of replicates. For the purpose of statistical analysis, inference on the “population” difference in expression for genes across the two cell lines is often cast in the framework of hypothesis testing, with the null hypothesis being no change in expression. Given that thousands of genes are investigated at the same time, this requires some multiple comparison correction procedure to be in place. We argue that hypothesis testing, with its emphasis on type I error and family analogues, may not address the exploratory nature of most microarray experiments. We instead propose viewing the problem as one of estimation of a vector known to have a large number of zero components. In a Bayesian framework, we describe the prior knowledge on expression changes using mixture priors that incorporate a mass at zero and we choose a loss function that favors the selection of sparse solutions. We consider two different models applicable to the microarray problem, depending on the nature of replicates available, and show how to explore the posterior distributions of the parameters using MCMC. Simulations show an interesting connection between this Bayesian estimation framework and both false discovery rate (FDR) control and misclassification-minimizing procedures. Finally, two empirical examples illustrate the practical advantages of this Bayesian estimation paradigm.
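To make the idea of a mixture prior with a mass at zero concrete, the sketch below computes the posterior probability that a single gene's true change is nonzero. It assumes a much simpler setting than the paper's models: one observed difference with known Gaussian noise variance, a Gaussian "slab" for nonzero changes, and fixed hyperparameters, so no MCMC is needed. All names and parameter values are hypothetical.

```python
import math

def normal_pdf(x, variance):
    """Density at x of a zero-mean Gaussian with the given variance."""
    return math.exp(-x * x / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def posterior_nonzero(x, weight, noise_var, slab_var):
    """P(true change != 0 | x) under the prior
    (1 - weight) * point mass at 0 + weight * N(0, slab_var),
    with observation x = change + N(0, noise_var) noise.
    """
    # Marginal density of x when the change is drawn from the slab.
    slab = weight * normal_pdf(x, noise_var + slab_var)
    # Marginal density of x when the change is exactly zero.
    spike = (1.0 - weight) * normal_pdf(x, noise_var)
    return slab / (slab + spike)

small = posterior_nonzero(0.0, weight=0.5, noise_var=1.0, slab_var=3.0)
large = posterior_nonzero(5.0, weight=0.5, noise_var=1.0, slab_var=3.0)
```

An observation near zero is attributed mostly to the point mass, while a large observation is attributed almost entirely to the slab, which is how such a prior encourages sparse estimates.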

We discuss the value of volume measures for linkage disequilibrium, showing how they are robust to small-sample variation and easily generalized to multi-allelic markers. In particular, we introduce Dvol, a volume analogue to D', and show that it performs substantially better when the sample size is small to moderate. Mvol is proposed as a generalization of this measure to multi-allelic markers. Finally, a measure based on homozygosity, Hvol, is suggested as a generalization of R^2. To evaluate these measures, we introduce a sequential importance sampling algorithm. We illustrate their performance on simulated and real data.
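For reference, the conventional measures that the volume measures are compared against can be computed directly from haplotype and allele frequencies at two biallelic markers. The sketch below shows the standard D, D' (reported here as an absolute value), and r^2; it does not implement Dvol, Mvol, or Hvol, and the function name and argument names are hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Standard linkage disequilibrium measures for two biallelic markers.

    p_ab: frequency of the haplotype carrying allele A and allele B.
    p_a, p_b: frequencies of allele A and allele B.
    Returns (D, |D'|, r^2).
    """
    d = p_ab - p_a * p_b
    # Largest |D| attainable given the allele frequencies and the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2

complete = ld_measures(p_ab=0.5, p_a=0.5, p_b=0.5)      # A always travels with B
independent = ld_measures(p_ab=0.25, p_a=0.5, p_b=0.5)  # no association
```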

Population-based linkage disequilibrium genome screens represent one of the most recent approaches for the localization of genes responsible for complex diseases. One open problem in this context is the definition of an appropriate significance threshold that takes into account the multiple comparison problem. We explore the conceptual and practical implications of the multiple testing procedure known as False Discovery Rate (FDR). We argue that controlling the FDR better represents the interests of researchers in this area than more traditional approaches. We then explore the applicability of the Benjamini-Hochberg (BH) FDR controlling procedure in the specific context of association mapping from case-control data. We analyze the nature of the dependency between the test statistics with analytic work and simulations, and we conclude that the BH rule effectively controls FDR in our context of interest. The dependency between test statistics translates into a decrease of power, which highlights the necessity of developing resampling-based rules to control FDR.
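The Benjamini-Hochberg step-up rule itself is short enough to sketch. The version below is the textbook procedure applied to a plain list of p-values; the function name, the target level `q`, and the example p-values are illustrative assumptions and are not taken from the paper.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return a list of booleans marking the hypotheses rejected by BH.

    Step-up rule: find the largest rank k such that the k-th smallest
    p-value is at most k * q / m, then reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if pvalues[index] <= rank * q / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected

# Hypothetical p-values from ten marker tests, in arbitrary order.
pvals = [0.205, 0.001, 0.074, 0.008, 0.039, 0.216, 0.041, 0.042, 0.06, 0.212]
calls = benjamini_hochberg(pvals, q=0.05)
```

Because the rule is step-up, a p-value that fails its own threshold is still rejected when a larger p-value passes a later one; for example, `[0.03, 0.04, 0.045, 0.049]` is rejected in full at `q=0.05`.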

We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distributions on the parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given data set. Application of the model to simulated data gives encouraging results. In a real data set, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.
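A forward recurrence of the kind mentioned above can be sketched for a stripped-down dictionary model. This is an illustration under strong simplifying assumptions, not the paper's model: segments are treated as position-free words over allele symbols, and alternate spellings, missing data, unphased genotypes, and priors are all ignored. The function name and the toy dictionary are hypothetical.

```python
def sequence_likelihood(sequence, dictionary):
    """Sum, over all ways of cutting `sequence` into dictionary segments,
    of the product of the segment probabilities.

    sequence: string of allele symbols, e.g. "0110".
    dictionary: dict mapping segment string -> probability.
    """
    n = len(sequence)
    # forward[i] = total probability weight of all segmentations of sequence[:i]
    forward = [0.0] * (n + 1)
    forward[0] = 1.0
    for end in range(1, n + 1):
        for segment, prob in dictionary.items():
            start = end - len(segment)
            if start >= 0 and sequence[start:end] == segment:
                forward[end] += prob * forward[start]
    return forward[n]

# Toy dictionary over two alleles; "01" can be read as one word or two.
toy = {"0": 0.5, "1": 0.3, "01": 0.2}
lik = sequence_likelihood("01", toy)
```

The recurrence visits each end position once and each dictionary entry once per position, so its cost grows linearly in the sequence length for a fixed dictionary.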

We consider array experiments that compare expression levels of a large number of genes in two cell lines with few repetitions and with no subject effect. We develop a statistical model that illustrates under which assumptions thresholding is optimal in the analysis of such microarray data. The results of our model explain the success of the empirical rule of 2-fold change. We illustrate a thresholding procedure that is adaptive to the noise level of the experiment, the number of genes analyzed, and the number of genes that truly change expression level. This procedure, in a world of perfect knowledge of the noise distribution, would allow reconstruction of a sparse signal, minimizing the false discovery rate. Given the amount of information actually available, the thresholding rule described provides a reasonable estimator for the change in expression of any gene in the two compared cell lines.