Open Access Publications from the University of California

Recent Work

A center at the University of California San Francisco campus providing data analytic and statistical support to investigators engaged in molecular biologic, genomic and genetics research projects.

Identification of yeast transcriptional regulation networks using


The recent availability of whole-genome scale data sets that investigate complementary and diverse aspects of transcriptional regulation has spawned an increased need for new and effective computational approaches to analyze and integrate these large-scale assays. Here, we propose a novel algorithm, based on random forest methodology, to relate gene expression (as derived from expression microarrays) to sequence features residing in gene promoters (as derived from DNA motif data) and transcription factor binding to gene promoters (as derived from tiling microarrays). We extend the random forest approach to model a multivariate response as represented, for example, by time-course gene expression measures. An analysis of the multivariate random forest output reveals complex regulatory networks, which consist of cohesive, condition-dependent regulatory cliques. Each regulatory clique features homogeneous gene expression profiles and common motifs or synergistic motif groups. We apply our method to several yeast physiological processes: cell cycle, sporulation, and various stress conditions. Our technique displays excellent performance with regard to identifying known regulatory motifs, including high-order interactions. In addition, we present evidence of the existence of an alternative MCB-binding pathway, which we confirm using data from two independent cell cycle studies and two other physiological processes. Finally, we have uncovered elaborate transcription regulation refinement mechanisms involving PAC and mRRPE motifs that govern essential rRNA processing. These include intriguing instances of differing motif dosages and differing combinatorial motif control that promote regulatory specificity in rRNA metabolism under differing physiological processes.

A Novel Topology for Representing Protein Folds


Various topologies for representing three dimensional protein structures have been advanced for purposes ranging from prediction of folding rates to ab initio structure prediction. Examples include relative contact order, Delaunay tessellations, and backbone torsion angle distributions. Here we introduce a new topology based on a novel means for operationalizing three dimensional proximities with respect to the underlying chain. The measure involves first interpreting a rank-based representation of the nearest neighbors of each residue as a permutation, then determining how perturbed this permutation is relative to an unfolded chain. We show that the resultant topology provides improved association with folding and unfolding rates determined for a set of two-state proteins under standardized conditions. Furthermore, unlike existing topologies, the proposed geometry exhibits fine scale structure with respect to sequence position along the chain, potentially providing insights into folding initiation and/or nucleation sites.
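To make the permutation idea concrete, here is a minimal sketch under stated assumptions; it is not the measure defined in the paper. It assumes that, for an unfolded chain, a residue's neighbors are ordered by sequence separation, and it scores perturbation as the number of neighbor pairs whose order by three-dimensional distance disagrees with that order (a Kendall-tau-style inversion count). The function names and the toy coordinates are hypothetical.

```python
import math

def residue_perturbation(coords, i):
    """Count neighbor pairs of residue i whose 3D-distance order
    contradicts their sequence-separation order (ties are ignored).
    Assumption: an unfolded chain ranks neighbors by |i - j|."""
    others = [j for j in range(len(coords)) if j != i]
    inversions = 0
    for a in range(len(others)):
        for b in range(a + 1, len(others)):
            j, k = others[a], others[b]
            sep_diff = abs(i - j) - abs(i - k)
            dist_diff = math.dist(coords[i], coords[j]) - math.dist(coords[i], coords[k])
            # Opposite signs mean the two orderings disagree for this pair.
            if sep_diff * dist_diff < 0:
                inversions += 1
    return inversions

def chain_profile(coords):
    """Per-residue perturbation scores along the chain."""
    return [residue_perturbation(coords, i) for i in range(len(coords))]

# Toy coordinates (hypothetical): a fully extended chain and a hairpin.
extended = [(float(t), 0.0, 0.0) for t in range(8)]
hairpin = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0),
           (3.0, 1.0, 0.0), (2.0, 1.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
```

With these toy inputs, `chain_profile(extended)` is zero at every position, while `chain_profile(hairpin)` is positive at the chain ends, where sequence-distant residues are brought into spatial contact. A per-residue profile of this kind is one way a topology could show structure along the chain.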

Selective Genotyping and Phenotyping Strategies in a Complex Trait Context


Selective genotyping and phenotyping strategies can reduce the cost of QTL (quantitative trait loci) experiments. We analyze selective genotyping and phenotyping strategies in the context of multi-locus models and non-normal phenotypes. Our approach is based on calculations of the expected information of the experiment under different strategies. Our central conclusions are the following. (1) Selective genotyping is effective for detecting linked and epistatic QTL as long as no locus has a large effect. When one or more loci have large effects, the effectiveness of selective genotyping is unpredictable – it may be heightened or diminished relative to the small effects case. (2) Selective phenotyping efficiency decreases as the number of unlinked loci used for selection increases, and approaches random selection in the limit. However, when phenotyping is expensive, and a small fraction can be phenotyped, the efficiency of selective phenotyping is high compared to random sampling, even when over 10 loci are used for selection. (3) For time-to-event phenotypes such as lifetimes, which have a long right tail, right-tail selective genotyping is more effective than two-tail selective genotyping. For heavy-tailed phenotype distributions, such as the Cauchy distribution, the most extreme phenotypic individuals are not the most informative. (4) When the phenotype distribution is exponential, and a right-tail selective genotyping strategy is used, the optimal selection fraction (proportion genotyped) is less than 20% or 100% depending on genotyping cost. (5) For time-to-event phenotypes where followup cost increases with the lifetime of the individual, we derive the optimal followup time that maximizes the information content of the experiment relative to its cost.
For example, when the cost of following up an individual for the average lifetime in the population is approximately equal to the fixed costs of genotyping and breeding, the optimal strategy is to follow up approximately 70% of the population.

Re-Cracking the Nucleosome Positioning Code


Nucleosomes, the fundamental repeating subunits of all eukaryotic chromatin, are responsible for packaging DNA into chromosomes inside the cell nucleus and controlling gene expression. While it has been well established that nucleosomes exhibit higher affinity for select DNA sequences, until recently it was unclear whether such preferences exerted a significant, genome-wide effect on nucleosome positioning in vivo. This question was recently, and seemingly, resolved in the affirmative: a wide-ranging series of experimental and computational analyses provided extensive evidence that the instructions for wrapping DNA around nucleosomes are contained in the DNA itself. This subsequently labelled second genetic code was based on data-driven, structural, and biophysical considerations. It was subjected to an extensive suite of validation procedures, with one conclusion being that intrinsic, genome-encoded, nucleosome organization explains ~50% of in vivo nucleosome positioning. Here, we revisit both the nature of the underlying sequence preferences, and the performance of the proposed code. A series of new analyses, employing spectral envelope (Fourier transform) methods for assessing key sequence periodicities, classification techniques for evaluating predictive performance, and discriminatory motif finding methods for devising alternate models, are applied. The findings from the respective analyses indicate that signature dinucleotide periodicities are absent from the bulk of the high affinity nucleosome-bound sequences, and that the predictive performance of the code is modest. We conclude that further exploration of the role of sequence-based preferences in genome-wide nucleosome positioning is warranted.
This work offers a methodologic counterpart to a recent, high resolution determination of nucleosome positioning that also questions the accuracy of the proposed code and, further, provides illustration of techniques useful in assessing sequence periodicity and predictive performance.
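As a rough illustration of how a sequence periodicity can be assessed with Fourier methods, the sketch below computes an ordinary periodogram ordinate for a dinucleotide indicator sequence. This is a simpler stand-in, not the spectral envelope method the abstract names (the spectral envelope additionally optimizes the numeric scaling of the symbols). The dinucleotide set AA/TT/TA, the 10 bp test period, the synthetic sequence, and all function names are assumptions chosen for illustration.

```python
import cmath
import math

def dinucleotide_indicator(seq, dinucs=("AA", "TT", "TA")):
    """Return 1.0 at positions where a dinucleotide from `dinucs` starts, else 0.0."""
    seq = seq.upper()
    return [1.0 if seq[i:i + 2] in dinucs else 0.0 for i in range(len(seq) - 1)]

def periodogram_power(x, period):
    """Periodogram ordinate of the mean-centered series x at frequency 1/period."""
    n = len(x)
    mean = sum(x) / n
    freq = 1.0 / period
    total = sum((v - mean) * cmath.exp(-2j * math.pi * freq * t)
                for t, v in enumerate(x))
    return abs(total) ** 2 / n

# Synthetic sequence (hypothetical) with an AA step exactly every 10 bp.
synthetic = "AACGCGCGCG" * 15
x = dinucleotide_indicator(synthetic)
power_10 = periodogram_power(x, 10.0)
power_7 = periodogram_power(x, 7.0)
```

On this synthetic input `power_10` is far larger than `power_7`, which is the kind of peak a periodic signal produces. A sequence set lacking the periodicity would show no comparable peak at the period of interest.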

On E-values for Tandem MS Scoring Schemes


In a recent article in this journal, Khatun, Hamlett, and Giddings (2008) (KHG) advance a new scoring scheme for use in conjunction with tandem mass spectrometry (MS/MS) based peptide identification. As they note, such identifications are fundamental to much proteomics research but, due to MS/MS data complexity and the scale of attendant database searches, their accuracy is limited. The scoring technique they propose, which employs a hidden Markov model (HMM) over a set of states that represent key features of MS/MS data, is convincingly motivated and exhibits good performance. The purpose of this brief note is to critique the method chosen for calibrating the HMM scores, rather than the genesis of the scores themselves.

A multi-array multi-SNP genotyping algorithm for Affymetrix SNP microarrays


Motivation: Modern strategies for mapping disease loci require efficient genotyping of a large number of known polymorphic sites in the genome. The sensitive and high-throughput nature of hybridization-based DNA microarray technology provides an ideal platform for such an application by interrogating up to hundreds of thousands of single nucleotide polymorphisms (SNPs) in a single assay. Similar to the development of expression arrays, these genotyping arrays pose many data analytic challenges that are often platform specific. Affymetrix SNP arrays, for example, use multiple sets of short oligonucleotide probes for each known SNP, and require effective statistical methods to combine these probe intensities in order to generate reliable and accurate genotype calls.

Results: We developed an integrated multi-SNP, multi-array genotype calling algorithm for Affymetrix SNP arrays, MAMS, that combines single-array, multi-SNP (SAMS) and multi-array, single-SNP (MASS) calls to improve the accuracy of genotype calls, without the need for training data or computation-intensive normalization procedures as in other multi-array methods. The algorithm uses resampling techniques and model-based clustering to derive single-array based genotype calls, which are subsequently refined by competitive genotype calls based on MASS clustering. The resampling scheme caps computation for single-array analysis and hence is readily scalable, which is important in view of expanding numbers of SNPs per array. The MASS update is designed to improve calls for atypical SNPs, harboring allele-imbalanced binding affinities, that are difficult to genotype without information from other arrays. Using a publicly available data set of HapMap samples from Affymetrix, and independent calls by alternative genotyping methods from the HapMap project, we show that our approach performs competitively with existing methods.

Validation in Genomics: CpG Island Methylation Revisited


In a recent article in PLoS Genetics, Bock et al. (2006) undertake an extensive computational epigenetics analysis of the ability of DNA sequence-derived features, capturing attributes such as tetramer frequencies, repeats and predicted structure, to predict the methylation status of CpG islands. Their suite of analyses appears highly rigorous with regard to accompanying validation procedures, employing stringent Bonferroni corrections, stratified cross-validation, and follow-up experimental verification. Here, however, we showcase concerns with the validation steps, in part ascribable to the genome scale of the investigation, that serve as a cautionary note and indicate the heightened need for careful selection of analytic and companion validation methods. A series of new analyses of the same CpG island methylation data helps illustrate these issues, not just for this particular study, but also for analogous investigations involving high-dimensional predictors with complex between-feature dependencies.

R/qtlDesign: Inbred Line Cross Experimental Design


An investigator planning a QTL (quantitative trait locus) experiment has to choose which strains to cross, the type of cross, genotyping strategies, and the number of progeny to raise and phenotype. To help make such choices, we have developed an interactive program for power and sample size calculations for QTL experiments, R/qtlDesign. Our software includes support for selective genotyping strategies, variable marker spacing, and tools to optimize information content subject to cost constraints, for backcross, intercross, and recombinant inbred lines from two parental strains. We review the impact of experimental design choices on the variance attributable to a segregating locus, the residual error variance, and the effective sample size. We give examples of software usage in real-life settings. The software is available at

Chess, Chance and Conspiracy


Chess and chance are seemingly strange bedfellows. Luck and/or randomness have no apparent role in move selection when the game is played at the highest levels. However, when competition is at the ultimate level, that of the World Chess Championship (WCC), chess and conspiracy are not strange bedfellows, there being a long and colorful history of accusations leveled between participants. One such accusation, frequently repeated, was that all the games in the 1985 WCC (Karpov vs Kasparov) were fixed and pre-arranged move-by-move. That this claim was advanced by a former World Champion, Bobby Fischer, argues that it ought to be investigated. That the only published, concrete basis for this claim consists of an observed run of particular moves allows this investigation to be performed using probabilistic and statistical methods. In particular, we employ imbedded finite Markov chains to evaluate run statistic distributions. Further, we demonstrate how both chess computers and game databases can be brought to bear on the problem.
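The finite Markov chain imbedding idea can be sketched for the simplest run statistic: the probability that n independent trials, each a success with probability p, contain at least one run of k consecutive successes. The chain's states track the length of the current trailing run, with an absorbing state once the run reaches k. The independence assumption, the function name, and the example numbers are illustrative only; the model of chess moves used in the paper is not given in the abstract.

```python
def prob_run_at_least(n, k, p):
    """P(at least one run of k consecutive successes in n Bernoulli(p) trials),
    computed by stepping a Markov chain over trailing-run lengths 0..k-1
    plus one absorbing state for 'run of length k already seen'."""
    if k <= 0:
        return 1.0
    state = [0.0] * k      # state[i]: trailing run has length i, no k-run yet
    state[0] = 1.0
    absorbed = 0.0         # probability mass that has already seen a k-run
    for _ in range(n):
        new_state = [0.0] * k
        for i, mass in enumerate(state):
            new_state[0] += mass * (1.0 - p)   # a failure resets the run
            if i + 1 == k:
                absorbed += mass * p           # a success completes the run
            else:
                new_state[i + 1] += mass * p   # a success extends the run
        state = new_state
    return absorbed

print(prob_run_at_least(5, 2, 0.5))  # 0.59375, i.e. 19 of the 32 sequences
```

The same recursion handles long sequences at a cost proportional to n times k, which is why the imbedding approach is convenient for judging whether an observed run is surprising under a chance model.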