Scholarly Works (47 results)

Article

Genomewide Motif Identification Using a Dictionary Model

Department of Statistics Papers (2002)

This paper surveys and extends models and algorithms for identifying binding sites in non-coding regions of DNA. These sites control the transcription of genes into messenger RNA in preparation for translation into proteins. We summarize the underlying biology, review three different models for binding site identification, and present a unified model that borrows from the previous models and integrates their main features. We then describe maximum likelihood and maximum a posteriori algorithms for fitting the unified model to data. Finally, we conclude with a prospectus of future data analyses and theoretical research.

Article

Bayesian Gaussian Mixture Models for High Density Genotyping Arrays

Department of Statistics Papers (2005)

Affymetrix’s SNP (single nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene mapping studies. Because each SNP is queried by multiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of a SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. The present paper describes our attempt to improve genotype calling by constructing Gaussian penetrance models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, Gaussian penetrance models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype calling model with pedigree analysis and examine a sequence of symmetry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the BIC criterion, we select the best combination of symmetry assumptions. Compared to the genotype calling results obtained from Affymetrix’s software, we are able to reduce the number of no-calls substantially and quantify the level of confidence in all calls. Once pedigree analysis software can accommodate soft penetrances, we can expect to see more reliable association and linkage studies with less wasted genotyping data.

Article
Peer Reviewed

A Legacy of EM Algorithms

UCLA Previously Published Works (2022)

Nan Laird has an enormous and growing impact on computational statistics. Her paper with Dempster and Rubin on the expectation-maximisation (EM) algorithm is the second most cited paper in statistics. Her papers and book on longitudinal modelling are nearly as impressive. In this brief survey, we revisit the derivation of some of her most useful algorithms from the perspective of the minorisation-maximisation (MM) principle. The MM principle generalises the EM principle and frees it from the shackles of missing data and conditional expectations. Instead, the focus shifts to the construction of surrogate functions via standard mathematical inequalities. The MM principle can deliver a classical EM algorithm with less fuss or an entirely new algorithm with a faster rate of convergence. In any case, the MM principle enriches our understanding of the EM principle and suggests new algorithms of considerable potential in high-dimensional settings where standard algorithms such as Newton's method and Fisher scoring falter.

Article
Peer Reviewed

Enhancements to the ADMIXTURE Algorithm for Individual Ancestry Estimation

UCLA Previously Published Works (2011)

Background

The estimation of individual ancestry from genetic data has become essential to applied population genetics and genetic epidemiology. Software programs for calculating ancestry estimates have become essential tools in the geneticist's analytic arsenal.

Results

Here we describe four enhancements to ADMIXTURE, a high-performance tool for estimating individual ancestries and population allele frequencies from SNP (single nucleotide polymorphism) data. First, ADMIXTURE can be used to estimate the number of underlying populations through cross-validation. Second, individuals of known ancestry can be exploited in supervised learning to yield more precise ancestry estimates. Third, by penalizing small admixture coefficients for each individual, one can encourage model parsimony, often yielding more interpretable results for small datasets or datasets with large numbers of ancestral populations. Finally, by exploiting multiple processors, large datasets can be analyzed even more rapidly.

Conclusions

The enhancements we have described make ADMIXTURE a more accurate, efficient, and versatile tool for ancestry estimation.

Article
Peer Reviewed

A Look at the Generalized Heron Problem through the Lens of Majorization-Minimization

UCLA Previously Published Works (2014)

In a recent issue of this journal, Mordukhovich, Nam, and Salinas pose and solve an interesting non-differentiable generalization of the Heron problem in the framework of modern convex analysis. In the generalized Heron problem, one is given k + 1 closed convex sets in ℝ^d equipped with its Euclidean norm and asked to find the point in the last set such that the sum of the distances to the first k sets is minimal. In later work, the authors generalize the Heron problem even further, relax its convexity assumptions, study its theoretical properties, and pursue subgradient algorithms for solving the convex case. Here, we revisit the original problem solely from the numerical perspective. By exploiting the majorization-minimization (MM) principle of computational statistics and rudimentary techniques from differential calculus, we are able to construct a very fast algorithm for solving the Euclidean version of the generalized Heron problem.
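The problem statement in this abstract can be made concrete with a small sketch. The code below is an illustration only and is not taken from the paper: it assumes every set is a Euclidean ball (so projections have a closed form), smooths each distance as sqrt(dist² + eps) to avoid division by zero, and uses one standard MM construction, namely majorizing each distance by a quadratic centered at the current projection and then projecting the weighted average of those projections onto the constraint set. The names `project_onto_ball`, `heron_mm`, `targets`, and `constraint` are hypothetical.

```python
import math

def project_onto_ball(x, center, radius):
    """Euclidean projection of x onto the closed ball with the given center and radius."""
    diff = [xi - ci for xi, ci in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(x)
    scale = radius / norm
    return [ci + scale * d for ci, d in zip(center, diff)]

def heron_mm(targets, constraint, x0, iters=200, eps=1e-12):
    """MM sketch for: minimize sum_i dist(x, targets[i]) subject to x in constraint.

    Every set is a (center, radius) ball; this is an illustrative assumption.
    Each distance is smoothed as sqrt(dist^2 + eps) so the weights stay finite.
    """
    x = project_onto_ball(x0, *constraint)
    for _ in range(iters):
        weighted_sum = [0.0] * len(x)
        weight_total = 0.0
        for center, radius in targets:
            p = project_onto_ball(x, center, radius)
            d2 = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
            w = 1.0 / math.sqrt(d2 + eps)  # weight of the quadratic majorizer
            weighted_sum = [s + w * pi for s, pi in zip(weighted_sum, p)]
            weight_total += w
        # The surrogate is a weighted sum of squared distances, so its
        # constrained minimizer is the projection of the weighted average.
        average = [s / weight_total for s in weighted_sum]
        x = project_onto_ball(average, *constraint)
    return x
```

For example, with target balls of radius 1 centered at (-2, 0) and (2, 0) and a constraint ball of radius 1 centered at (0, 3), symmetry suggests the minimizer is the lowest point of the constraint ball, (0, 2), and the iteration started from (1, 3) should settle there.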

Article

Sharp Quadratic Majorization in One Dimension

Department of Statistics Papers (2006)

Quadratic majorizations for real-valued functions of a real variable are analyzed, and the concept of sharp majorization is introduced and studied. Applications to logistic and robust loss functions are discussed.
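As a hedged illustration of what a quadratic majorization looks like for a logistic-type loss, the sketch below works with f(x) = log(1 + e^x). It is not drawn from the paper. It compares the uniform curvature bound 1/4 (valid because f''(x) ≤ 1/4 everywhere) with a tighter, anchor-dependent curvature (2σ(y) − 1)/(2y), a well-known bound of Jaakkola-Jordan type that is assumed here to be representative of the "sharp" idea. All function names are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    """Derivative of softplus, computed without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def quad_majorizer(x, y, curvature):
    """Quadratic in x that touches softplus at the anchor y with matching slope."""
    return softplus(y) + sigmoid(y) * (x - y) + 0.5 * curvature * (x - y) ** 2

def uniform_curvature(y):
    """Global bound on the second derivative of softplus."""
    return 0.25

def tight_curvature(y):
    """Anchor-dependent curvature (assumed illustration of a sharper bound)."""
    if abs(y) < 1e-8:
        return 0.25  # limit of (2*sigmoid(y) - 1) / (2*y) as y -> 0
    return (2.0 * sigmoid(y) - 1.0) / (2.0 * y)
```

Both quadratics lie on or above f and agree with it at the anchor y; the tighter one never exceeds the uniform one, which is why it can give larger steps in an MM iteration.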

Article
Peer Reviewed

Sharp Quadratic Majorization in One Dimension

Department of Statistics Papers (2006)

Article

Reconstructing Ancestral Haplotypes with a Dictionary Model

Department of Statistics Papers (2005)

We propose a dictionary model for haplotypes. According to the model, a haplotype is constructed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that account for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distributions of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is particularly difficult because of the variable dimension of the model space. We define a minimum description length criterion to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given data set. Application of the model to simulated data gives encouraging results. In a real data set, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.

Thesis
Peer Reviewed

Simulation and Numerical Methods for Stochastic Processes

Stutz, Timothy Charles
Advisor(s): Lange, Kenneth L

UCLA Electronic Theses and Dissertations (2020)

Stochastic processes and randomness are vital features of mathematical modeling in biology. Unfortunately, analytical results are rarely available for even moderately complex stochastic processes, leaving simulation and numerical techniques as the main avenues of attack. We begin this work by exploring coupling bounds for birth-death processes, a fundamental type of stochastic process that describes how populations of individuals change over time. By forming a coupling between a truncated version of the process and the original unbounded version, we are able to compute both moments and transition probabilities for the true process within an acceptable error bound. Second, we present an algorithm design framework for Interacting Particle Systems (IPSs). These are complex stochastic processes with wide application to spatial phenomena across many scientific disciplines. Here we describe a method for efficiently sorting particles into classes based on their type and spatial configuration in a fashion that reduces the spatial simulation to that of a non-spatial well-mixed process, albeit with a more complicated update step. This also allows us to apply a large suite of well-developed stochastic simulation algorithms to IPSs with little additional coding cost. Third, we return to numerical methods, this time for multi-type branching processes applied to gene therapy. We derive a series of ordinary differential equations that govern the evolution of the probability generating function and provide a straightforward numerical inversion approach to obtain marginalized probability distributions for probabilistic quantities of interest. We provide examples of our techniques applied to lentiviral gene therapy and the associated risk of oncogenesis in transplanted hematopoietic stem cell lines. Finally, we conclude with a chapter on future directions, both related to the previous three chapters and to projects not previously addressed in this work.

Article
Peer Reviewed

A multivariate Bernoulli model to predict DNaseI hypersensitivity status from haplotype data

UCLA Previously Published Works (2015)

Motivation

Haplotype models enjoy a wide range of applications in population inference and disease gene discovery. The hidden Markov models traditionally used for haplotypes are hindered by the dubious assumption that dependencies occur only between consecutive pairs of variants. In this article, we apply the multivariate Bernoulli (MVB) distribution to model haplotype data. The MVB distribution relies on interactions among all sets of variants, thus allowing for the detection and exploitation of long-range and higher-order interactions. We discuss penalized estimation and present an efficient algorithm for fitting sparse versions of the MVB distribution to haplotype data. Finally, we showcase the benefits of the MVB model in predicting DNaseI hypersensitivity (DH) status, an epigenetic mark describing chromatin accessibility, from population-scale haplotype data.

Results

We fit the MVB model to real data from 59 individuals on whom both haplotypes and DH status in lymphoblastoid cell lines are publicly available. The model allows prediction of DH status from genetic data (prediction R² = 0.12 in cross-validations). Comparisons of prediction under the MVB model with prediction under linear regression (best linear unbiased prediction) and logistic regression demonstrate that the MVB model achieves about 10% higher prediction R² than the two competing methods in empirical data.

Availability and implementation

Software implementing the method described can be downloaded at http://bogdan.bioinformatics.ucla.edu/software/.

Contact

shihuwenbo@ucla.edu or pasaniuc@ucla.edu.
