This paper surveys and extends models and algorithms for identifying binding sites in non-coding regions of DNA. These sites control the transcription of genes into messenger RNA in preparation for translation into proteins. We summarize the underlying biology, review three different models for binding site identification, and present a unified model that borrows from the previous models and integrates their main features. We then describe maximum likelihood and maximum a posteriori algorithms for fitting the unified model to data. Finally, we conclude with a prospectus of future data analyses and theoretical research.

# Your search: "author:"Lange, Kenneth""

## filters applied

## Type of Work

Article (24) Book (0) Theses (5) Multimedia (0)

## Peer Review

Peer-reviewed only (25)

## Supplemental Material

Video (0) Audio (0) Images (0) Zip (0) Other files (0)

## Publication Year

## Campus

UC Berkeley (2) UC Davis (0) UC Irvine (0) UCLA (29) UC Merced (0) UC Riverside (0) UC San Diego (0) UCSF (0) UC Santa Barbara (0) UC Santa Cruz (0) UC Office of the President (8) Lawrence Berkeley National Laboratory (1) UC Agriculture & Natural Resources (0)

## Department

Department of Statistics, UCLA (12) Biostatistics (11) Research Grants Program Office (RGPO) (8) Department of Molecular and Medical Pharmacology, UCLA (1) School of Public Health (1)

## Journal

## Discipline

Physical Sciences and Mathematics (4) Life Sciences (1) Medicine and Health Sciences (1)

## Reuse License

BY-NC-SA - Attribution; NonCommercial use; Derivatives use same license (2)

## Scholarly Works (29 results)

Affymetrix’s SNP (single nucleotide polymorphism) genotyping chips have increased the scope and decreased the cost of gene mapping studies. Because each SNP is queried by mul- tiple DNA probes, the chips present interesting challenges in genotype calling. Traditional clustering methods distinguish the three genotypes of a SNP fairly well given a large enough sample of unrelated individuals or a training sample of known genotypes. The present pa- per describes our attempt to improve genotype calling by constructing Gaussian penetrance models with empirically derived priors. The priors stabilize parameter estimation and borrow information collectively gathered on tens of thousands of SNPs. When data from related family members are available, Gaussian penetrance models capture the correlations in signals between relatives. With these advantages in mind, we apply the models to Affymetrix probe intensity data on 10,000 SNPs gathered on 63 genotyped individuals spread over eight pedigrees. We integrate the genotype calling model with pedigree analysis and examine a sequence of sym- metry hypotheses involving the correlated probe signals. The symmetry hypotheses raise novel mathematical issues of parameterization. Using the BIC criterion, we select the best combi- nation of symmetry assumptions. Compared to the genotype calling results obtained from Affymetrix’s software, we are able to reduce the number of no-calls substantially and quan- tify the level of confidence in all calls. Once pedigree analysis software can accommodate soft penetrances, we can expect to see more reliable association and linkage studies with less wasted genotyping data.

Quadratic ma jorizations for real-valued functions of a real variable are analyzed, and the concept of sharp ma jorization is introduced and studied. Applications to logistic and robust loss functions are discussed.

We propose a dictionary model for haplotypes. According to the model, a haplotype is con- structed by randomly concatenating haplotype segments from a given dictionary of segments. A haplotype block is defined as a set of haplotype segments that begin and end with the same pair of markers. In this framework, haplotype blocks can overlap, and the model provides a setting for testing the accuracy of simpler models invoking only nonoverlapping blocks. Each haplotype segment in a dictionary has an assigned probability and alternate spellings that ac- count for genotyping errors and mutation. The model also allows for missing data, unphased genotypes, and prior distribution of parameters. Likelihood evaluations rely on forward and backward recurrences similar to the ones encountered in hidden Markov models. Parameter estimation is carried out with an EM algorithm. The search for the optimal dictionary is a particularly difficult because of the variable dimension of the model space. We define a mini- mum description length criteria to evaluate each dictionary and use a combination of greedy search and careful initialization to select a best dictionary for a given data set. Application of the model to simulated data gives encouraging results. In a real data set, we are able to reconstruct a parsimonious dictionary that captures patterns of linkage disequilibrium well.

With the advent of massively parallel high-throughput sequencing, geneticists have the technology to answer many problems. What we lack are analytical tools. As the amount of data from these sequencers continues to overwhelm much of the current analytical tools, we must come up with more efficient methods for analysis. One potentially useful tool is the MM, majorize-minimize or minorize-maximize, algorithm.

The MM algorithm is an optimization method suitable for high-dimensional problems. It can avoid large matrix inversions, linearize problems, and separate parameters. Additionally it deals with constraints gracefully and can turn a non-differentiable problem into a smooth one. These benefits come at the cost of iteration.

In this thesis we apply the MM algorithm in the optimization of three problems. The first problem we tackle is an extension of random graph theory by Erdos. We extend the model by relaxing two of the three underlying assumptions, namely any number of edges can form between two nodes and edges form with a Poisson probability with mean dependent on the two nodes. This is aptly named a random multigraph.

The next problem extends random multigraphs to include clustering. As before, any number of edges can still form between two nodes. The difference is now the number of edges formed between two nodes is Poisson distributed with mean dependent on the two nodes along with their clusters.

For our last problem we place individuals onto the map using their genetic information. Using a binomial model with a nearest neighbor penalty, we estimate allele frequency surfaces for a region. With these allele frequency surfaces, we calculate the posterior probability that an individual comes from a location by a simple application of Bayes' rule and place him at his most probable location. Furthermore, with an additional model we estimate admixture coefficients of individuals across a pixellated landscape.

Each of these problems contain an underlying optimization problem which is solved using the MM algorithm. To demonstrate the utility of the models we applied them to various genetic datasets including POPRES, OMIM, gene expression, protein-protein interactions, and gene-gene interactions. Each example yielded interesting results in reasonable time.

The advent of the Big Data era has spawned intense interest in scalable mathematical optimization methods. Traditional approaches such as Newton’s method fall apart whenever the features outnumber the examples in a data set. Consequently, researchers have intensely developed first-order methods that rely only on gradients and subgradients of a cost function.

In this dissertation we focus on projected gradient methods for large-scale con-

strained optimization. We develop a particular case of a proximal gradient method

called the proximal distance algorithm. Proximal distance algorithms combine the

classical penalty method of constrained minimization with distance majorization. To

optimize the loss function $f(x)$ over a constraint set $C$, the proximal distance principle mandates minimizing the penalized loss $f(x) + \rho \mathrm{dist} \; (x,C)^2$ and following the solution $x_{\rho}$ to its limit as $\rho \to \infty$. At each iteration the squared Euclidean distance $\mathrm{dist} \; (x, C)^2$ is majorized by $\| x − \Pi_{C}(x_k) \|_2^2$, where $\Pi_{C}(x_k)$ denotes the projection of the current iterate $x_k$ onto $C$. The minimum of the surrogate function $f(x) + \rho \| x − \Pi_{C} (x_k) \|_2^2$ is given by the proximal map $\mathrm{prox}_{ρ^{−1}} \; f [ \Pi_{C} (x_k )]$. The next iterate $x_{k+1}$ automatically decreases the original penalized loss for fixed $\rho$. Since many explicit projections and proximal maps are known in analytic or computable form, the proximal distance algorithm provides a scalable computational framework for a variety of constraints.

For the particular case of sparse linear regression, we implement a projected gradient algorithm known as iterative hard thresholding for a particular large-scale genomics analysis known as a genome-wide association study. A genome-wide association study (GWAS) correlates marker variation with trait variation in a sample of individuals. Each study subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here we assume that subjects are unrelated and collected at random and that trait values are normally distributed or transformed to normality. Over the past decade, researchers have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies present unique computational challenges. Penalized regression with LASSO or MCP penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage desktop workstations in GWAS analysis and to eschew expensive supercomputing resources. We evaluate IHT performance on both simulated and real GWAS data and conclude that it reduces false positive and false negative rates while remaining competitive in computational time with penalized regression.