Shape-constrained estimation for modern statistical problems
 Soloff, Jake
Advisor(s): Guntuboyina, Adityanand; Jordan, Michael I.
Abstract
Shape constraints encode a relatively weak form of prior information specifying the direction of certain relationships in an unknown signal. Classical examples include estimation of a convex function or a monotone density. Shape constraints are often strong enough to dramatically reduce statistical complexity while still yielding flexible, nonparametric estimators. This thesis brings shape constraints to bear on several recent research areas in statistics: distribution-free inference, high-dimensional covariance estimation, empirical Bayes, and multiple hypothesis testing.
Chapter 2 discusses my joint work with Professor Aditya Guntuboyina and Professor Jim Pitman on distribution-free properties of isotonic regression. In this work, we establish a distributional result for the components of the isotonic least squares estimator using its characterization as the derivative of the greatest convex minorant of a random walk. Provided the walk has exchangeable increments, we prove that the slopes of the greatest convex minorant are distributed as order statistics of the running averages. This result yields an exact formula, valid for every exchangeable error distribution, for the squared error risk of least squares in homoscedastic isotonic regression when the true sequence is constant.
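The characterization underlying this chapter can be checked numerically: the isotonic least squares fit coincides, pathwise, with the slopes of the greatest convex minorant of the cumulative-sum walk. Below is a minimal sketch of that identity; the function names and the pool-adjacent-violators implementation are illustrative, not the thesis's code.

```python
import numpy as np

def pava(y):
    """Isotonic least squares fit via pool-adjacent-violators."""
    vals, wts = [], []
    for v in map(float, y):
        vals.append(v); wts.append(1.0)
        # merge adjacent blocks while monotonicity is violated
        while len(vals) > 1 and vals[-2] > vals[-1]:
            w = wts[-2] + wts[-1]
            vals[-2:] = [(wts[-2] * vals[-2] + wts[-1] * vals[-1]) / w]
            wts[-2:] = [w]
    out = []
    for v, w in zip(vals, wts):
        out.extend([v] * int(round(w)))
    return np.array(out)

def gcm_slopes(y):
    """Slopes of the greatest convex minorant of the walk S_k = y_1 + ... + y_k."""
    n = len(y)
    S = np.concatenate([[0.0], np.cumsum(y)])
    hull = [0]  # vertices of the lower convex hull of the points (k, S_k)
    for i in range(1, n + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # drop b if the chord from a to i passes below (or through) b
            if (S[b] - S[a]) * (i - b) >= (S[i] - S[b]) * (b - a):
                hull.pop()
            else:
                break
        hull.append(i)
    fit = np.empty(n)
    for a, b in zip(hull[:-1], hull[1:]):
        fit[a:b] = (S[b] - S[a]) / (b - a)  # constant slope on each hull segment
    return fit

rng = np.random.default_rng(0)
y = rng.normal(size=100)  # constant true sequence with exchangeable errors
assert np.allclose(pava(y), gcm_slopes(y))
```

The distributional statement in the chapter is about equality in law, not pathwise equality; the sketch only verifies the deterministic GCM characterization that the proof starts from.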
Chapter 3 discusses my joint work with Professor Aditya Guntuboyina and Professor Michael I. Jordan on sign-constrained precision matrix estimation. We investigate the problem of high-dimensional covariance estimation under the constraint that the partial correlations are nonnegative. The sign constraints dramatically simplify estimation: the Gaussian maximum likelihood estimator is well defined with only two observations regardless of the number of variables. We analyze its performance in the setting where the dimension may be much larger than the sample size. We establish that the estimator is both high-dimensionally consistent and minimax optimal in the symmetrized Stein loss. We also prove a negative result showing that the sign constraints can introduce substantial bias for estimating the top eigenvalue of the covariance matrix.
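The constraint in this chapter has a simple matrix form: partial correlations are nonnegative exactly when the off-diagonal entries of the precision matrix are nonpositive, via rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj). A small numerical check of this equivalence (the example matrix is illustrative):

```python
import numpy as np

# A positive definite precision matrix with nonpositive off-diagonal entries
Omega = np.array([[ 2.0, -1.0,  0.0],
                  [-1.0,  2.0, -1.0],
                  [ 0.0, -1.0,  2.0]])
assert np.all(np.linalg.eigvalsh(Omega) > 0)  # positive definite

# Partial correlations: rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
assert np.all(partial_corr >= 0)  # nonnegative, as the sign constraint requires
```

Here variables 1 and 3 have zero partial correlation (Omega_13 = 0) while each adjacent pair has partial correlation 1/2; flipping the sign of any off-diagonal entry of Omega would violate the constraint.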
Chapter 4 discusses my joint work with Professor Aditya Guntuboyina and Professor Bodhisattva Sen on nonparametric empirical Bayes with multivariate, heteroscedastic Gaussian errors. Multivariate, heteroscedastic errors complicate statistical inference in many large-scale denoising problems. Empirical Bayes is attractive in such settings, but standard parametric approaches rest on assumptions about the form of the prior distribution which can be hard to justify and which introduce unnecessary tuning parameters. We extend the nonparametric maximum likelihood estimator (NPMLE) for Gaussian location mixture densities to allow for multivariate, heteroscedastic errors. NPMLEs estimate an arbitrary prior by solving an infinite-dimensional, convex optimization problem; we show that this convex optimization problem can be tractably approximated by a finite-dimensional version. We introduce a dual mixture density whose modes contain the atoms of every NPMLE, and we leverage the dual both to establish nonuniqueness in multivariate settings and to construct explicit bounds on the support of the NPMLE.
The empirical Bayes posterior means based on an NPMLE have low regret, meaning they closely target the oracle posterior means one would compute with the true prior in hand. We prove an oracle inequality implying that the empirical Bayes estimator performs at nearly the optimal level (up to logarithmic factors) for denoising without prior knowledge. We provide finite-sample bounds on the average Hellinger accuracy of an NPMLE for estimating the marginal densities of the observations. We also demonstrate the adaptive and nearly optimal properties of NPMLEs for deconvolution. We apply the method to two astronomy datasets, constructing a fully data-driven color-magnitude diagram of 1.4 million stars in the Milky Way and investigating the distribution of chemical abundance ratios for 27,000 stars in the red clump.
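A finite-dimensional approximation of the kind discussed above can be sketched by fixing a grid of candidate atoms and optimizing only the mixing weights. The one-dimensional, heteroscedastic sketch below uses EM on the grid weights rather than a convex solver; the grid, EM updates, and all names are illustrative choices, not the thesis's algorithm.

```python
import numpy as np

def npmle_grid_em(x, sigma, grid, n_iter=300):
    """Fixed-grid approximation to the Gaussian location-mixture NPMLE.

    Maximizes sum_i log sum_j w_j * N(x_i; grid_j, sigma_i^2) over mixing
    weights w on the simplex, allowing a different noise level sigma_i
    per observation (heteroscedastic errors)."""
    # Likelihood matrix L[i, j] = N(x_i; grid[j], sigma_i^2)
    L = np.exp(-0.5 * ((x[:, None] - grid[None, :]) / sigma[:, None]) ** 2)
    L /= np.sqrt(2 * np.pi) * sigma[:, None]
    w = np.full(grid.size, 1.0 / grid.size)
    for _ in range(n_iter):
        P = L * w                          # E-step: posterior over grid atoms
        P /= P.sum(axis=1, keepdims=True)
        w = P.mean(axis=0)                 # M-step: reweight the atoms
    # Empirical Bayes posterior means under the estimated prior
    post_mean = (L * w * grid).sum(axis=1) / (L * w).sum(axis=1)
    return w, post_mean

rng = np.random.default_rng(1)
theta = rng.choice([-2.0, 2.0], size=500)    # true prior: two atoms
sigma = rng.uniform(0.5, 1.0, size=500)      # heteroscedastic noise levels
x = theta + sigma * rng.normal(size=500)
w, post_mean = npmle_grid_em(x, sigma, grid=np.linspace(-4, 4, 81))
```

On data like this the estimated weights concentrate near the two true atoms, and the plug-in posterior means shrink each observation toward them, illustrating the low-regret denoising behavior described above.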
Chapter 5 discusses my joint work with Daniel Xiang and Professor William Fithian on finite-sample control of the maximum local false discovery rate in multiple hypothesis testing. Despite the popularity of the false discovery rate (FDR) as an error control metric for large-scale multiple testing, its close Bayesian counterpart, the local false discovery rate (lfdr), defined as the posterior probability that a particular null hypothesis is true, is a more directly relevant standard for justifying and interpreting individual rejections. However, the lfdr is difficult to work with in small samples, as the prior distribution is typically unknown. We propose a simple multiple testing procedure and prove that it controls the expectation of the maximum lfdr across all rejections; equivalently, it controls the probability that the rejection with the largest p-value is a false discovery. Our method operates without knowledge of the prior, assuming only that the p-value density is uniform under the null and decreasing under the alternative. We also show that our method asymptotically implements the oracle Bayes procedure for a weighted classification risk, optimally trading off between false positives and false negatives. We derive the limiting distribution of the attained maximum lfdr over the rejections, and the limiting empirical Bayes regret relative to the oracle procedure.
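The error metric in this chapter can be made concrete in a two-groups model where the lfdr is explicit. The sketch below uses a Beta(a, 1) alternative with a < 1, so the p-value density is decreasing and the oracle lfdr is increasing in p; it illustrates the metric only, not the proposed procedure (all numbers are illustrative).

```python
import numpy as np

pi0, a = 0.8, 0.3   # null fraction; Beta(a, 1) alternative with a < 1
alpha = 0.2         # target bound on the maximum lfdr over rejections

def lfdr(p):
    """Oracle lfdr in the two-groups model: P(null | p) = pi0 * f0(p) / f(p),
    with f0(p) = 1 under the null and f1(p) = a * p**(a - 1) under the alternative."""
    f = pi0 + (1 - pi0) * a * p ** (a - 1)
    return pi0 / f

# Skewed p-values standing in for a mixed sample (illustrative, not model draws)
p = np.sort(np.random.default_rng(2).uniform(size=1000) ** 3)
rejected = p[lfdr(p) <= alpha]

# With a decreasing alternative density, lfdr is increasing in p, so the
# maximum lfdr over the rejections is attained at the largest rejected p-value
assert np.isclose(lfdr(rejected).max(), lfdr(rejected.max()))
```

This monotonicity is what links the two statements in the abstract: bounding the maximum lfdr over rejections is the same as controlling the lfdr of the rejection with the largest p-value.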