Small Sample Inference
- Author(s): Gerlovina, Inna
- Advisor(s): Hubbard, Alan E
- et al.
Multiple comparisons and small sample sizes, common characteristics of many types of "Big Data" including those produced by genomic studies, present specific challenges that affect the reliability of inference. Use of multiple testing procedures necessitates estimation of very small tail probabilities and thus approximation of the distal tails of a test statistic's distribution. Results based on large deviation theory provide a formal condition, linking the number of tests and the sample size, that is necessary to guarantee error rate control at practical sample sizes; this condition, however, is rarely satisfied. Using methods based on Edgeworth expansions (relying especially on the work of Peter Hall), we explore what violating this condition might mean for actual error rates. Our investigation illustrates how far the actual error rates can be from the declared nominal levels, indicating poor error rate control.
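To make the point about very small tail probabilities concrete, the following is a minimal sketch, not part of the dissertation, assuming a Bonferroni correction and a standard normal reference distribution. The function name `bonferroni_cutoff` and the chosen numbers of tests are hypothetical and serve only as an illustration.

```python
from statistics import NormalDist


def bonferroni_cutoff(alpha: float, m: int) -> tuple[float, float]:
    """Return (per-test level, two-sided normal critical value) for m tests.

    Illustrative assumption: Bonferroni correction with a standard normal
    reference distribution; other multiple testing procedures differ.
    """
    per_test = alpha / m
    # Use the lower tail directly to avoid rounding error in 1 - per_test / 2.
    z = -NormalDist().inv_cdf(per_test / 2)
    return per_test, z


# As the number of tests grows, the tail area that must be approximated
# shrinks and the cutoff moves further into the tail.
for m in (1, 100, 10_000, 1_000_000):
    per_test, z = bonferroni_cutoff(0.05, m)
    print(f"m = {m:>9,d}  per-test level = {per_test:.1e}  cutoff = {z:.3f}")
```

With many tests, the cutoff lies in a region where a normal approximation to a small-sample test statistic is least reliable, which is the setting the abstract describes.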
Edgeworth expansions, which provide higher-order approximations to the sampling distribution, also offer a promising direction for data analysis that could ameliorate the situation. In Chapter 1, we derive generalized expansions for studentized mean-based statistics that cover ordinary and moderated one- and two-sample t-statistics as well as the Welch t-test. Fifth-order expansions are generated with software we developed, which can be used to produce expansions of arbitrary order. In Chapter 2, we propose a data analysis method based on these expansions that includes a tail diagnostic procedure and a small sample adjustment. Using the software algorithm developed for generating expansions, we also obtain results for unbiased moment estimation of a general order. Chapter 3 introduces a general linear combination (GLC) bootstrap, which is specifically tailored to small sample sizes. A stabilized variance version of the GLC bootstrap, based on an empirical Bayes approach, is developed for high-dimensional data. Applying these methods to clustering, we propose an inferential procedure that produces pairwise clustering probabilities.
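As a rough illustration of how an Edgeworth correction changes a tail probability, the sketch below evaluates the standard one-term expansion for the one-sample studentized mean, P(T <= x) ≈ Φ(x) + φ(x) γ (2x² + 1) / (6 √n), where γ is the population skewness. This is the textbook first-order term only, not the fifth-order generalized expansions or the software described above; the function names and the example values (skewness 2, n = 10) are assumptions made for illustration.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def normal_cdf(x: float) -> float:
    """Standard normal CDF, the zeroth-order approximation."""
    return _STD_NORMAL.cdf(x)


def edgeworth_t_cdf(x: float, skewness: float, n: int) -> float:
    """One-term Edgeworth approximation to the CDF of a one-sample t-statistic.

    Assumes the statistic is sqrt(n) * (sample mean - mu) / sample sd and
    that the population skewness is known. The result is not guaranteed to
    stay inside [0, 1] far in the tails.
    """
    correction = skewness * (2.0 * x * x + 1.0) / (6.0 * math.sqrt(n))
    return normal_cdf(x) + _STD_NORMAL.pdf(x) * correction


# Lower-tail probability at x = -3 for right-skewed data (skewness 2) and n = 10.
x, gamma, n = -3.0, 2.0, 10
print(f"normal approximation : {normal_cdf(x):.5f}")
print(f"one-term Edgeworth   : {edgeworth_t_cdf(x, gamma, n):.5f}")
```

In this example the skewness term enlarges the lower tail well beyond the normal value, which is the kind of discrepancy between nominal and actual error rates that higher-order terms are meant to capture.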