This dissertation addresses two statistical problems in which noisy data are handled with the aid of additional knowledge. My purpose is to highlight that, in the era of big data, a growing number of complicated problems have a low signal-to-noise ratio and cannot be solved simply by applying existing statistical or machine learning methods. For instance, biological data are notorious for combining a limited sample size with a large number of features (a typical p ≫ n problem). Fortunately, additional knowledge from experts or domain insights is often available and can be employed to devise methods that are better suited to such noisy data.
Chapter 2 discusses my work, supervised by Professor Haiyan Huang, on hierarchical multi-label classification. This project is motivated by automatic disease diagnosis, where we aim to predict a patient’s disease status from a limited number of samples for each disease. The structural information that describes the relationships between diseases can mitigate the low signal-to-noise ratio. We introduce a new statistic, the multidimensional local precision rate (mLPR), defined for each object in each class. We show that classification decisions made by simply sorting objects across classes in descending order of their mLPRs, in theory, respect the class hierarchy while maximizing CATCH, a pre-defined performance metric related to the area under a hit curve. In practice, the mLPRs must be estimated from data. Ranking the objects across classes in descending order of the estimated mLPRs, however, no longer guarantees that CATCH is optimized or that the class hierarchy is respected. In response, we introduce a new ranking algorithm, HierRank, which optimizes an empirical version of CATCH defined in terms of the estimated mLPRs. The rankings produced by HierRank are guaranteed to satisfy the hierarchical constraint. The superior performance of our approach over state-of-the-art methods in the literature is demonstrated on a synthetic dataset and two real datasets.
Chapter 3 discusses my work, supervised by Professor Peter J. Bickel, on the binomial mixture model with a U-shape constraint, in the regime where the binomial size m can be relatively large compared to the sample size n. This project is motivated by the GeneFishing method (Liu et al., 2019), whose output combines the parameter of interest with subsampling noise. To handle the noise in the output, we use the observation that the density of the output has a U shape, and we model the output with a binomial mixture model under a U-shape constraint. We first analyze the estimation of the underlying distribution F in the binomial mixture model under various conditions on F. Equipped with this theoretical understanding, we propose a simple method, Ucut, to identify the cutoffs of the U shape and to recover the underlying distribution based on the Grenander estimator. It is shown that when m = Ω(n), the identified cutoffs converge at the rate O(n^{−1/3}). The L1 distance between the recovered distribution and the true one decreases at the same rate. To demonstrate its performance, we apply our method to a variety of simulation studies, a GTEx dataset used in Liu et al. (2019), and a single-cell dataset from Tabula Muris.
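For readers unfamiliar with the Grenander estimator mentioned above, the following is a minimal sketch of the classical construction for a non-increasing density on (0, ∞): the estimate is the left derivative of the least concave majorant of the empirical CDF. It is not an implementation of Ucut, and it does not handle the binomial noise or the cutoff selection. The function name `grenander` and its return format are assumptions made for illustration, and the samples are assumed to be strictly positive.

```python
import numpy as np


def grenander(samples):
    """Classical Grenander estimate of a non-increasing density on (0, inf).

    Hypothetical helper for illustration only. Returns (knots, heights):
    the estimate equals heights[k] on the interval (knots[k], knots[k+1]].
    Assumes all samples are strictly positive.
    """
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    # Vertices of the empirical CDF, starting from the origin.
    px = np.concatenate(([0.0], x))
    py = np.arange(n + 1) / n

    # Least concave majorant = upper convex hull of the vertices,
    # built left to right with a stack of vertex indices.
    hull = [0]
    for i in range(1, n + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # A non-negative cross product means vertex b lies on or below
            # the chord from a to i, so b is not part of the majorant.
            cross = (px[b] - px[a]) * (py[i] - py[a]) \
                - (py[b] - py[a]) * (px[i] - px[a])
            if cross >= 0:
                hull.pop()
            else:
                break
        hull.append(i)

    knots = px[hull]
    # Slopes of the majorant between consecutive knots are the density values.
    heights = np.diff(py[hull]) / np.diff(knots)
    return knots, heights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    knots, heights = grenander(rng.exponential(size=500))
    print(knots[:5], heights[:5])
```

The returned step function is non-increasing by construction and integrates to one over the range of the data, which is what makes it a natural building block when a density is known to be monotone on a given region.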