- Main
A New Ranking Scheme for High-dimensional and Non-Euclidean Data with Applications in Hypothesis Testing and Change-point Detection
- Zhou, Doudou
- Advisor(s): Chen, Hao
Abstract
Due to their robustness and efficiency, rank-based approaches are among the most popular nonparametric methods for univariate data in tackling statistical problems such as hypothesis testing and change-point detection. However, they are less explored for more complex data. In the era of big data, high-dimensional and non-Euclidean data, such as networks and images, are ubiquitous and pose challenges for statistical analysis. Existing multivariate ranks such as spatial rank, Mahalanobis rank, and component-wise rank do not apply to high-dimensional or non-Euclidean data. This dissertation tackles the problem by proposing novel rank functions applicable to complex data and applying them to the two-sample testing and change-point detection problem. Instead of dealing with the ranks of observations, we propose two types of ranks based on the observations' similarity graph: the graph-induced rank defined by the inductive nature of the graph and the overall rank defined by the weight of edges in the graph. These two new ranks are used to construct two sets of statistics for hypothesis testing and change-point detection.
For two-sample testing, we prove that the new test statistic converges to the $\chi_2^2$ distribution under the permutation null distribution, enabling an easy type-I error control. The new method exhibits good power under a wide range of alternatives compared to existing methods, as shown in simulations. The new test is illustrated on the New York City taxi data for comparing travel patterns in consecutive months and a brain network dataset comparing male and female subjects.
The graph-induced rank is further used to construct scan statistics for the change-point problem. We prove the proposed scan statistics are asymptotically distribution-free under the null hypothesis and derive the analytic $p$-value approximation. Simulation studies show that the new method works well for various changes and is robust to heavy-tailed distributions and outliers. The new method is illustrated by detecting seizures in a functional connectivity network dataset and travel pattern changes in the New York City taxi dataset.
Main Content
Enter the password to open this PDF file:
-
-
-
-
-
-
-
-
-
-
-
-
-
-