Skip to main content
eScholarship
Open Access Publications from the University of California

UC Davis

UC Davis Electronic Theses and Dissertations bannerUC Davis

Change-point Detection for Modern Data

Abstract

Change-point detection investigates whether there are abrupt changes in distributions in sequences of observations. The goal is to partition a sequence of observations into homogeneous subsequences, which provides essential screening information for follow-up studies. As we enter the era of big data, it is commonplace to encounter sequences of high-dimensional/non-Euclidean observations. Parametric methods are often limited to those data where the parametric assumptions are reasonable. Nonparametric methods are usually more broadly applicable, but it is often hard to conduct theoretical analysis on them, such as to provide an analytic $p$-value approximation to facilitate the application to large datasets. The graph-based framework, which utilizes the edge-count information on the similarity graphs constructed on the observations, is the first kind that can be applied to these data with analytic $p$-value approximates. In this dissertation, we work out three advancements of the graph-based framework to meet the needs for modern data analysis. First, we improve the time efficiency of the algorithms by incorporating the approximate directed $k$-Nearest Neighbor ($k$-NN) graphs into the framework. Our new method is many folds faster to run and has power higher than or competitive with state-of-the-art nonparametric methods under various settings. The effectiveness of the new method is illustrated by real applications to fMRI and Neuropixels data sequences. Second, when data are autocorrelated, existing methods that assume independence could result in a higher false discovery rate. Therefore, we use the circular block permutation (CBP) framework that preserves the locally dependent structure among observations. The new framework provides proper controls on the false discovery rate when data have weak serial correlations. Third, we investigate the problem of multiple sequences of high-dimensional/non-Euclidean observations. We propose a new scan statistic that is powerful in detecting changes in all or a subset of the sequences. The new test has much higher power than existing methods and is sensitive to a wide range of alternatives. We illustrate the performance of our new test by applying to the New York Taxi Data over multiple calendar years. For all three new tests, we derive analytic formulas for $p$-value approximations to make them fast applicable.

Main Content
For improved accessibility of PDF content, download the file to your device.
Current View