Clustering and Registration of Functional Data with Applications in Time Course Genomics Data
- Author(s): Zhang, Yafeng
- Advisor(s): Telesca, Donatello
- Horvath, Steve
- et al.
Functional data analysis aims to provide statistical inference for stochastic processes defined over a functional space. Typical data sources modeled with functional data analytic techniques include nonlinear longitudinal data in biomedicine, image and spatial data, and space-time data. This dissertation is mainly concerned with the analysis of data arising from bio-molecular processes evolving over time. Specifically, we consider functional data conceptualized as random curves defined over a time domain. Two important techniques used in the analysis of functional data are clustering and registration. Functional data clustering aims to identify subgroups of curves with similar shapes and to estimate a representative mean curve for each cluster. When applied to time course genomics data, functional data clustering identifies clusters of genes sharing similar temporal profiles. These clusters are likely to consist of genes involved in the same biological processes and functions. Functional data registration methods align curves exhibiting phase variability (i.e., variation in the timing of features across curves). After alignment, a common shape function can be estimated consistently to represent the overall pattern shared by all curves. However, when curves show both systematic shape differences and phase variability, neither functional data registration nor clustering alone is appropriate for data analysis. Motivated by applications in time course genomics data, we propose a joint model for functional data clustering and registration. The proposed method integrates reproducing representations of functions in the framework of Dirichlet process mixtures. Simulation and case studies on real datasets show that our model is able to correctly cluster and register curves simultaneously.
We explore several methodological alternatives in both synthetic and case study scenarios and show that jointly accounting for registration and clustering produces more accurate and interpretable inference.