Longitudinal studies play a prominent role in health, social, and behavioral sciences as well as in the biological sciences, economics, and marketing. By following subjects over time, temporal changes in an outcome of interest can be directly observed and studied. An important question concerns the existence of distinct trajectory patterns. One way to discover potential patterns in the data is through cluster analysis, which seeks to separate objects (individuals, subjects, patients, observational units) into homogeneous groups. There are many ways to cluster multivariate data. Most methods fall into one of two approaches: nonparametric and model-based. The first approach makes no assumptions about how the data were generated and produces a sequence of clustering results indexed by the number of clusters k=2,3,... and the choice of dissimilarity measure. The latter approach assumes data vectors are generated from a finite mixture of distributions. The bulk of the available clustering algorithms are intended for data vectors with exchangeable, independent elements and are not appropriate for direct application to repeated measures with inherent dependence.
Multivariate Gaussian mixtures are a class of models that provide a flexible parametric approach for the representation of heterogeneous multivariate outcomes. When the outcome is a vector of repeated measurements taken on the same subject, there is often inherent dependence between observations. However, a common covariance assumption is conditional independence: given the mixture component label, the repeated outcomes for a subject are independent. In Chapter 2, I study, through asymptotic bias calculations and simulation, the impact of covariance misspecification in multivariate Gaussian mixtures. Although maximum likelihood estimators of regression and prior probability parameters are not consistent under misspecification, they have little asymptotic bias when the mixture components are well separated or when the assumed correlation is close to the truth. I also present a robust standard error estimator and show that it outperforms conventional estimators in simulations and can provide evidence that the model is misspecified.
The main goal of a longitudinal study is to observe individual change over time; therefore, observed trajectories have two prominent features: the level and the shape of change over time. These features are typically associated with baseline characteristics of the individual. Grouping by shape and level separately provides an opportunity to detect and estimate these relationships. Although many nonparametric and model-based methods have been adapted for longitudinal data, most fail to explicitly group individuals according to the shape of their repeated measure trajectory. Some methods are thought to group by shape, but the dissimilarity between trajectories is not defined in terms of any one specific feature of the data. Rather, the methods are based on the entire vector and cluster trajectories by level because it tends to dominate the variability between data vectors. These methods discover shape groups only if level and shape are correlated.
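To illustrate why level tends to dominate a whole-vector dissimilarity, the following minimal sketch splits the squared Euclidean distance between two trajectories into a level term and a shape term, using the identity that the squared distance equals m times the squared difference in means plus the squared distance between the mean-centered vectors. The function name and the three toy trajectories are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
import numpy as np


def split_squared_distance(x, y):
    """Split ||x - y||^2 into a level part and a shape part.

    level = m * (mean(x) - mean(y))^2
    shape = ||(x - mean(x)) - (y - mean(y))||^2
    The two parts sum to the squared Euclidean distance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = x.size
    level = m * (x.mean() - y.mean()) ** 2
    shape = np.sum(((x - x.mean()) - (y - y.mean())) ** 2)
    return level, shape


# Hypothetical trajectories observed at five time points.
t = np.arange(5.0)
rising_low = 1.0 + 0.5 * t     # increasing, low level
falling_low = 3.0 - 0.5 * t    # decreasing, same mean as rising_low
rising_high = 11.0 + 0.5 * t   # increasing, high level

# Same level, opposite shape.
level_a, shape_a = split_squared_distance(rising_low, falling_low)
# Same shape, different level.
level_b, shape_b = split_squared_distance(rising_low, rising_high)

print("opposite shape, same level:", level_a, shape_a)
print("same shape, different level:", level_b, shape_b)
```

In this toy case the two increasing trajectories are far apart in Euclidean distance (all of it due to level), while the increasing and decreasing trajectories are comparatively close, so a method based on the full vector would pair trajectories with opposite shapes.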
To fulfill the need for clustering based explicitly on shape, I propose three methods in Chapter 4 that are adaptations of available algorithms. The first approach uses a dissimilarity measure based on estimated derivatives of the functions underlying the trajectories. One challenge for this approach is estimating the derivatives with minimal bias and variance. The second approach uses a mixture model to explicitly model the variability in level within a group of similarly shaped trajectories, resulting in a multilayer mixture model. One difficulty with this method lies in choosing the number of shape clusters. The third approach vertically shifts the data by subtracting the subject-specific mean, which directly removes the level prior to modeling. This non-invertible transformation can result in singular covariance matrices, which makes parameter estimation difficult. In theory, all of these methods should cluster based on shape, but each method has shortfalls. I compare these methods with existing clustering methods in a simulation study in Chapter 5 and find that the vertically shifted mixture model outperforms the existing and other proposed methods.
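The vertical shift and the singularity it induces can be sketched as follows. This is a minimal illustration on simulated data, not the procedure or data used in the thesis: the sample size, number of time points, slopes, and noise level are all assumptions made for the example. Because every shifted trajectory sums to zero, the sample covariance of the shifted data has the vector of ones in its null space.

```python
import numpy as np

# Simulated trajectories (all settings below are illustrative assumptions).
rng = np.random.default_rng(0)
n, m = 200, 5                                  # subjects, time points
t = np.arange(m, dtype=float)
level = rng.normal(0.0, 5.0, size=(n, 1))      # subject-specific level
slope = rng.choice([-1.0, 1.0], size=(n, 1))   # two shape groups
y = level + slope * t + rng.normal(0.0, 0.3, size=(n, m))

# Vertical shift: subtract each subject's own mean across time.
shifted = y - y.mean(axis=1, keepdims=True)

# Each shifted row sums to zero, so the transformation loses one dimension.
row_sums = shifted.sum(axis=1)

# Sample covariance across time points (columns are variables).
cov = np.cov(shifted, rowvar=False)

# Eigenvalues in ascending order; the smallest is zero up to rounding error.
eigenvalues = np.linalg.eigvalsh(cov)

print("largest absolute row sum:", np.abs(row_sums).max())
print("eigenvalues of shifted covariance:", eigenvalues)
```

One consequence is that an unconstrained m-dimensional Gaussian density cannot be evaluated directly on the shifted vectors, which is one way to see why parameter estimation becomes difficult after this transformation.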
A subset of the clustering methods is then compared in Chapter 6 on a real data set of childhood growth trajectories from the Center for the Health Assessment of Mothers and Children of Salinas (CHAMACOS) study. Vertically shifting the data prior to fitting a mixture model groups children by the shape of their growth over time, in contrast to the standard mixture model assuming either conditional independence or a more general correlation. The group means do not change drastically between methods for this data set, but group membership differs enough to impact inference about the relationship between baseline covariates and distinct groups.