eScholarship
Open Access Publications from the University of California

Handling Incomplete High-Dimensional Multivariate Longitudinal Data with Mixed Data Types by Multiple Imputation Using a Longitudinal Factor Analysis Model

  • Author(s): Lu, Xiang
  • Advisor(s): Belin, Thomas R
  • et al.
Abstract

We developed an imputation model that addresses the missing-data problem in high-dimensional longitudinal data sets with mixed data types (continuous and ordinal), based on a factor-analysis model combined with a linear mixed-effects model. Markov chain Monte Carlo is used to fit the model, iteratively drawing parameters, latent variables, and missing values. The imputation model is implemented in an R package.
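To make the model structure concrete, the following is a minimal sketch of how one subject's data could be generated when factor scores follow a linear trend with subject-level random effects and the observed items load on those factors. It is an illustration only, not the model specification from the thesis: the function name `simulate_subject`, the loadings, the variance values, and the threshold rule for ordinal items are all assumptions chosen for the example.

```python
import random

def simulate_subject(loadings, times, thresholds, n_continuous, rng):
    """Simulate one subject's observed items at each time point.

    loadings: one list of factor loadings per item (hypothetical values).
    thresholds: cut points that turn a latent value into an ordinal category.
    The first n_continuous items stay continuous; the rest are discretized.
    """
    n_factors = len(loadings[0])
    # Assumed subject-level random intercept and slope for each factor score.
    intercepts = [rng.gauss(0.0, 1.0) for _ in range(n_factors)]
    slopes = [rng.gauss(0.5, 0.2) for _ in range(n_factors)]
    rows = []
    for t in times:
        # Mixed-effects part: factor scores follow a linear trend in time.
        scores = [a + b * t + rng.gauss(0.0, 0.3)
                  for a, b in zip(intercepts, slopes)]
        row = []
        for j, lam in enumerate(loadings):
            # Factor-analysis part: item = loadings . scores + unique noise.
            latent = sum(l * s for l, s in zip(lam, scores)) + rng.gauss(0.0, 0.5)
            if j < n_continuous:
                row.append(latent)
            else:
                # Ordinal item: number of thresholds the latent value exceeds.
                row.append(sum(latent > c for c in thresholds))
        rows.append(row)
    return rows

rng = random.Random(0)
loadings = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.3, 0.7]]  # 4 items, 2 factors
data = simulate_subject(loadings, times=[0, 1, 2],
                        thresholds=[-0.5, 0.5, 1.5],
                        n_continuous=2, rng=rng)
```

Here `data` holds one row per time point, with two continuous items followed by two ordinal items coded 0 to 3. An imputation model of this kind would treat the factor scores and any missing entries as unknowns to be drawn during MCMC.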

We tested the newly developed imputation model on simulated data sets under 32 scenarios and 2 hypothetical missing-data mechanisms. Two competing methods, PAN (Multiple Imputation for Multivariate Panel or Clustered Data) and MICE (Multivariate Imputation by Chained Equations), were tested in the same way for comparison, to show the necessity of addressing both the high dimensionality and the mix of continuous and ordinal data types.

Part of our effort went into accelerating the simulation by using C++ (a compiled language) and parallel computing on the Hoffman2 Cluster. Compared with running the simulation evaluation as an R program on a single computer, the program we use runs approximately 600 times faster.

We also tested the robustness of the newly developed imputation model when its assumptions are violated. We found that assuming fewer factors than the true number leads to invalid inferences, while assuming more factors than the true number still leads to reasonable inferences. We also found that omitting underlying quadratic trends in the factor scores harms the imputation-based inferences only when those trends are very strong. In the most unfavorable scenario we tested, where the underlying quadratic coefficient is as large as 0.8 times the linear coefficient, the actual coverage rates of 95% interval estimates start falling below 90%.
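As an illustration of how an actual coverage rate can be measured in a simulation, the sketch below pools estimates from multiply imputed data sets and counts how often the resulting interval contains the true value. It assumes Rubin's combining rules, which are standard for multiple imputation but are not named in this abstract, and it uses a normal quantile in place of the usual t reference distribution to stay within the Python standard library. The function names are hypothetical.

```python
from statistics import NormalDist, mean, variance

def pool_rubin(estimates, variances):
    """Pool m point estimates and their variances (Rubin's rules, assumed)."""
    m = len(estimates)
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # average within-imputation variance
    between = variance(estimates)    # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return q_bar, total

def interval(q_bar, total, level=0.95):
    """Interval estimate; normal quantile used as a simplification."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * total ** 0.5
    return q_bar - half_width, q_bar + half_width

def coverage_rate(intervals, truth):
    """Fraction of simulated intervals that contain the true value."""
    hits = sum(lo <= truth <= hi for lo, hi in intervals)
    return hits / len(intervals)
```

In a simulation study, `pool_rubin` and `interval` would be applied once per replicate, and `coverage_rate` over all replicates would then be compared with the nominal 95% level.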

An application to a dental data set is presented, with comparisons to PAN, NORM, and a forerunner of the newly developed method.
