eScholarship
Open Access Publications from the University of California

Sequential Bayesian Regression for Multiple Imputation and Conditional Editing

  • Author(s): Jeffries, Robin Angela
  • Advisor(s): Weiss, Robert E
  • et al.
Abstract

Analysts faced with errors in data apply editing rules to fix the erroneous values. These edits are assigned deterministically and may not be correct in every case. This dissertation presents a unified method to multiply impute missing data and multiply edit erroneous data using a sequence of Bayesian regression models. The techniques used to multiply edit erroneous data exactly parallel the multiple imputation techniques used to handle missing data. The models presented allow for different data types subject to several error mechanisms.

This method is called Sequential Bayesian Regression for Multiple Imputation and Conditional Editing (SyBRMICE), and it creates multiple fully imputed and edited data sets. The desired analyses are performed separately on each complete, consistently edited and imputed data set. Results from these analyses are combined using the same combining rules used in multiple imputation. The resulting parameter estimates and intervals will then correctly account for the errors incurred in both the data editing and imputation processes.
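The combining step can be illustrated with the standard multiple imputation combining rules (commonly known as Rubin's rules). The following is a minimal sketch, not code from the dissertation: the function name `pool_estimates` and its arguments are hypothetical, and it assumes a scalar parameter whose point estimate and squared standard error are available from each of the m completed data sets.

```python
import math
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Pool a scalar estimate across m completed data sets.

    estimates: point estimates, one per completed data set
    variances: squared standard errors, one per completed data set
    Returns (pooled estimate, total variance, degrees of freedom).
    Hypothetical helper for illustration only.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need matching lists from at least two data sets")

    q_bar = mean(estimates)      # pooled point estimate
    u_bar = mean(variances)      # average within-data-set variance
    b = variance(estimates)      # between-data-set variance (divisor m - 1)
    extra = (1 + 1 / m) * b      # variance added by imputing and editing
    total = u_bar + extra

    # Reference t distribution degrees of freedom; infinite when the
    # completed data sets agree exactly on the estimate.
    df = math.inf if extra == 0 else (m - 1) * (1 + u_bar / extra) ** 2
    return q_bar, total, df


# Made-up inputs, used only to show the call.
q, t, df = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-data-set term is what carries the extra variability from the stochastic edits and imputations into the final interval; a single deterministically edited data set has no such term.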

Development of SyBRMICE was motivated by data from Project Connect (PC). Project Connect was an 8-year longitudinal intervention study aiming to reduce teen pregnancy and STD rates in select middle and high schools in the Los Angeles area. Survey data were collected annually to measure the effectiveness of the interventions. A paper survey was administered to the students as a group in the classroom, and the student responses contain both missing and erroneous data.

The Project Connect survey was administered annually for five years. A subset of students participated in multiple years, resulting in repeated answers to the same question by the same student. Data errors found in the PC survey data can be categorized as belonging to one of several error types. If a variable that should remain constant over time, such as gender, is observed to differ across surveys, the variable is said to have an inconsistent longitudinal response. If a variable that should increase monotonically over time, such as age or ever having had sexual intercourse, is observed to have a non-monotonic reporting pattern, the variable is said to have an inconsistent monotonic longitudinal response. Lastly, if the responses to two or more related variables give conflicting information, these variables are said to have an inconsistent multiple response.
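The three error types can be made concrete with simple checks. The sketch below only flags inconsistencies; it does not perform the stochastic editing that the dissertation develops. All function names, variable names, and example values are hypothetical, and the monotonic check treats "should increase" as "should never decrease", which suits a yes/no item such as ever having had intercourse.

```python
def has_inconsistent_constant(responses):
    """True if a variable that should stay constant takes more than one
    observed value across survey years (None marks a missing response)."""
    observed = [r for r in responses if r is not None]
    return len(set(observed)) > 1


def has_inconsistent_monotonic(responses):
    """True if a variable that should never decrease over time does so
    between consecutive observed survey years."""
    observed = [r for r in responses if r is not None]
    return any(later < earlier
               for earlier, later in zip(observed, observed[1:]))


def has_inconsistent_multiple(record, rule):
    """True if related answers within one survey violate a consistency
    rule. `rule` is a caller-supplied function returning True when the
    record is logically consistent."""
    return not rule(record)


# Hypothetical responses for one student across three survey years.
gender_by_year = ["F", "F", "M"]   # should be constant
ever_sex_by_year = [0, 1, 0]       # 1 = yes; should never return to 0

# Hypothetical within-survey rule: a student who reports never having
# had intercourse should not report any partners.
def partners_rule(rec):
    return not (rec["ever_sex"] == 0 and rec["num_partners"] > 0)

flags = (
    has_inconsistent_constant(gender_by_year),
    has_inconsistent_monotonic(ever_sex_by_year),
    has_inconsistent_multiple({"ever_sex": 0, "num_partners": 2},
                              partners_rule),
)
```

Missing responses are skipped rather than flagged, since missingness is handled by imputation and is a separate problem from inconsistency.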

Models to stochastically edit each of the three types of erroneous data are presented. The inconsistent repeated measures, inconsistent monotone longitudinal, and inconsistent multivariate models are developed separately and then combined as steps in an example of the larger unifying SyBRMICE procedure. The examples demonstrate the flexibility and customizability of the SyBRMICE procedure. Results from an analysis performed on the multiple complete and consistent data sets generated by the SyBRMICE procedure are compared to results from the same analysis performed on a single deterministically edited, complete-case data set.
