- Shishegar, Rosita
- Cox, Timothy
- Rolls, David
- Bourgeat, Pierrick
- Doré, Vincent
- Lamb, Fiona
- Robertson, Joanne
- Laws, Simon M
- Porter, Tenielle
- Fripp, Jurgen
- Tosun, Duygu
- Maruff, Paul
- Savage, Greg
- Rowe, Christopher C
- Masters, Colin L
- Weiner, Michael W
- Villemagne, Victor L
- Burnham, Samantha C
To improve understanding of Alzheimer's disease, large observational studies are needed to increase power for more nuanced analyses. Combining data across existing observational studies is one solution. However, disparities between such datasets make this a non-trivial task. Here, a machine learning approach was applied to impute longitudinal neuropsychological test scores across two observational studies, namely the Australian Imaging, Biomarkers and Lifestyle Study (AIBL) and the Alzheimer's Disease Neuroimaging Initiative (ADNI), providing an overall harmonised dataset. MissForest, a machine learning algorithm, capitalises on the underlying structure and relationships in the data to impute test scores that were not measured in one study, aligning that study with the other. Results demonstrated that simulated missing values from one dataset could be accurately imputed, and that imputation of actual missing data in one dataset showed comparable discrimination (p < 0.001) for clinical classification to measured data in the other dataset. Further, the increased power of the overall harmonised dataset was demonstrated by observing a significant association between CVLT-II test scores (imputed for ADNI) and PET amyloid-β in MCI APOE-ε4 homozygotes in the imputed data (N = 65) but not in the original AIBL dataset alone (N = 11). These results suggest that MissForest can provide a practical solution for data harmonisation using imputation across studies to improve power for more nuanced analyses.
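
The imputation idea described above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses synthetic data, and it approximates a MissForest-style procedure with scikit-learn's `IterativeImputer` wrapped around a `RandomForestRegressor`, which is an assumption made here for illustration. The cohort sizes, the three synthetic "test scores", and all variable names are hypothetical. One test is withheld from the second cohort, imputed from the two shared tests, and compared against the known synthetic values, mirroring the simulated-missingness check mentioned in the abstract.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (required to expose IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical cohorts A and B; three correlated scores driven by one latent factor.
n_a, n_b = 200, 200
n = n_a + n_b
latent = rng.normal(size=n)
scores = np.column_stack([
    latent + 0.3 * rng.normal(size=n),         # test 1: measured in both cohorts
    0.8 * latent + 0.3 * rng.normal(size=n),   # test 2: measured in both cohorts
    -0.9 * latent + 0.3 * rng.normal(size=n),  # test 3: measured only in cohort A
])

# Pool the cohorts; cohort B (rows n_a onward) never administered test 3.
pooled = scores.copy()
pooled[n_a:, 2] = np.nan

# MissForest-style approximation (assumption): iterative imputation with
# a random forest as the per-feature regressor.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    max_iter=5,
    random_state=0,
)
harmonised = imputer.fit_transform(pooled)

# Because the data are synthetic, the withheld values are known and the
# imputation error can be compared with a naive mean-fill baseline.
truth = scores[n_a:, 2]
imputed = harmonised[n_a:, 2]
rmse = float(np.sqrt(np.mean((imputed - truth) ** 2)))
baseline_rmse = float(np.sqrt(np.mean((np.nanmean(pooled[:, 2]) - truth) ** 2)))
print(f"random-forest imputation RMSE: {rmse:.3f}")
print(f"mean-fill baseline RMSE:       {baseline_rmse:.3f}")
```

In this toy setting the forest-based imputation should beat the mean-fill baseline because the withheld test is strongly related to the shared tests. With real cohorts, the quality of imputation depends on how well the shared measures predict the missing one and on how similar the cohorts are.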