Big Data Science: Applying Unsupervised and Supervised Machine Learning Algorithms to Predict and Differentiate Between Vulvodynia and Healthy Controls Using High Dimensional Neuroimaging Data
- Author(s): Gordon, David
- Advisor(s): Sinsheimer, Janet
- et al.
Purpose: Because brain morphometric and functional network features are high-dimensional and multicollinear, and because the sample size in our study was small, we used a sparse partial least squares discriminant analysis (sPLSDA) algorithm to address these challenges and to select a subset of the original features for exploring the underlying mechanisms of vulvodynia that differentiate affected individuals from healthy controls. To the best of our knowledge, this is the first study to perform unsupervised and supervised machine learning on neuroimaging data among individuals diagnosed with vulvodynia.
Methods: We used a holdout procedure and performed a random 70/30 split separately for the case and healthy control data. This resulted in a training set of N=86 (Ncontrols=26, Ncases=60) and a test set of N=37 (Ncontrols=11, Ncases=26). We applied principal component analysis (PCA), partial least squares discriminant analysis (PLSDA), and sPLSDA to extract and select, from the original set of features, those that differentiate patients with vulvodynia from healthy controls. We also applied 10-fold cross-validation: the observations were split into 10 sets, and the model was repeatedly trained on 9 sets and evaluated on the remaining set. Class prediction was determined using the Mahalanobis distance metric, which utilizes a majority vote algorithm.
Results: The sPLSDA algorithm selected 30 of the 2768 original features to differentiate patients with vulvodynia from healthy controls. The specificity, sensitivity, and predictive accuracy of the sPLSDA algorithm were 89%, 73%, and 86%, respectively. The most influential selected features were functional network features, specifically the within-module degree z score and participation related coefficient metrics.
Discussion: By visualizing the results of the sPLSDA, PLSDA, and PCA algorithms, we were able to examine how well each algorithm discriminated between patients and controls, which in turn offers potential insight into the underlying mechanisms of vulvodynia, for example through the most important selected features. The predictive accuracy of sPLSDA in our study was comparable to that reported in previous neuroimaging studies that used sPLSDA and support vector machines in conditions often comorbid with vulvodynia.
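The holdout and cross-validation procedure described in the Methods can be illustrated with a minimal sketch. It is not the study's code: the function names `stratified_holdout` and `k_folds`, the random seed, and the use of rounding to set the per-class training size are all assumptions made for illustration. The class totals (37 controls, 86 cases) are simply the sums of the training and test counts reported above. Model fitting (PCA, PLSDA, sPLSDA) is not shown.

```python
import random

def stratified_holdout(labels, train_frac=0.7, seed=0):
    """Split indices into train/test sets, sampling each class separately."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(train_frac * len(idx))  # assumed rounding rule
        train += idx[:n_train]
        test += idx[n_train:]
    return sorted(train), sorted(test)

def k_folds(indices, k=10, seed=0):
    """Partition indices into k near-equal folds for cross-validation."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

# Class totals implied by the abstract: 26 + 11 controls, 60 + 26 cases.
labels = ["control"] * 37 + ["case"] * 86
train_idx, test_idx = stratified_holdout(labels)
folds = k_folds(train_idx)  # each fold is held out once; the other 9 train the model

for name, idx in (("train", train_idx), ("test", test_idx)):
    n_controls = sum(labels[i] == "control" for i in idx)
    print(name, len(idx), "controls:", n_controls, "cases:", len(idx) - n_controls)
```

Under these assumptions, a 70/30 split within each class gives the same set sizes as reported: 86 training observations (26 controls, 60 cases) and 37 test observations (11 controls, 26 cases).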