Unbalanced Data Classification Using Support Vector Machines with Active Learning on Scleroderma Lung Disease Patterns
- Author(s): Lee, Jongyoon
- Advisor(s): Wu, YingNian
- et al.
Unbalanced data classification has been a long-standing issue in medical vision science and pattern recognition. In this research, we apply Support Vector Machines (SVM) with Active Learning (AL) to improve prediction for two unbalanced classes of scleroderma lung disease patterns: Lung Fibrosis (LF) and Honeycomb (HC). We propose four SVM with AL approaches: 1) random sampling to select the initial pool that begins the AL algorithm; 2) doubling the Honeycomb training instances to reduce the imbalance ratio before the AL algorithm; 3) a balanced pool with equal numbers of Honeycomb and Lung Fibrosis instances; and 4) a balanced pool of Honeycomb and Lung Fibrosis instances combined with balanced sampling throughout the AL algorithm. Grid pixel data for Lung Fibrosis and Honeycomb were extracted from computed tomography (CT) images of 71 patients from 13 clinical centers around the United States, producing a training set of 348 HC and 3009 LF instances and a test set of 291 HC and 2665 LF instances. Compared with random sampling, SVM with AL using balanced sampling increased the sensitivity for HC by 0.56 (0.175 vs. 0.735) on the original dataset and by 0.47 (0.23 vs. 0.70) on the de-noised dataset. These results show that SVM with Active Learning paired with balanced sampling can improve prediction performance in unbalanced data classification.
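The imbalance ratio, balanced pool, and sensitivity measure mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the pool size of 50 instances per class, and the random seed are assumptions made for illustration, and the SVM and Active Learning query steps are omitted. Only the instance counts (348 HC and 3009 LF in the training set) come from the abstract.

```python
import random


def imbalance_ratio(n_majority, n_minority):
    """Ratio of majority-class to minority-class instance counts."""
    return n_majority / n_minority


def balanced_initial_pool(labels, per_class, seed=0):
    """Return indices of a pool holding `per_class` instances of each label.

    Hypothetical helper: the abstract only states that the pool has equal
    numbers of HC and LF instances, not how they are drawn.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    pool = []
    for label in sorted(by_class):
        # Sample without replacement within each class.
        pool.extend(rng.sample(by_class[label], per_class))
    return pool


def sensitivity(y_true, y_pred, positive="HC"):
    """True positive rate for the `positive` class: TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn)


# Training-set class counts reported in the abstract.
N_HC, N_LF = 348, 3009
train_labels = ["HC"] * N_HC + ["LF"] * N_LF

ratio_original = imbalance_ratio(N_LF, N_HC)      # LF instances per HC instance
ratio_doubled = imbalance_ratio(N_LF, 2 * N_HC)   # after doubling HC (approach 2)

# Pool size per class is an assumed value, not taken from the thesis.
pool = balanced_initial_pool(train_labels, per_class=50)
```

With these counts, the training set holds roughly 8.6 LF instances for every HC instance, and doubling the HC instances halves that ratio to roughly 4.3. A balanced pool removes the imbalance entirely for the first SVM fit, which is the starting condition shared by approaches 3 and 4.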