Complementarity In Data Mining
- Author(s): Chang, Kung-Hua
- Advisor(s): Parker, D. Stott
- et al.
A learning problem involving classifiers and features usually has three components: representation, evaluation, and optimization. Contemporary research represents classifiers and features as initially given, and usually ranks them by classification accuracy or information gain. Similarly, the evaluation methods of contemporary ensemble/feature selection algorithms have been based on the overall classification accuracy of the combined set of classifiers or features. As for optimization, the algorithms used to find the best ensemble or feature set, such as forward/backward stepwise selection or pruning, select the set with the best classification accuracy. This selection process does not evaluate any individual classifier's (or feature's) ability to generalize, and is analogous to blindly searching for the best combination.
Our research addresses the ensemble/feature selection problem based on complementarity, a new approach that rests on different assumptions about accuracy, diversity, and generalization than contemporary methods. In this approach, we represent each classifier (feature) by its coverage of the training set, that is, as a 0-1 vote vector showing which examples in the training set are correctly classified. Through N-fold cross-validation on the training set, we identify complementary classifiers/features that generalize, and then evaluate the effectiveness of combinations of classifiers and features based on their coverage of training/cross-validated examples. More specifically, in our incremental ensemble selection method, complementarity is used to identify a classifier that maximally improves the current ensemble's vote patterns. An important aspect of this method for efficient optimization is that maximally incorrect vote patterns have the highest priority.
Thus, the optimization of ensemble/feature selection is based on selecting, in each step, the classifiers/features that cover (improve votes on) as many wrongly classified examples as possible. This approach differs from contemporary evaluation and optimization methods based on training/validation accuracy because our methods prioritize coverage of minority examples over classification accuracy. We refer to this approach as complementarity because the resulting ensemble/feature set balances coverage and generalizability. Our experiments show good results for widely studied benchmark datasets.
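
The incremental selection idea described above can be illustrated with a minimal sketch. This is not the dissertation's algorithm, only one plausible reading of it under stated assumptions: the function names `greedy_complementary_selection` and `coverage`, the seeding rule (start from the classifier with the highest coverage), the stopping rule, and the toy vote vectors are all hypothetical. Each classifier is given as a 0-1 vote vector over the training examples, and each step adds the candidate that is correct on the examples with the most accumulated incorrect votes.

```python
def greedy_complementary_selection(votes, max_size):
    """Greedily pick classifiers whose correct votes cover the examples
    that the current ensemble gets wrong most often.

    votes: dict mapping classifier name -> list of 0/1 (1 = example correct).
    max_size: maximum number of classifiers to select (assumed stopping rule).
    """
    names = sorted(votes)  # fixed order so ties are broken deterministically
    # Assumption: seed the ensemble with the classifier of highest coverage.
    first = max(names, key=lambda name: sum(votes[name]))
    selected = [first]
    # wrong_counts[i] = number of selected classifiers that miss example i.
    wrong_counts = [1 - v for v in votes[first]]

    while len(selected) < max_size:
        best, best_gain = None, 0
        for name in names:
            if name in selected:
                continue
            # Gain weights each newly correct example by how many incorrect
            # votes it already has, so the worst patterns get priority.
            gain = sum(w for w, v in zip(wrong_counts, votes[name]) if v == 1)
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:  # no candidate improves any wrongly voted example
            break
        selected.append(best)
        wrong_counts = [w + (1 - v) for w, v in zip(wrong_counts, votes[best])]
    return selected


def coverage(votes, selected):
    """Fraction of examples classified correctly by at least one selected classifier."""
    n_examples = len(votes[selected[0]])
    covered = sum(
        1 for i in range(n_examples) if any(votes[name][i] for name in selected)
    )
    return covered / n_examples


# Hypothetical vote vectors for four classifiers on six training examples.
example_votes = {
    "A": [1, 1, 1, 1, 0, 0],
    "B": [1, 1, 1, 0, 1, 0],
    "C": [0, 0, 0, 0, 1, 1],
    "D": [1, 1, 1, 1, 0, 0],
}
chosen = greedy_complementary_selection(example_votes, max_size=2)
```

In this toy data, classifier C has the lowest individual accuracy, yet it is the only one that is correct on both examples that A misses, so a coverage-driven step prefers it over the more accurate B and D. An accuracy-ranked selection would instead tend to pick near-duplicates such as A and D.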