Fuzzy Forests: Extending Random Forests for Correlated, High-Dimensional Data
In this paper we introduce fuzzy forests, a novel machine learning algorithm for ranking
the importance of features in high-dimensional classification and regression problems.
Fuzzy forests is specifically designed to provide relatively unbiased rankings of variable
importance in the presence of highly correlated features, especially when p >> n. We
present our implementation of fuzzy forests in the R package fuzzyforest. Fuzzy forests
works by taking advantage of the network structure between features. First, the features
are partitioned into separate modules such that the correlation within modules is high
and the correlation between modules is low. The package fuzzyforest allows for easy use
of Weighted Gene Coexpression Network Analysis (WGCNA) to form modules of features
such that the modules are roughly uncorrelated. Then recursive feature elimination random
forests (RFE-RFs) are used on each module, separately. From the surviving features,
a final group is selected and ranked using one last round of RFE-RFs. This procedure
results in a ranked variable importance list whose size is pre-specified by the user. The
selected features can then be used to construct a predictive model.
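
The two-stage control flow described above (screen within each module, then select among the survivors) can be illustrated with a minimal Python sketch. It is not the fuzzyforest implementation, and several pieces are assumptions made for illustration: the module assignment is supplied by hand rather than computed by WGCNA, the importance score is the absolute Pearson correlation with the outcome rather than a random forest variable importance measure, and the names fuzzy_forest_sketch, recursive_elimination, keep_fraction, drop_fraction, and number_selected are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b) if var_a > 0 and var_b > 0 else 0.0


def abs_corr_importance(columns, y, names):
    """Stand-in importance score: |corr(feature, y)|. NOT a random forest importance."""
    return {name: abs(pearson(columns[name], y)) for name in names}


def recursive_elimination(columns, y, names, keep, drop_fraction, importance):
    """Re-score the remaining features and drop the lowest-ranked fraction until `keep` remain."""
    names = list(names)
    while len(names) > keep:
        scores = importance(columns, y, names)
        names.sort(key=scores.get, reverse=True)
        n_drop = max(1, int(drop_fraction * len(names)))
        names = names[:max(keep, len(names) - n_drop)]
    scores = importance(columns, y, names)
    return sorted(names, key=scores.get, reverse=True)


def fuzzy_forest_sketch(columns, y, modules, keep_fraction, number_selected,
                        drop_fraction=0.25, importance=abs_corr_importance):
    """Screen each module separately, then run one last elimination round on the survivors."""
    survivors = []
    for members in modules.values():
        keep = max(1, math.ceil(keep_fraction * len(members)))
        survivors += recursive_elimination(columns, y, members, keep,
                                           drop_fraction, importance)
    return recursive_elimination(columns, y, survivors, number_selected,
                                 drop_fraction, importance)


# Toy data: two modules of 10 features, each module driven by its own latent variable,
# so correlation is high within a module and low between modules.
rng = random.Random(0)
n = 200
columns, modules = {}, {}
for m in ("m1", "m2"):
    latent = [rng.gauss(0, 1) for _ in range(n)]
    modules[m] = [f"{m}_f{j}" for j in range(10)]
    for name in modules[m]:
        columns[name] = [v + 0.3 * rng.gauss(0, 1) for v in latent]

# The outcome depends on one feature from each module, plus noise.
y = [a + b + 0.1 * rng.gauss(0, 1)
     for a, b in zip(columns["m1_f0"], columns["m2_f0"])]

selected = fuzzy_forest_sketch(columns, y, modules,
                               keep_fraction=0.3, number_selected=4)
print(selected)
```

Because the stand-in score is marginal, re-scoring after each elimination step does not change the ranking here. With a random forest importance measure passed as the importance argument, each refit would see fewer correlated competitors, which is the point of the recursive scheme.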